├── LICENSE
├── README.md
├── illustration.png
├── logo.png
├── mt_task.py
├── predictions
│   └── llama_flores
│       ├── en-de
│       │   ├── Llama-2-13b-chat-0shot-baseline.de
│       │   ├── Llama-2-13b-chat-0shot-contrastive-en-0.1.de
│       │   ├── Llama-2-13b-chat-0shot-contrastive-en-0.3.de
│       │   ├── Llama-2-13b-chat-0shot-contrastive-en-0.5.de
│       │   ├── Llama-2-13b-chat-0shot-contrastive-en-0.7.de
│       │   ├── Llama-2-13b-chat-0shot-contrastive-en-0.9.de
│       │   ├── Llama-2-13b-chat-1shot-baseline.de
│       │   ├── Llama-2-13b-chat-1shot-contrastive-en-0.1.de
│       │   ├── Llama-2-13b-chat-1shot-contrastive-en-0.3.de
│       │   ├── Llama-2-13b-chat-1shot-contrastive-en-0.5.de
│       │   ├── Llama-2-13b-chat-1shot-contrastive-en-0.7.de
│       │   ├── Llama-2-13b-chat-1shot-contrastive-en-0.9.de
│       │   ├── Llama-2-7b-chat-0shot-baseline.de
│       │   ├── Llama-2-7b-chat-0shot-contrastive-en-0.1.de
│       │   ├── Llama-2-7b-chat-0shot-contrastive-en-0.3.de
│       │   ├── Llama-2-7b-chat-0shot-contrastive-en-0.5.de
│       │   ├── Llama-2-7b-chat-0shot-contrastive-en-0.7.de
│       │   ├── Llama-2-7b-chat-0shot-contrastive-en-0.9.de
│       │   ├── Llama-2-7b-chat-1shot-baseline.de
│       │   ├── Llama-2-7b-chat-1shot-contrastive-en-0.1.de
│       │   ├── Llama-2-7b-chat-1shot-contrastive-en-0.3.de
│       │   ├── Llama-2-7b-chat-1shot-contrastive-en-0.5.de
│       │   ├── Llama-2-7b-chat-1shot-contrastive-en-0.7.de
│       │   └── Llama-2-7b-chat-1shot-contrastive-en-0.9.de
│       └── en-fr
│           ├── Llama-2-13b-chat-0shot-baseline.fr
│           ├── Llama-2-13b-chat-0shot-contrastive-en-0.1.fr
│           ├── Llama-2-13b-chat-0shot-contrastive-en-0.3.fr
│           ├── Llama-2-13b-chat-0shot-contrastive-en-0.5.fr
│           ├── Llama-2-13b-chat-0shot-contrastive-en-0.7.fr
│           ├── Llama-2-13b-chat-0shot-contrastive-en-0.9.fr
│           ├── Llama-2-13b-chat-1shot-baseline.fr
│           ├── Llama-2-13b-chat-1shot-contrastive-en-0.1.fr
│           ├── Llama-2-13b-chat-1shot-contrastive-en-0.3.fr
│           ├── Llama-2-13b-chat-1shot-contrastive-en-0.5.fr
│           ├── Llama-2-13b-chat-1shot-contrastive-en-0.7.fr
│           ├── Llama-2-13b-chat-1shot-contrastive-en-0.9.fr
│           ├── Llama-2-7b-chat-0shot-baseline.fr
│           ├── Llama-2-7b-chat-0shot-contrastive-en-0.1.fr
│           ├── Llama-2-7b-chat-0shot-contrastive-en-0.3.fr
│           ├── Llama-2-7b-chat-0shot-contrastive-en-0.5.fr
│           ├── Llama-2-7b-chat-0shot-contrastive-en-0.7.fr
│           ├── Llama-2-7b-chat-0shot-contrastive-en-0.9.fr
│           ├── Llama-2-7b-chat-1shot-baseline.fr
│           ├── Llama-2-7b-chat-1shot-contrastive-en-0.1.fr
│           ├── Llama-2-7b-chat-1shot-contrastive-en-0.3.fr
│           ├── Llama-2-7b-chat-1shot-contrastive-en-0.5.fr
│           ├── Llama-2-7b-chat-1shot-contrastive-en-0.7.fr
│           └── Llama-2-7b-chat-1shot-contrastive-en-0.9.fr
├── requirements.txt
├── scripts
│   ├── run.py
│   └── utils_run.py
├── tests
│   ├── __init__.py
│   ├── test_contrastive_target_id.py
│   └── test_llama.py
└── translation_models
    ├── __init__.py
    ├── llama.py
    ├── m2m100.py
    ├── small100.py
    ├── tokenization_small100.py
    ├── utils.py
    └── utils_llama.py
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Alireza Mohammadshahi
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Contrastive Decoding
2 |
3 | [](#python)
4 | [](https://arxiv.org/abs/2309.07098)
5 | [](https://opensource.org/licenses/MIT)
6 |
7 |
8 |
9 |
10 |
11 | This repository implements source-contrastive and language-contrastive decoding, as described in [Sennrich et al. (EACL 2024)](https://arxiv.org/abs/2309.07098).
12 |
13 | - In **source-contrastive decoding**, we search for a translation that maximizes P(_Y_|_X_) - λ·P(_Y_|_X'_), where _X'_ is a random source segment. This penalizes hallucinations.
14 |
15 | - In **language-contrastive decoding**, we search for a translation that maximizes P(_Y_|_X_,_l_y_) - λ·P(_Y_|_X_,_l_y'_), where _l_y_ is the language indicator for the desired target language, _l_y'_ the indicator for some undesired language (such as English or the source language). This penalizes off-target translations.
16 |
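At decoding time, the model is run on all inputs in parallel (the true source plus the shuffled sources and/or the variants with a wrong target-language indicator), and their next-token probabilities are mixed with a weight of 1 for the true input and -λ for the contrastive ones before the next token is chosen. The snippet below is a minimal, self-contained sketch of that combination step; the actual implementation is the `EnsembleLogitsProcessor` in `translation_models/m2m100.py`.

```python
import torch
import torch.nn.functional as F

def combine_contrastive_scores(logits: torch.Tensor, weights: list) -> torch.Tensor:
    """Combine next-token logits of parallel inputs into one distribution.

    logits:  (num_inputs, vocab_size), one row per input variant
    weights: e.g. [1.0, -0.7] for the true source and one contrastive source
    """
    probs = F.softmax(logits, dim=-1)              # (num_inputs, vocab_size)
    w = torch.tensor(weights).unsqueeze(-1)        # (num_inputs, 1)
    combined = torch.log((w * probs).sum(dim=0))   # log(sum_i w_i * p_i)
    # Weighted sums can be negative; treat such tokens as impossible.
    return torch.nan_to_num(combined, nan=float("-inf"))
```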
17 |
18 |
19 |
20 |
21 | ## Installation
22 |
23 | - `pip install -r requirements.txt`
24 |
25 | ## Usage
26 |
27 | **Example commands**
28 |
29 | Source-contrastive decoding with [M2M-100 (418M)](https://arxiv.org/abs/2010.11125) on Asturian–Croatian, with λ_src=0.7:
30 | - `python -m scripts.run --model_path m2m100_418M --language_pairs ast-hr --source_contrastive --source_weight -0.7`
31 |
32 | Source-contrastive and language-contrastive decoding with [SMaLL-100](https://arxiv.org/abs/2210.11621) on Pashto–Asturian, with 2 random source segments, λ_src=0.7, λ_lang=0.1, and English and Pashto as contrastive target languages:
33 | - `python -m scripts.run --model_path small100 --language_pairs ps-ast --source_contrastive 2 --source_weight -0.7 --language_contrastive en ps --language_weight -0.1`
34 |
35 | Language-contrastive decoding with [Llama 2 Chat (7B)](https://arxiv.org/abs/2307.09288) on English–German, with λ_lang=0.5 and English as contrastive target language, using prompting with a one-shot example:
36 | - `python -m scripts.run --model_path llama-2-7b-chat --language_pairs en-de --language_contrastive en --language_weight -0.5 --oneshot`
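
Running without `--source_contrastive` and `--language_contrastive` produces the plain (direct) baseline translations, e.g.:
- `python -m scripts.run --model_path m2m100_418M --language_pairs ast-hr`

Outputs are written to `out/flores/<src>-<tgt>/` relative to the repository root (the `out/` directory is expected to exist; see `mt_task.py`), alongside a `ref.txt` file with the reference translations.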
37 |
38 | ## Dataset and Models
39 |
40 | This repository automatically downloads and uses [FLORES-101](https://huggingface.co/datasets/gsarti/flores_101) for evaluation; the `devtest` split is used.
41 |
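For reference, the data is loaded as in `mt_task.py` (German shown as an example; the two-letter language codes given on the command line are mapped to FLORES-101 codes via `scripts/utils_run.py`):

```python
from datasets import load_dataset

# "deu" is the FLORES-101 code that FLORES101_CONVERT maps "de" to
sentences = load_dataset("gsarti/flores_101", "deu")["devtest"]["sentence"]
print(len(sentences), sentences[0])
```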
42 | Multiple models are implemented:
43 |
44 | - [M2M-100 (418M)](https://huggingface.co/facebook/m2m100_418M). Use `--model_path m2m100_418M`
45 | - [SMaLL-100](https://huggingface.co/alirezamsh/small100). Use `--model_path small100`
46 | - [Llama 2 7B Chat](https://huggingface.co/meta-llama). Use `--model_path llama-2-7b-chat` or `llama-2-13b-chat`
47 |
48 |
49 | ## Evaluation
50 |
51 | ChrF2:
52 | ```
53 | sacrebleu ref.txt < output.txt --metrics chrf
54 | ```
55 |
56 |
57 | spBLEU:
58 | ```
59 | sacrebleu ref.txt < output.txt --tokenize flores101
60 | ```
61 |
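The same scores can also be computed from Python via sacrebleu's API. A minimal sketch, using the same placeholder file names as above:

```python
from sacrebleu.metrics import BLEU, CHRF

with open("ref.txt") as f:
    refs = [line.strip() for line in f]
with open("output.txt") as f:
    hyps = [line.strip() for line in f]

print(CHRF().corpus_score(hyps, [refs]))                      # chrF2 (sacrebleu defaults)
print(BLEU(tokenize="flores101").corpus_score(hyps, [refs]))  # spBLEU
```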
62 |
63 | ## Reference
64 |
65 | ```bibtex
66 | @inproceedings{sennrich-etal-2024-mitigating,
67 | title={Mitigating Hallucinations and Off-target Machine Translation with Source-Contrastive and Language-Contrastive Decoding},
68 | author={Rico Sennrich and Jannis Vamvas and Alireza Mohammadshahi},
69 | booktitle={18th Conference of the European Chapter of the Association for Computational Linguistics},
70 | year={2024}
71 | }
72 | ```
73 |
--------------------------------------------------------------------------------
/illustration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZurichNLP/ContraDecode/261cb0de634414faf2d1d5b856b437988c2e462b/illustration.png
--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZurichNLP/ContraDecode/261cb0de634414faf2d1d5b856b437988c2e462b/logo.png
--------------------------------------------------------------------------------
/mt_task.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import subprocess
3 | import tempfile
4 | import random
5 | import copy
6 | from pathlib import Path
7 | from scripts.utils_run import FLORES101_CONVERT
8 | from sacrebleu import get_source_file
9 | from datasets import load_dataset
10 | from tqdm import tqdm
11 | import os
12 |
13 | class MTTask:
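"""
One translation direction (src_lang -> tgt_lang) on the FLORES-101 devtest split.
System outputs and a ref.txt reference file are written to out/<testset>/<src>-<tgt>/
relative to this file.
"""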
14 |
15 | def __init__(self,
16 | src_lang: str,
17 | tgt_lang: str,
18 | testset: str,
19 | ):
20 | self.src_lang = src_lang
21 | self.tgt_lang = tgt_lang
22 | self.language_pair = f"{src_lang}-{tgt_lang}"
23 | self.testset = testset
24 | base_out_dir = Path(__file__).parent / "out"
25 | print(base_out_dir)
26 | assert base_out_dir.exists()
27 | self.out_dir = base_out_dir / self.testset
28 | self.out_dir.mkdir(exist_ok=True)
29 |
30 | self.out_dir = self.out_dir / self.language_pair
31 | self.out_dir.mkdir(exist_ok=True)
32 | self.load_converter = FLORES101_CONVERT
33 |
34 | def __str__(self):
35 | return f"{self.testset}-{self.src_lang}-{self.tgt_lang}"
36 |
37 | def evaluate(self, translation_method: callable, type='direct', source_contrastive=1, source_weight=None, language_contrastive=None, language_weight=None) -> Path:
38 |
39 | ## load FLORES dataset
40 | source_sentences = load_dataset('gsarti/flores_101',self.load_converter[self.src_lang])['devtest']['sentence']
41 |
42 | if type == 'direct':
43 | translations = translation_method(
44 | src_lang=self.src_lang,
45 | tgt_lang=self.tgt_lang,
46 | source_sentences=source_sentences,
47 | )
48 | elif type == 'contrastive':
49 | multi_source_sentences = [source_sentences]
50 | src_weights = [1]
51 | tgt_langs=[self.tgt_lang]
52 | src_langs=[self.src_lang]
53 |
54 | # randomly shuffled input to suppress hallucinations
55 | if source_contrastive:
56 | for i in range(source_contrastive):
57 | shuffled_sentences = copy.copy(source_sentences)
58 | random.shuffle(shuffled_sentences)
59 | multi_source_sentences.append(shuffled_sentences)
60 | src_weights.append(source_weight/source_contrastive)
61 | tgt_langs.append(self.tgt_lang)
62 | src_langs.append(self.src_lang)
63 |
64 | # input with wrong target language indicator to suppress off-target translation
65 | if language_contrastive:
66 | for offtarget in language_contrastive:
67 | # ignore contrastive variants that are identical to true translation direction
68 | if offtarget == self.tgt_lang:
69 | continue
70 | # don't create contrastive variant for src language if language is already listed (avoid duplicates)
71 | if offtarget == 'src' and self.src_lang in language_contrastive:
72 | continue
73 | multi_source_sentences.append(source_sentences)
74 | src_weights.append(language_weight)
75 | if offtarget == 'src':
76 | tgt_langs.append(self.src_lang)
77 | else:
78 | tgt_langs.append(offtarget)
79 | src_langs.append(self.src_lang)
80 |
81 | translations = []
82 | for pair in tqdm(list(zip(*multi_source_sentences))):
83 | translation = translation_method(
84 | src_langs=src_langs,
85 | tgt_langs=tgt_langs,
86 | src_weights=src_weights,
87 | multi_source_sentences=pair,
88 | )
89 | translations.append(translation)
90 | else:
91 | raise NotImplementedError
92 |
93 | if type == 'direct':
94 | file_name = 'direct'
95 | elif type == 'contrastive':
96 | file_name = 'contrastive-{0}-{1}'.format(source_contrastive, source_weight)
97 | if language_contrastive:
98 | file_name += "-lang-{0}-{1}".format('+'.join(language_contrastive), language_weight)
99 | else:
100 | raise NotImplementedError
101 |
102 | with open(str(self.out_dir)+"/"+file_name+".txt", 'w') as f:
103 | f.write("\n".join(translations))
104 |
105 | if not os.path.isfile(str(self.out_dir)+"/"+"ref.txt"):
106 | target_sentences = load_dataset('gsarti/flores_101', self.load_converter[self.tgt_lang])['devtest'][
107 | 'sentence']
108 | with open(str(self.out_dir) + "/" + "ref.txt", 'w') as f:
109 | f.write("\n".join(target_sentences))
110 |
111 | return Path(str(self.out_dir)+"/"+file_name+".txt")
112 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | datasets==2.14.4
2 | transformers==4.33.1
3 | sentencepiece==0.1.97
4 | sacrebleu==2.3.1
5 | accelerate==0.22.0
6 | bitsandbytes==0.41.1 # LLaMa 4bit
7 | scipy==1.11.2
8 |
--------------------------------------------------------------------------------
/scripts/run.py:
--------------------------------------------------------------------------------
1 | import logging
2 |
3 | import translation_models
4 | from mt_task import MTTask
5 | from translation_models import load_translation_model
6 | from scripts.utils_run import FLORES101_CONVERT, M2M100_TYPES, IGNORE_FOR_FLORES, FAMILIES
7 | from itertools import combinations
8 | import argparse
9 |
10 | logging.basicConfig(level=logging.INFO)
11 |
12 | def main(args):
13 |
14 | model = load_translation_model(args.model_path, device=0)
15 | language_pairs = args.language_pairs.split(',')
16 | language_pairs = [x.split("-") for x in language_pairs]
17 |
18 | if args.oneshot:
19 | assert isinstance(model, translation_models.llama.LLaMaTranslationModel)
20 | model.one_shot = True
21 |
22 | tasks = []
23 |
24 | for lang_pair in language_pairs:
25 | tasks.append(MTTask(lang_pair[0],lang_pair[1],'flores'))
26 | print(f"Task added {lang_pair[0]} - {lang_pair[1]}")
27 |
28 | for task in tasks:
29 | if args.source_contrastive or args.language_contrastive:
30 | print(f"Evaluating {task} multi_source")
31 | out_path = task.evaluate(model.translate_multi_source, 'contrastive', args.source_contrastive, args.source_weight, args.language_contrastive, args.language_weight)
32 | print(f"Translations saved in {out_path}")
33 | else:
34 | print(f"Evaluating {task} direct")
35 | out_path = task.evaluate(model.translate, 'direct')
36 | print(f"Translations saved in {out_path}")
37 |
38 | if __name__ == "__main__":
39 | parser = argparse.ArgumentParser()
40 | parser.add_argument("--model_path", type=str, default="",
41 | help="The HF model path")
42 | parser.add_argument("--language_pairs", type=str, default="",
43 | help="language pairs")
44 | parser.add_argument("--source_contrastive", nargs='?', const=1, type=int,
45 | help="enable contrastive input (randomly shuffled segments from input file). Optional INT defines number of contrastive inputs (default: 1).")
46 | parser.add_argument("--source_weight", type=float, default=-0.1,
47 | help="weight of source-contrastive variant. Default -0.1. If multiple contrastive inputs are used, this defines total weight assigned to them.")
48 | parser.add_argument("--language_contrastive", type=str, nargs='+', default=None,
49 | help="language codes of languages for which to construct contrastive variants. Can be multiple (space-separated); 'src' will be mapped to language code of source language. Example: '--language_contrastive en src' will create two contrastive inputs, one with English, one with the source language as desired output language.")
50 | parser.add_argument("--language_weight", type=float, default=-0.1,
51 | help="weight of contrastive variants with wrong language indicator. Default -0.1. If multiple contrastive inputs are used, this specifies weight assigned to each of them individually.")
52 | parser.add_argument("--oneshot", action='store_true', default=False,
53 | help="For LLaMa: provide one-shot translation example")
54 | args = parser.parse_args()
55 | main(args)
56 |
--------------------------------------------------------------------------------
/scripts/utils_run.py:
--------------------------------------------------------------------------------
1 | FLORES101_CONVERT = {
2 | "af": "afr",
3 | "am": "amh",
4 | "ar": "ara",
5 | "hy": "hye",
6 | "as": "asm",
7 | "ast": "ast",
8 | "az": "azj",
9 | "be": "bel",
10 | "bn": "ben",
11 | "bs": "bos",
12 | "bg": "bul",
13 | "my": "mya",
14 | "ca": "cat",
15 | "ceb": "ceb",
16 | "zh": "zho_simpl",
17 | "hr": "hrv",
18 | "cs": "ces",
19 | "da": "dan",
20 | "nl": "nld",
21 | "en": "eng",
22 | "et": "est",
23 | "tl": "tgl",
24 | "fi": "fin",
25 | "fr": "fra",
26 | "ff": "ful",
27 | "gl": "glg",
28 | "lg": "lug",
29 | "ka": "kat",
30 | "de": "deu",
31 | "el": "ell",
32 | "gu": "guj",
33 | "ha": "hau",
34 | "he": "heb",
35 | "hi": "hin",
36 | "hu": "hun",
37 | "is": "isl",
38 | "ig": "ibo",
39 | "id": "ind",
40 | "ga": "gle",
41 | "it": "ita",
42 | "ja": "jpn",
43 | "jv": "jav",
44 | "kea": "kea",
45 | "kam": "kam",
46 | "kn": "kan",
47 | "kk": "kaz",
48 | "km": "khm",
49 | "ko": "kor",
50 | "ky": "kir",
51 | "lo": "lao",
52 | "lv": "lav",
53 | "ln": "lin",
54 | "lt": "lit",
55 | "luo": "luo",
56 | "lb": "ltz",
57 | "mk": "mkd",
58 | "ms": "msa",
59 | "ml": "mal",
60 | "mt": "mlt",
61 | "mi": "mri",
62 | "mr": "mar",
63 | "mn": "mon",
64 | "ne": "npi",
65 | "nso": "nso",
66 | "no": "nob",
67 | "ny": "nya",
68 | "oc": "oci",
69 | "or": "ory",
70 | "om": "orm",
71 | "ps": "pus",
72 | "fa": "fas",
73 | "pl": "pol",
74 | "pt": "por",
75 | "pa": "pan",
76 | "ro": "ron",
77 | "ru": "rus",
78 | "sr": "srp",
79 | "sn": "sna",
80 | "sd": "snd",
81 | "sk": "slk",
82 | "sl": "slv",
83 | "so": "som",
84 | "ku": "ckb",
85 | "es": "spa",
86 | "sw": "swh",
87 | "sv": "swe",
88 | "tg": "tgk",
89 | "ta": "tam",
90 | "te": "tel",
91 | "th": "tha",
92 | "tr": "tur",
93 | "uk": "ukr",
94 | "umb": "umb",
95 | "ur": "urd",
96 | "uz": "uzb",
97 | "vi": "vie",
98 | "cy": "cym",
99 | "wo": "wol",
100 | "xh": "xho",
101 | "yo": "yor",
102 | "zu": "zul"
103 | }
104 |
105 | M2M100_TYPES = {
106 | 'af': 1,
107 | 'am': 1,
108 | 'ar': 2,
109 | 'hy': 1,
110 | 'as': 0,
111 | 'ast': 1,
112 | 'az': 1,
113 | 'be': 0,
114 | 'bn': 2,
115 | 'bs': 1,
116 | 'bg': 2,
117 | 'my': 1,
118 | 'ca': 2,
119 | 'ceb': 1,
120 | 'zh': 2,
121 | 'hr': 0,
122 | 'cs': 2,
123 | 'da': 2,
124 | 'nl': 2,
125 | 'en': 3,
126 | 'et': 2,
127 | 'tl': 0,
128 | 'fi': 2,
129 | 'fr': 3,
130 | 'ff': 0,
131 | 'gl': 2,
132 | 'lg': 0,
133 | 'ka': 2,
134 | 'de': 3,
135 | 'el': 2,
136 | 'gu': 1,
137 | 'ha': 1,
138 | 'he': 2,
139 | 'hi': 2,
140 | 'hu': 2,
141 | 'is': 2,
142 | 'ig': 1,
143 | 'id': 2,
144 | 'ga': 1,
145 | 'it': 3,
146 | 'ja': 2,
147 | 'jv': 2,
148 | 'kea': 0,
149 | 'kam': 0,
150 | 'kn': 1,
151 | 'kk': 1,
152 | 'km': 1,
153 | 'ko': 2,
154 | 'ky': 1,
155 | 'lo': 1,
156 | 'lv': 2,
157 | 'ln': 0,
158 | 'lt': 2,
159 | 'luo': 1,
160 | 'lb': 2,
161 | 'mk': 2,
162 | 'ms': 1,
163 | 'ml': 1,
164 | 'mt': 2,
165 | 'mr': 1,
166 | 'mi': 1,
167 | 'mn': 1,
168 | 'ne': 0,
169 | 'nso': 0,
170 | 'no': 2,
171 | 'ny': 1,
172 | 'oc': 0,
173 | 'or': 0,
174 | 'om': 1,
175 | 'ps': 1,
176 | 'fa': 2,
177 | 'pl': 2,
178 | 'pt': 3,
179 | 'pa': 1,
180 | 'ro': 2,
181 | 'ru': 3,
182 | 'sr': 2,
183 | 'sn': 1,
184 | 'sd': 0,
185 | 'sk': 2,
186 | 'sl': 2,
187 | 'so': 1,
188 | 'ku': 1,
189 | 'es': 3,
190 | 'sw': 1,
191 | 'sv': 2,
192 | 'tg': 1,
193 | 'ta': 1,
194 | 'te': 1,
195 | 'th': 2,
196 | 'tr': 2,
197 | 'uk': 2,
198 | 'umb': 1,
199 | 'ur': 1,
200 | 'uz': 0,
201 | 'vi': 2,
202 | 'cy': 1,
203 | 'wo': 0,
204 | 'xh': 1,
205 | 'yo': 1,
206 | 'zu': 1,
207 | }
208 |
209 | IGNORE_FOR_FLORES = ['as','kea','kam','ky','luo','mt','mr','ny','om','sn','ku','tg','te','umb']
210 |
211 | FAMILIES = {"Germanic":['af','da','nl','de','en','is','lb','no','sv','fy','yi'],'Slavic':['be','bs','bg','hr','cs','mk','pl','ru','sr','sk','sl','uk'],'Turkic':['az','ba','kk','tr','uz']}
212 |
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ZurichNLP/ContraDecode/261cb0de634414faf2d1d5b856b437988c2e462b/tests/__init__.py
--------------------------------------------------------------------------------
/tests/test_contrastive_target_id.py:
--------------------------------------------------------------------------------
1 | from unittest import TestCase
2 |
3 | from translation_models import load_translation_model
4 |
5 |
6 | class ContrastiveTargetIDTestCase(TestCase):
7 |
8 | def setUp(self) -> None:
9 | self.model = load_translation_model("m2m100_418M")
10 |
11 | def test_translate_multi_source_greedy(self):
12 | translation = self.model.translate_multi_source(
13 | multi_source_sentences=[
14 | "I like apples",
15 | "I like apples",
16 | ],
17 | src_langs=2 * ["en"],
18 | tgt_langs=[
19 | "de",
20 | "fr",
21 | ],
22 | src_weights=[0.8, 0.2], # Upweight German target ID
23 | num_beams=1,
24 | )
25 | self.assertEqual(translation, "Ich mag Äpfel")
26 |
27 | translation = self.model.translate_multi_source(
28 | multi_source_sentences=[
29 | "I like apples",
30 | "I like apples",
31 | ],
32 | src_langs=2 * ["en"],
33 | tgt_langs=[
34 | "de",
35 | "fr",
36 | ],
37 | src_weights=[0.2, 0.8], # Upweight French target ID
38 | num_beams=1,
39 | )
40 | self.assertEqual(translation, "J’aime les pommes")
41 |
42 | def test_translate_multi_source_beam_search(self):
43 | translation = self.model.translate_multi_source(
44 | multi_source_sentences=[
45 | "I like apples",
46 | "I like apples",
47 | ],
48 | src_langs=2 * ["en"],
49 | tgt_langs=[
50 | "de",
51 | "fr",
52 | ],
53 | src_weights=[0.8, 0.2], # Upweight German target ID
54 | num_beams=5,
55 | )
56 | self.assertEqual(translation, "Ich mag Äpfel")
57 |
58 | translation = self.model.translate_multi_source(
59 | multi_source_sentences=[
60 | "I like apples",
61 | "I like apples",
62 | ],
63 | src_langs=2 * ["en"],
64 | tgt_langs=[
65 | "de",
66 | "fr",
67 | ],
68 | src_weights=[0.2, 0.8], # Upweight French target ID
69 | num_beams=5,
70 | )
71 | self.assertEqual(translation, "J’aime les pommes")
72 |
--------------------------------------------------------------------------------
/tests/test_llama.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from unittest import TestCase
3 |
4 | from translation_models import load_translation_model
5 | from translation_models.llama import LLaMaTranslationModel, PromptTemplate
6 |
7 |
8 | # logging.basicConfig(level=logging.INFO)
9 |
10 |
11 | class PromptTemplateTestCase(TestCase):
12 |
13 | def setUp(self) -> None:
14 | self.template = PromptTemplate(system_prompt="You are an assistant.")
15 | self.template.add_user_message("Hello, how are you?")
16 |
17 | def test_build_prompt(self):
18 | prompt = self.template.build_prompt()
19 | print(prompt)
20 | self.assertEqual(prompt, """\
21 | [INST] <<SYS>>
22 | You are an assistant.
23 | <</SYS>> Hello, how are you? [/INST]""")
24 |
25 | def test_build_prompt__initial_inst(self):
26 | self.template.add_initial_inst = True
27 | prompt = self.template.build_prompt()
28 | print(prompt)
29 | self.assertEqual(prompt, """\
30 | [INST] <<SYS>>
31 | You are an assistant.
32 | <</SYS>>[INST] Hello, how are you? [/INST]""")
33 |
34 | def test_get_user_messages(self):
35 | self.template.add_model_reply("I am fine, thank you.", includes_history=False)
36 | user_messages = self.template.get_user_messages()
37 | self.assertEqual(user_messages, ["Hello, how are you?"])
38 |
39 | def test_get_model_replies(self):
40 | self.template.add_model_reply("I am fine, thank you.", includes_history=False)
41 | model_replies = self.template.get_model_replies()
42 | self.assertEqual(model_replies, ["I am fine, thank you."])
43 |
44 |
45 | class LLaMaTranslationModelTestCase(TestCase):
46 |
47 | def setUp(self) -> None:
48 | self.llama: LLaMaTranslationModel = load_translation_model("llama-2-7b-chat")
49 | # self.llama.one_shot = True
50 | self.assertEqual(self.llama._lang_code_to_name("en"), "English")
51 | self.assertEqual(self.llama._lang_code_to_name("de"), "German")
52 |
53 | def test_translate(self):
54 | source_sentences = [
55 | "Hello, how are you?",
56 | "An inquiry was established to investigate.",
57 | "On Monday, scientists from the Stanford University School of Medicine announced the invention of a new "
58 | "diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using "
59 | "standard inkjet printers for possibly about one U.S. cent each.",
60 | ]
61 |
62 | for tgt_lang in [
63 | "de",
64 | "fr",
65 | "ru",
66 | ]:
67 | translations = self.llama.translate(
68 | src_lang="en",
69 | tgt_lang=tgt_lang,
70 | source_sentences=source_sentences,
71 | num_beams=1,
72 | )
73 | for translation in translations:
74 | print(translation)
75 |
76 | def test_translate_multi_source(self):
77 | translation = self.llama.translate_multi_source(
78 | multi_source_sentences=[
79 | "I like apples",
80 | "I like apples",
81 | ],
82 | src_langs=2 * ["en"],
83 | tgt_langs=[
84 | "de",
85 | "fr",
86 | ],
87 | src_weights=[0.8, 0.2], # Upweight German target ID
88 | num_beams=1,
89 | )
90 | print(translation)
91 |
92 | translation = self.llama.translate_multi_source(
93 | multi_source_sentences=[
94 | "I like apples",
95 | "I like apples",
96 | ],
97 | src_langs=2 * ["en"],
98 | tgt_langs=[
99 | "de",
100 | "fr",
101 | ],
102 | src_weights=[0.2, 0.8], # Upweight French target ID
103 | num_beams=1,
104 | )
105 | print(translation)
106 |
--------------------------------------------------------------------------------
/translation_models/__init__.py:
--------------------------------------------------------------------------------
1 | import json
2 | import logging
3 | import os
4 | import warnings
5 |
6 | from pathlib import Path
7 | from typing import List, Union, Tuple, Set, Optional
8 |
9 | from tqdm import tqdm
10 |
11 |
12 | class TranslationModel:
13 |
14 | def __str__(self):
15 | raise NotImplementedError
16 |
17 | def translate(self,
18 | tgt_lang: str,
19 | source_sentences: Union[str, List[str]],
20 | src_lang: str = None,
21 | return_score: bool = False,
22 | batch_size: int = 8,
23 | num_beams: int = 5,
24 | **kwargs,
25 | ) -> Union[str, List[str], Tuple[str, float], List[Tuple[str, float]]]:
26 | """
27 | :param tgt_lang: Language code of the target language
28 | :param source_sentences: A sentence or list of sentences
29 | :param src_lang: Language code of the source language (not needed for some multilingual models)
30 | :param return_score: If true, return a tuple where the second element is the sequence-level score of the translation
31 | :param batch_size
32 | :param kwargs
33 | :return: A sentence or list of sentences
34 | """
35 | if isinstance(source_sentences, str):
36 | source_sentences_list = [source_sentences]
37 | elif isinstance(source_sentences, list):
38 | source_sentences_list = source_sentences
39 | else:
40 | raise ValueError
41 |
42 | self._set_tgt_lang(tgt_lang)
43 | if self.requires_src_lang:
44 | if src_lang is None:
45 | warnings.warn(f"NMT model {self} requires the src language. Assuming 'en'; override with `src_lang`")
46 | src_lang = "en"
47 | self._set_src_lang(src_lang)
48 | translations_list = self._translate(source_sentences_list, return_score, batch_size, num_beams=num_beams, **kwargs)
49 | assert len(translations_list) == len(source_sentences_list)
50 |
51 | if isinstance(source_sentences, str):
52 | translations = translations_list[0]
53 | else:
54 | translations = translations_list
55 | return translations
56 |
57 | def _translate_multi_source(self,
58 | multi_source_sentences: List[str],
59 | src_langs: List[str],
60 | tgt_langs: List[str],
61 | src_weights: Optional[List[float]] = None,
62 | **kwargs,
63 | ) -> str:
64 | raise NotImplementedError
65 |
66 |
67 | def translate_multi_source(self,
68 | multi_source_sentences: List[str],
69 | tgt_langs: List[str],
70 | src_langs: Optional[List[str]] = None,
71 | src_weights: Optional[List[float]] = None,
72 | num_beams: int = 5,
73 | **kwargs,
74 | ) -> str:
75 | translation = None
76 |
77 | if translation is None:
78 | self._set_tgt_lang(tgt_langs[0])
79 | if self.requires_src_lang:
80 | assert src_langs is not None
81 | translation = self._translate_multi_source(multi_source_sentences, src_langs, tgt_langs, src_weights=src_weights, num_beams=num_beams, **kwargs)
82 |
83 | return translation
84 |
85 |
86 | def load_translation_model(name: str, **kwargs) -> TranslationModel:
87 | """
88 | Convenience function to load a :class: TranslationModel using a shorthand name of the model
89 | """
90 | if name == "m2m100_418M":
91 | from translation_models.m2m100 import M2M100Model
92 | translation_model = M2M100Model(model_name_or_path="facebook/m2m100_418M", **kwargs)
93 | elif name == "m2m100_1.2B":
94 | from translation_models.m2m100 import M2M100Model
95 | translation_model = M2M100Model(model_name_or_path="facebook/m2m100_1.2B", **kwargs)
96 | elif name == "small100":
97 | from translation_models.small100 import SMaLL100Model
98 | translation_model = SMaLL100Model(model_name_or_path="alirezamsh/small100", **kwargs)
99 | elif name == "llama-2-7b-chat":
100 | from translation_models.llama import LLaMaTranslationModel
101 | translation_model = LLaMaTranslationModel(model_name_or_path="meta-llama/Llama-2-7b-chat-hf", **kwargs)
102 | elif name == "llama-2-13b-chat":
103 | from translation_models.llama import LLaMaTranslationModel
104 | translation_model = LLaMaTranslationModel(model_name_or_path="meta-llama/Llama-2-13b-chat-hf", **kwargs)
105 | elif name == "llama-2-70b-chat":
106 | from translation_models.llama import LLaMaTranslationModel
107 | translation_model = LLaMaTranslationModel(model_name_or_path="meta-llama/Llama-2-70b-chat-hf", **kwargs)
108 | else:
109 | raise NotImplementedError
110 | return translation_model
111 |
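# Illustrative usage (a sketch, not called anywhere in this repository; assumes a GPU at device 0):
#
#   model = load_translation_model("m2m100_418M", device=0)
#   print(model.translate(tgt_lang="de", src_lang="en", source_sentences="I like apples"))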
--------------------------------------------------------------------------------
/translation_models/llama.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from typing import Set, List, Union, Tuple, Optional
3 |
4 | import torch
5 | from tqdm import tqdm
6 | from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, LogitsProcessorList
7 |
8 | from scripts.utils_run import FLORES101_CONVERT
9 | from translation_models import TranslationModel
10 | from translation_models.m2m100 import EnsembleLogitsProcessor
11 | from translation_models.utils_llama import language_names, one_shot_sentences
12 |
13 |
14 | class LLaMaTranslationModel(TranslationModel):
15 |
16 | # Official templates used during instruction tuning of LLaMA
17 | TEMPLATE_0 = "{src_sent}\n\nTranslate to {tgt_lang}"
18 | TEMPLATE_1 = "{src_sent}\n\nCould you please translate this to {tgt_lang}?"
19 | TEMPLATE_2 = "{src_sent}\n\nTranslate this to {tgt_lang}?"
20 | TEMPLATE_3 = "Translate to {tgt_lang}:\n\n{src_sent}"
21 | TEMPLATE_4 = "Translate the following sentence to {tgt_lang}:\n{src_sent}"
22 | TEMPLATE_5 = "How is \"{src_sent}\" said in {tgt_lang}?"
23 | TEMPLATE_6 = "Translate \"{src_sent}\" to {tgt_lang}?"
24 |
25 | SYSTEM_PROMPT = """You are a machine translation system that translates sentences from {src_lang} to {tgt_lang}. You just respond with the translation, without any additional comments."""
26 |
27 | def __init__(self,
28 | model_name_or_path: str,
29 | message_template: str = TEMPLATE_0,
30 | one_shot: bool = False,
31 | padding: str = "before_system_prompt",
32 | **kwargs,
33 | ):
34 | super().__init__()
35 | self.model_name_or_path = model_name_or_path
36 | self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map='auto', load_in_4bit=True,
37 | torch_dtype=torch.bfloat16)
38 | self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
39 | self.pipeline = pipeline('text-generation', model=self.model, tokenizer=self.tokenizer)
40 | self.message_template = message_template
41 | self.one_shot = one_shot
42 | assert padding in ["before_system_prompt", "after_system_prompt"]
43 | self.padding = padding
44 | self.src_lang = None
45 | self.tgt_lang = None
46 |
47 | def __str__(self):
48 | return str(self.model_name_or_path).replace("/", "_")
49 |
50 | @property
51 | def supported_languages(self) -> Set[str]:
52 | return {code for code, code3 in FLORES101_CONVERT.items() if code3 in language_names}
53 |
54 | def requires_src_lang(self):
55 | return True
56 |
57 | def _set_src_lang(self, src_lang: str):
58 | assert src_lang in self.supported_languages
59 | self.src_lang = src_lang
60 |
61 | def _set_tgt_lang(self, tgt_lang: str):
62 | assert tgt_lang in self.supported_languages
63 | self.tgt_lang = tgt_lang
64 |
65 | def _lang_code_to_name(self, lang_code: str) -> str:
66 | lang_code3 = FLORES101_CONVERT.get(lang_code, lang_code)
67 | return language_names[lang_code3]
68 |
69 | @torch.no_grad()
70 | def _translate(self,
71 | source_sentences: List[str],
72 | return_score: bool = False,
73 | batch_size: int = 1,
74 | num_beams: int = 1,
75 | **kwargs,
76 | ) -> Union[List[str], List[Tuple[str, float]]]:
77 | if return_score:
78 | raise NotImplementedError
79 | if batch_size != 1:
80 | logging.warning(
81 | f"Batch size {batch_size} is not supported by LLaMaTranslationModel. Setting batch size to 1.")
82 | batch_size = 1
83 | if num_beams != 1:
84 | logging.warning(f"Beam search is not supported by LLaMaTranslationModel. Setting num_beams to 1.")
85 | num_beams = 1
86 |
87 | assert self.src_lang is not None
88 | assert self.tgt_lang is not None
89 | system_prompt = self.SYSTEM_PROMPT.format(
90 | src_lang=self._lang_code_to_name(self.src_lang),
91 | tgt_lang=self._lang_code_to_name(self.tgt_lang),
92 | )
93 |
94 | if self.one_shot:
95 | system_prompt += "\n\nExample instruction:\n{instruction}\n\nExample response:\nSure, here's the translation:\n{response}".format(
96 | instruction=self.message_template.format(
97 | src_lang=self._lang_code_to_name(self.src_lang),
98 | tgt_lang=self._lang_code_to_name(self.tgt_lang),
99 | src_sent=one_shot_sentences[FLORES101_CONVERT.get(self.src_lang, self.src_lang)],
100 | ),
101 | response=one_shot_sentences[FLORES101_CONVERT.get(self.tgt_lang, self.tgt_lang)],
102 | )
103 |
104 | translations = []
105 | for source_sentence in tqdm(source_sentences):
106 | prompt_template = PromptTemplate(system_prompt=system_prompt)
107 | message = self.message_template.format(
108 | src_lang=self._lang_code_to_name(self.src_lang),
109 | tgt_lang=self._lang_code_to_name(self.tgt_lang),
110 | src_sent=source_sentence,
111 | )
112 | logging.info(message)
113 | prompt_template.add_user_message(message)
114 | prompt = prompt_template.build_prompt()
115 | prompt += "Sure, here's the translation:\n"
116 | inputs = self.pipeline.preprocess(prompt)
117 | output = self.pipeline.forward(
118 | inputs,
119 | eos_token_id=self.tokenizer.eos_token_id,
120 | max_length=1200, # Max ref length across Flores-101 is 960
121 | remove_invalid_values=True,
122 | num_beams=num_beams,
123 | # Disable sampling
124 | do_sample=False,
125 | temperature=1.0,
126 | top_p=1.0,
127 | )
128 | output = self.pipeline.postprocess(output)
129 | output = output[0]['generated_text']
130 | logging.info(output)
131 | prompt_template.add_model_reply(output, includes_history=True)
132 | response = prompt_template.get_model_replies(strip=True)[0]
133 | response_lines = response.replace("Sure, here's the translation:", "").strip().split("\n")
134 | if not response_lines:
135 | translation = ""
136 | else:
137 | translation = response_lines[0].strip()
138 | translations.append(translation)
139 | return translations
140 |
141 | def _translate_multi_source(self,
142 | multi_source_sentences: List[str],
143 | src_langs: List[str],
144 | tgt_langs: List[str],
145 | src_weights: Optional[List[float]] = None,
146 | num_beams: int = 1,
147 | **kwargs,
148 | ) -> str:
149 | assert len(multi_source_sentences) == len(src_langs) == len(tgt_langs)
150 | if src_weights is not None:
151 | assert len(src_weights) == len(multi_source_sentences)
152 | if num_beams != 1:
153 | logging.warning(f"Beam search is not supported by LLaMaTranslationModel. Setting num_beams to 1.")
154 | num_beams = 1
155 |
156 | prompts = []
157 | prompt_templates = []
158 | for src_sent, src_lang, tgt_lang in zip(multi_source_sentences, src_langs, tgt_langs):
159 | system_prompt = self.SYSTEM_PROMPT.format(
160 | src_lang=self._lang_code_to_name(src_lang),
161 | tgt_lang=self._lang_code_to_name(tgt_lang),
162 | )
163 | if self.one_shot:
164 | system_prompt += "\n\nExample instruction:\n{instruction}\n\nExample response:\nSure, here's the translation:\n{response}".format(
165 | instruction=self.message_template.format(
166 | src_lang=self._lang_code_to_name(src_lang),
167 | tgt_lang=self._lang_code_to_name(tgt_lang),
168 | src_sent=one_shot_sentences[FLORES101_CONVERT.get(src_lang, src_lang)],
169 | ),
170 | response=one_shot_sentences[FLORES101_CONVERT.get(tgt_lang, tgt_lang)],
171 | )
172 | prompt_template = PromptTemplate(system_prompt=system_prompt)
173 | message = self.message_template.format(
174 | src_lang=self._lang_code_to_name(src_lang),
175 | tgt_lang=self._lang_code_to_name(tgt_lang),
176 | src_sent=src_sent,
177 | )
178 | prompt_template.add_user_message(message)
179 | prompt = prompt_template.build_prompt()
180 | prompt += "Sure, here's the translation:\n"
181 | prompts.append(prompt)
182 | prompt_templates.append(prompt_template)
183 |
184 | inputs = [self.pipeline.preprocess(prompt) for prompt in prompts]
185 | input_ids = [x['input_ids'][0].tolist() for x in inputs]
186 | attention_mask = [x['attention_mask'][0].tolist() for x in inputs]
187 |
188 | pad_token_id = self.tokenizer.get_vocab()["▁"]
189 | max_len = max(len(x) for x in input_ids)
190 | if self.padding == "before_system_prompt":
191 | input_ids = [[pad_token_id] * (max_len - len(x)) + x for x in input_ids]
192 | attention_mask = [[0] * (max_len - len(x)) + x for x in attention_mask]
193 | elif self.padding == "after_system_prompt":
194 | sys_end_id = self.tokenizer.get_vocab()[">>"]
195 | for i in range(len(input_ids)):
196 | second_inst_idx = input_ids[i].index(sys_end_id, 1)
197 | input_ids[i] = (input_ids[i][:second_inst_idx + 1] +
198 | [pad_token_id] * (max_len - len(input_ids[i])) +
199 | input_ids[i][second_inst_idx + 1:])
200 | attention_mask[i] = (attention_mask[i][:second_inst_idx + 1] +
201 | [0] * (max_len - len(attention_mask[i])) +
202 | attention_mask[i][second_inst_idx + 1:])
203 |
204 | input_ids = torch.tensor(input_ids).to(self.model.device)
205 | attention_mask = torch.tensor(attention_mask).to(self.model.device)
206 | logits_processor = LogitsProcessorList([
207 | EnsembleLogitsProcessor(num_beams=num_beams, source_weights=src_weights),
208 | ])
209 | output = self.model.generate(
210 | input_ids=input_ids,
211 | attention_mask=attention_mask,
212 | num_beams=num_beams,
213 | eos_token_id=self.tokenizer.eos_token_id,
214 | max_length=1200,
215 | logits_processor=logits_processor,
216 | remove_invalid_values=True,
217 | # Disable sampling
218 | do_sample=False,
219 | temperature=1.0,
220 | top_p=1.0,
221 | **kwargs,
222 | )
223 | output = output.reshape(1, output.shape[0], *output.shape[1:])
224 | output = {
225 | "generated_sequence": output,
226 | "input_ids": input_ids[0],
227 | "prompt_text": prompts[0],
228 | }
229 | output = self.pipeline._ensure_tensor_on_device(output, device=torch.device("cpu"))
230 | output = self.pipeline.postprocess(output)
231 | output = output[0]['generated_text']
232 | _, output = output.rsplit("[/INST]", maxsplit=1)
233 | logging.info(output)
234 | prompt_templates[0].add_model_reply(output, includes_history=False)
235 | response = prompt_templates[0].get_model_replies(strip=True)[0]
236 | response_lines = response.replace("Sure, here's the translation:", "").strip().split("\n")
237 | if not response_lines:
238 | translation = ""
239 | else:
240 | translation = response_lines[0].strip()
241 | return translation
242 |
243 |
244 | class PromptTemplate:
245 | """
246 | Manages the conversation with a LLaMa chat model.
247 |
248 | Adapted from https://github.com/samrawal/llama2_chat_templater
249 | (c) Sam Rawal
250 |
251 | Adapted to be more similar to https://huggingface.co/blog/llama2#how-to-prompt-llama-2
252 | """
253 |
254 | def __init__(self, system_prompt=None, add_initial_inst=True):
255 | self.system_prompt = system_prompt
256 | self.add_initial_inst = add_initial_inst
257 | self.user_messages = []
258 | self.model_replies = []
259 |
260 | def add_user_message(self, message: str, return_prompt=True):
261 | self.user_messages.append(message)
262 | if return_prompt:
263 | return self.build_prompt()
264 |
265 | def add_model_reply(self, reply: str, includes_history=True, return_reply=True):
266 | reply_ = reply.replace(self.build_prompt(), "") if includes_history else reply
267 | self.model_replies.append(reply_)
268 | if len(self.user_messages) != len(self.model_replies):
269 | raise ValueError(
270 | "Number of user messages does not equal number of system replies."
271 | )
272 | if return_reply:
273 | return reply_
274 |
275 | def get_user_messages(self, strip=True):
276 | return [x.strip() for x in self.user_messages] if strip else self.user_messages
277 |
278 | def get_model_replies(self, strip=True):
279 | return [x.strip() for x in self.model_replies] if strip else self.model_replies
280 |
281 | def build_prompt(self):
282 | if len(self.user_messages) != len(self.model_replies) + 1:
283 | raise ValueError(
284 | "Error: Expected len(user_messages) = len(model_replies) + 1. Add a new user message!"
285 | )
286 |
287 | if self.system_prompt is not None:
288 | SYS = f"[INST] <<SYS>>\n{self.system_prompt}\n<</SYS>>"
289 | else:
290 | SYS = ""
291 |
292 | CONVO = ""
293 | SYS = "" + SYS
294 | for i in range(len(self.user_messages) - 1):
295 | user_message, model_reply = self.user_messages[i], self.model_replies[i]
296 | conversation_ = f"{user_message} [/INST] {model_reply} "
297 | if i != 0:
298 | conversation_ = "[INST] " + conversation_
299 | CONVO += conversation_
300 |
301 | if self.add_initial_inst:
302 | CONVO += f"[INST] {self.user_messages[-1]} [/INST]"
303 | else:
304 | if len(self.user_messages) <= 1:
305 | CONVO += f" {self.user_messages[-1]} [/INST]"
306 | else:
307 | raise NotImplementedError
308 |
309 | return SYS + CONVO
310 |
--------------------------------------------------------------------------------
/translation_models/m2m100.py:
--------------------------------------------------------------------------------
1 | from typing import List, Union, Tuple, Set, Optional
2 |
3 | import torch
4 | import sys
5 | from tqdm import tqdm
6 | from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer, LogitsProcessorList, LogitsProcessor, \
7 | ForcedBOSTokenLogitsProcessor
8 | from transformers.file_utils import PaddingStrategy
9 |
10 | from translation_models import TranslationModel
11 | from translation_models.utils import batch
12 | import torch.nn.functional as F
13 |
14 |
15 | def zero_out_max(x):
16 | max_num, max_index = x.max(dim=0)
17 | x[max_index] = 0
18 | return x
19 |
20 |
21 | class EnsembleLogitsProcessor(LogitsProcessor):
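"""
Ensembles the next-token distributions of the parallel (multi-source) inputs:
after softmaxing, the scores of each beam are replaced by log(sum_i w_i * p_i),
where p_i is the distribution for input i and w_i its weight. Negative weights
implement the contrastive penalty; NaNs from negative sums are mapped to -inf.
"""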
22 |
23 | def __init__(self, num_beams: int, source_weights: List[float] = None, preserve_bos_token: bool = False):
24 | self.num_beams = num_beams
25 | self.source_weights = source_weights
26 | self.preserve_bos_token = preserve_bos_token
27 |
28 | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
29 | cur_len = input_ids.shape[-1]
30 | if self.preserve_bos_token and cur_len <= 1:
31 | return scores
32 |
33 | scores = F.softmax(scores, dim=-1)
34 |
35 | batch_size = int(input_ids.size(0) / self.num_beams)
36 | if self.source_weights is not None:
37 | assert len(self.source_weights) == batch_size
38 | source_weights = torch.Tensor(self.source_weights).to(scores.device)
39 | else:
40 | source_weights = 1/(batch_size-1) * torch.ones((batch_size,), device=scores.device)
41 | for i in range(self.num_beams):
42 | beam_indices = self.num_beams * torch.arange(batch_size, device=scores.device, dtype=torch.long) + i
43 | cands = scores[beam_indices]
44 | mean_scores = torch.log((source_weights.unsqueeze(-1).expand(-1, scores.size(-1)) * cands).sum(dim=0))
45 | for j in beam_indices:
46 | scores[j] = mean_scores
47 |
48 | if torch.isnan(scores).any():
49 | scores = torch.nan_to_num(scores, nan=float('-inf'))
50 |
51 | return scores
52 |
53 |
54 | class BatchedForcedBOSTokenLogitsProcessor(ForcedBOSTokenLogitsProcessor):
55 | r"""
56 | [`LogitsProcessor`] that enforces the specified token as the first generated token.
57 | This subclass allows for different forced bos tokens per row in the batch.
58 | """
59 |
60 | def __init__(self, bos_token_ids: List[int]):
61 | self.bos_token_ids = bos_token_ids
62 |
63 | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
64 | cur_len = input_ids.shape[-1]
65 | if cur_len != 1:
66 | return scores
67 |
68 | batch_size = len(self.bos_token_ids)
69 | num_beams = int(input_ids.size(0) / batch_size)
70 | # Disable all tokens by default
71 | scores.fill_(-float("inf"))
72 | for row_idx in range(batch_size):
73 | # Set the bos token for all beams corresponding to this input row
74 | row_indices = torch.arange(num_beams, device=scores.device) + row_idx * num_beams
75 | scores[row_indices, self.bos_token_ids[row_idx]] = 0
76 | return scores
77 |
78 |
79 | class M2M100Model(TranslationModel):
80 | """
81 | Loads one of the models described in: Fan, Angela, et al. "Beyond english-centric multilingual machine
82 | translation." Journal of Machine Learning Research 22.107 (2021): 1-48.
83 |
84 | Uses the implementation of the Hugging Face Transformers library
85 | (https://huggingface.co/docs/transformers/model_doc/m2m_100).
86 | """
87 |
88 | def __init__(self,
89 | model_name_or_path: str = "facebook/m2m100_418M",
90 | device=None,
91 | ):
92 | self.tokenizer = M2M100Tokenizer.from_pretrained(model_name_or_path)
93 | self.model_name_or_path = model_name_or_path
94 | self.model = M2M100ForConditionalGeneration.from_pretrained(model_name_or_path)
95 | if device is not None:
96 | self.model = self.model.to(device)
97 | self.model.config.max_length = max(self.model.config.max_length, self.model.config.max_position_embeddings - 4)
98 |
99 | def __str__(self):
100 | return self.model_name_or_path
101 |
102 | @property
103 | def supported_languages(self) -> Set[str]:
104 | return {'af', 'am', 'ar', 'ast', 'az', 'ba', 'be', 'bg', 'bn', 'br', 'bs', 'ca', 'ceb', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'fa', 'ff', 'fi', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'ilo', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'lb', 'lg', 'ln', 'lo', 'lt', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'my', 'ne', 'nl', 'no', 'ns', 'oc', 'or', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'sd', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'ss', 'su', 'sv', 'sw', 'ta', 'th', 'tl', 'tn', 'tr', 'uk', 'ur', 'uz', 'vi', 'wo', 'xh', 'yi', 'yo', 'zh', 'zu'}
105 |
106 | @property
107 | def requires_src_lang(self) -> bool:
108 | return True
109 |
110 | def _set_src_lang(self, src_lang: str):
111 | assert src_lang in self.supported_languages
112 | self.src_lang = src_lang
113 | self.tokenizer.src_lang = src_lang
114 |
115 | def _set_tgt_lang(self, tgt_lang: str):
116 | assert tgt_lang in self.supported_languages
117 | self.tgt_lang = tgt_lang
118 | self.tokenizer.tgt_lang = tgt_lang
119 |
120 | @torch.no_grad()
121 | def _translate(self,
122 | source_sentences: List[str],
123 | return_score: bool = False,
124 | batch_size: int = 8,
125 | num_beams: int = 5,
126 | **kwargs,
127 | ) -> Union[List[str], List[Tuple[str, float]]]:
128 | padding_strategy = PaddingStrategy.LONGEST if batch_size > 1 else PaddingStrategy.DO_NOT_PAD
129 | translations = []
130 | for src_sentences in tqdm(list(batch(source_sentences, batch_size)), disable=len(source_sentences) / batch_size < 10):
131 | inputs = self.tokenizer._batch_encode_plus(src_sentences, return_tensors="pt",
132 | padding_strategy=padding_strategy)
133 | inputs = inputs.to(self.model.device)
134 | model_output = self.model.generate(
135 | **inputs,
136 | forced_bos_token_id=self.tokenizer.get_lang_id(self.tgt_lang),
137 | num_beams=num_beams,
138 | return_dict_in_generate=True,
139 | output_scores=return_score,
140 | **kwargs,
141 | )
142 | batch_translations = self.tokenizer.batch_decode(model_output.sequences, skip_special_tokens=True)
143 | if return_score:
144 | # Does not match our score method output for some reason; need to investigate further
145 | # scores = (2 ** model_output.sequences_scores).tolist()
146 | scores = [None for _ in batch_translations]
147 | assert len(batch_translations) == len(scores)
148 | batch_translations = list(zip(batch_translations, scores))
149 | translations += batch_translations
150 | return translations
151 |
152 | def _translate_multi_source(self,
153 | multi_source_sentences: List[str],
154 | src_langs: List[str],
155 | tgt_langs: List[str],
156 | src_weights: Optional[List[float]] = None,
157 | num_beams: int = 1,
158 | **kwargs,
159 | ) -> str:
160 | assert len(multi_source_sentences) == len(src_langs) == len(tgt_langs)
161 | #src_weights = [1, -0.1]
162 | if src_weights is not None:
163 | assert len(src_weights) == len(multi_source_sentences)
164 |
165 | inputs = self.tokenizer._batch_encode_plus(multi_source_sentences, return_tensors="pt",
166 | padding_strategy=PaddingStrategy.LONGEST)
167 | # Set individual src language token per row
168 | for i, src_lang in enumerate(src_langs):
169 | inputs["input_ids"][i][0] = self.tokenizer.get_lang_id(src_lang)
170 | inputs = inputs.to(self.model.device)
171 | logits_processor = LogitsProcessorList([
172 | BatchedForcedBOSTokenLogitsProcessor([self.tokenizer.get_lang_id(tgt_lang) for tgt_lang in tgt_langs]),
173 | EnsembleLogitsProcessor(num_beams=num_beams, source_weights=src_weights, preserve_bos_token=True),
174 | ])
175 | model_output = self.model.generate(
176 | **inputs,
177 | num_beams=num_beams,
178 | return_dict_in_generate=True,
179 | logits_processor=logits_processor,
180 | **kwargs,
181 | )
182 | translations = self.tokenizer.batch_decode(model_output.sequences, skip_special_tokens=True)
183 | return translations[0]
184 |
--------------------------------------------------------------------------------
/translation_models/small100.py:
--------------------------------------------------------------------------------
1 | from typing import List, Union, Tuple, Set, Optional
2 | import torch
3 | from tqdm import tqdm
4 | from transformers import LogitsProcessorList, LogitsProcessor
5 | from transformers.file_utils import PaddingStrategy
6 | from translation_models import TranslationModel
7 | from translation_models.utils import batch
8 | import torch.nn.functional as F
9 | from translation_models.m2m100 import zero_out_max, EnsembleLogitsProcessor
10 | from transformers import M2M100ForConditionalGeneration
11 | from translation_models.tokenization_small100 import SMALL100Tokenizer
12 |
13 |
14 | class SMaLL100Model(TranslationModel):
15 | """
16 | Uses the implementation of the Hugging Face Transformers library
17 | (https://huggingface.co/alirezamsh/small100).
18 | """
19 |
20 | def __init__(self,
21 | model_name_or_path: str = "alirezamsh/small100",
22 | device=None,
23 | ):
24 | self.tokenizer = SMALL100Tokenizer.from_pretrained(model_name_or_path)
25 | self.model_name_or_path = model_name_or_path
26 | self.model = M2M100ForConditionalGeneration.from_pretrained(model_name_or_path)
27 | if device is not None:
28 | self.model = self.model.to(device)
29 | self.model.config.max_length = max(self.model.config.max_length, self.model.config.max_position_embeddings - 4)
30 |
31 | def __str__(self):
32 | return self.model_name_or_path
33 |
34 | @property
35 | def supported_languages(self) -> Set[str]:
36 | return {'af', 'am', 'ar', 'ast', 'az', 'ba', 'be', 'bg', 'bn', 'br', 'bs', 'ca', 'ceb', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'fa', 'ff', 'fi', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'ilo', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'lb', 'lg', 'ln', 'lo', 'lt', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'my', 'ne', 'nl', 'no', 'ns', 'oc', 'or', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'sd', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'ss', 'su', 'sv', 'sw', 'ta', 'th', 'tl', 'tn', 'tr', 'uk', 'ur', 'uz', 'vi', 'wo', 'xh', 'yi', 'yo', 'zh', 'zu'}
37 |
38 | @property
39 | def ranked_languages(self) -> List[str]:
40 | return ["en", "es", "fr", "de", "pt", "ru", "nl", "sv", "pl", "tr", "id", "zh", "vi", "ar", "el", "cz", "ja", "hu", "fi", "ko", "he", "fa", "lt", "hi"]
41 |
42 | @property
43 | def requires_src_lang(self) -> bool:
44 | return False
45 |
46 | def _set_src_lang(self, src_lang: str):
47 | assert src_lang in self.supported_languages
48 | self.src_lang = src_lang
49 | #self.tokenizer.src_lang = src_lang
50 |
51 | def _set_tgt_lang(self, tgt_lang: str):
52 | assert tgt_lang in self.supported_languages
53 | self.tgt_lang = tgt_lang
54 | self.tokenizer.tgt_lang = tgt_lang
55 |
56 | @torch.no_grad()
57 | def _translate(self,
58 | source_sentences: List[str],
59 | return_score: bool = False,
60 | batch_size: int = 8,
61 | num_beams: int = 5,
62 | **kwargs,
63 | ) -> Union[List[str], List[Tuple[str, float]]]:
64 | padding_strategy = PaddingStrategy.LONGEST if batch_size > 1 else PaddingStrategy.DO_NOT_PAD
65 | translations = []
66 | for src_sentences in tqdm(list(batch(source_sentences, batch_size)), disable=len(source_sentences) / batch_size < 10):
67 | inputs = self.tokenizer._batch_encode_plus(src_sentences, return_tensors="pt",
68 | padding_strategy=padding_strategy)
69 | inputs = inputs.to(self.model.device)
70 | model_output = self.model.generate(
71 | **inputs,
72 | num_beams=num_beams,
73 | return_dict_in_generate=True,
74 | output_scores=return_score,
75 | **kwargs,
76 | )
77 | batch_translations = self.tokenizer.batch_decode(model_output.sequences, skip_special_tokens=True)
78 | if return_score:
79 | # Does not match our score method output for some reason; need to investigate further
80 | # scores = (2 ** model_output.sequences_scores).tolist()
81 | scores = [None for _ in batch_translations]
82 | assert len(batch_translations) == len(scores)
83 | batch_translations = list(zip(batch_translations, scores))
84 | translations += batch_translations
85 | return translations
86 |
87 | def _translate_multi_source(self,
88 | multi_source_sentences: List[str],
89 | src_langs: List[str],
90 | tgt_langs: List[str],
91 | src_weights: Optional[List[float]] = None,
92 | num_beams: int = 1,
93 | **kwargs,
94 | ) -> str:
95 | assert len(multi_source_sentences) == len(src_langs)
96 | #src_weights = [0.5,0.25,0.25]
97 |
98 | inputs = self.tokenizer._batch_encode_plus(multi_source_sentences, return_tensors="pt",
99 | padding_strategy=PaddingStrategy.LONGEST)
100 | # Set individual target language token per row (SMaLL-100 encodes the target language on the source side)
101 | for i, src_lang in enumerate(src_langs):
102 | inputs["input_ids"][i][0] = self.tokenizer.get_lang_id(tgt_langs[i])
103 | inputs = inputs.to(self.model.device)
104 | logits_processor = LogitsProcessorList([EnsembleLogitsProcessor(num_beams=num_beams, source_weights=src_weights)])
105 | model_output = self.model.generate(
106 | **inputs,
107 | num_beams=num_beams,
108 | return_dict_in_generate=True,
109 | logits_processor=logits_processor,
110 | **kwargs,
111 | )
112 | translations = self.tokenizer.batch_decode(model_output.sequences, skip_special_tokens=True)
113 | return translations[0]
114 |
--------------------------------------------------------------------------------
/translation_models/tokenization_small100.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2022 Idiap Research Institute, http://www.idiap.ch/
2 | # Written by Alireza Mohammadshahi
3 | # This is a modified version of https://github.com/huggingface/transformers/blob/main/src/transformers/models/m2m_100/tokenization_m2m_100.py
4 | # which is owned by the Fairseq Authors and The HuggingFace Inc. team.
5 | #
6 | #
7 | # Licensed under the Apache License, Version 2.0 (the "License");
8 | # you may not use this file except in compliance with the License.
9 | # You may obtain a copy of the License at
10 | #
11 | # http://www.apache.org/licenses/LICENSE-2.0
12 | #
13 | # Unless required by applicable law or agreed to in writing, software
14 | # distributed under the License is distributed on an "AS IS" BASIS,
15 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16 | # See the License for the specific language governing permissions and
17 | # limitations under the License.
18 | """Tokenization classes for SMALL100."""
19 | import json
20 | import os
21 | from pathlib import Path
22 | from shutil import copyfile
23 | from typing import Any, Dict, List, Optional, Tuple, Union
24 |
25 | import sentencepiece
26 |
27 | from transformers.tokenization_utils import BatchEncoding, PreTrainedTokenizer
28 | from transformers.utils import logging
29 |
30 |
31 | logger = logging.get_logger(__name__)
32 |
33 | SPIECE_UNDERLINE = "▁"
34 |
35 | VOCAB_FILES_NAMES = {
36 | "vocab_file": "vocab.json",
37 | "spm_file": "sentencepiece.bpe.model",
38 | "tokenizer_config_file": "tokenizer_config.json",
39 | }
40 |
41 | PRETRAINED_VOCAB_FILES_MAP = {
42 | "vocab_file": {
43 | "alirezamsh/small100": "https://huggingface.co/alirezamsh/small100/resolve/main/vocab.json",
44 | },
45 | "spm_file": {
46 | "alirezamsh/small100": "https://huggingface.co/alirezamsh/small100/resolve/main/sentencepiece.bpe.model",
47 | },
48 | "tokenizer_config_file": {
49 | "alirezamsh/small100": "https://huggingface.co/alirezamsh/small100/resolve/main/tokenizer_config.json",
50 | },
51 | }
52 |
53 | PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
54 | "alirezamsh/small100": 1024,
55 | }
56 |
57 | # fmt: off
58 | FAIRSEQ_LANGUAGE_CODES = {
59 | "m2m100": ["af", "am", "ar", "ast", "az", "ba", "be", "bg", "bn", "br", "bs", "ca", "ceb", "cs", "cy", "da", "de", "el", "en", "es", "et", "fa", "ff", "fi", "fr", "fy", "ga", "gd", "gl", "gu", "ha", "he", "hi", "hr", "ht", "hu", "hy", "id", "ig", "ilo", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko", "lb", "lg", "ln", "lo", "lt", "lv", "mg", "mk", "ml", "mn", "mr", "ms", "my", "ne", "nl", "no", "ns", "oc", "or", "pa", "pl", "ps", "pt", "ro", "ru", "sd", "si", "sk", "sl", "so", "sq", "sr", "ss", "su", "sv", "sw", "ta", "th", "tl", "tn", "tr", "uk", "ur", "uz", "vi", "wo", "xh", "yi", "yo", "zh", "zu"]
60 | }
61 | # fmt: on
62 |
63 |
64 | class SMALL100Tokenizer(PreTrainedTokenizer):
65 | """
66 |     Construct a SMALL100 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
67 | This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
68 | this superclass for more information regarding those methods.
69 | Args:
70 | vocab_file (`str`):
71 | Path to the vocabulary file.
72 | spm_file (`str`):
73 | Path to [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that
74 | contains the vocabulary.
75 | tgt_lang (`str`, *optional*):
76 | A string representing the target language.
77 |         eos_token (`str`, *optional*, defaults to `"</s>"`):
78 |             The end of sequence token.
79 |         sep_token (`str`, *optional*, defaults to `"</s>"`):
80 |             The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
81 |             sequence classification or for a text and a question for question answering. It is also used as the last
82 |             token of a sequence built with special tokens.
83 |         unk_token (`str`, *optional*, defaults to `"<unk>"`):
84 |             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
85 |             token instead.
86 |         pad_token (`str`, *optional*, defaults to `"<pad>"`):
87 | The token used for padding, for example when batching sequences of different lengths.
88 | language_codes (`str`, *optional*):
89 | What language codes to use. Should be `"m2m100"`.
90 | sp_model_kwargs (`dict`, *optional*):
91 | Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
92 | SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
93 | to set:
94 | - `enable_sampling`: Enable subword regularization.
95 | - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
96 | - `nbest_size = {0,1}`: No sampling is performed.
97 | - `nbest_size > 1`: samples from the nbest_size results.
98 |             - `nbest_size < 0`: assuming that nbest_size is infinite and samples from all hypotheses (lattice)
99 |             using the forward-filtering-and-backward-sampling algorithm.
100 | - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
101 | BPE-dropout.
102 | Examples:
103 | ```python
104 | >>> from tokenization_small100 import SMALL100Tokenizer
105 | >>> tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="ro")
106 | >>> src_text = " UN Chief Says There Is No Military Solution in Syria"
107 | >>> tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
108 | >>> model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
109 | >>> model(**model_inputs) # should work
110 | ```"""
111 |
112 | vocab_files_names = VOCAB_FILES_NAMES
113 | max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
114 | pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
115 | model_input_names = ["input_ids", "attention_mask"]
116 |
117 | prefix_tokens: List[int] = []
118 | suffix_tokens: List[int] = []
119 |
120 | def __init__(
121 | self,
122 | vocab_file,
123 | spm_file,
124 | tgt_lang=None,
125 |         bos_token="<s>",
126 |         eos_token="</s>",
127 |         sep_token="</s>",
128 |         pad_token="<pad>",
129 |         unk_token="<unk>",
130 | language_codes="m2m100",
131 | sp_model_kwargs: Optional[Dict[str, Any]] = None,
132 | num_madeup_words=8,
133 | **kwargs,
134 | ) -> None:
135 | self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
136 |
137 | self.language_codes = language_codes
138 | fairseq_language_code = FAIRSEQ_LANGUAGE_CODES[language_codes]
139 | self.lang_code_to_token = {lang_code: f"__{lang_code}__" for lang_code in fairseq_language_code}
140 |
141 | kwargs["additional_special_tokens"] = kwargs.get("additional_special_tokens", [])
142 | kwargs["additional_special_tokens"] += [
143 | self.get_lang_token(lang_code)
144 | for lang_code in fairseq_language_code
145 | if self.get_lang_token(lang_code) not in kwargs["additional_special_tokens"]
146 | ]
147 |
148 | super().__init__(
149 | tgt_lang=tgt_lang,
150 | bos_token=bos_token,
151 | eos_token=eos_token,
152 | sep_token=sep_token,
153 | unk_token=unk_token,
154 | pad_token=pad_token,
155 | language_codes=language_codes,
156 | sp_model_kwargs=self.sp_model_kwargs,
157 | num_madeup_words=num_madeup_words,
158 | **kwargs,
159 | )
160 |
161 | self.vocab_file = vocab_file
162 | self.encoder = load_json(vocab_file)
163 | self.decoder = {v: k for k, v in self.encoder.items()}
164 | self.spm_file = spm_file
165 | self.sp_model = load_spm(spm_file, self.sp_model_kwargs)
166 |
167 | self.encoder_size = len(self.encoder)
168 |
169 | self.lang_token_to_id = {
170 | self.get_lang_token(lang_code): self.encoder_size + i for i, lang_code in enumerate(fairseq_language_code)
171 | }
172 | self.lang_code_to_id = {lang_code: self.encoder_size + i for i, lang_code in enumerate(fairseq_language_code)}
173 | self.id_to_lang_token = {v: k for k, v in self.lang_token_to_id.items()}
174 |
175 | self._tgt_lang = tgt_lang if tgt_lang is not None else "en"
176 | self.cur_lang_id = self.get_lang_id(self._tgt_lang)
177 | self.set_lang_special_tokens(self._tgt_lang)
178 |
179 | self.num_madeup_words = num_madeup_words
180 |
181 | @property
182 | def vocab_size(self) -> int:
183 | return len(self.encoder) + len(self.lang_token_to_id) + self.num_madeup_words
184 |
185 | @property
186 | def tgt_lang(self) -> str:
187 | return self._tgt_lang
188 |
189 | @tgt_lang.setter
190 | def tgt_lang(self, new_tgt_lang: str) -> None:
191 | self._tgt_lang = new_tgt_lang
192 | self.set_lang_special_tokens(self._tgt_lang)
193 |
194 | def _tokenize(self, text: str) -> List[str]:
195 | return self.sp_model.encode(text, out_type=str)
196 |
197 | def _convert_token_to_id(self, token):
198 | if token in self.lang_token_to_id:
199 | return self.lang_token_to_id[token]
200 | return self.encoder.get(token, self.encoder[self.unk_token])
201 |
202 | def _convert_id_to_token(self, index: int) -> str:
203 |         """Converts an index (integer) to a token (str) using the decoder."""
204 | if index in self.id_to_lang_token:
205 | return self.id_to_lang_token[index]
206 | return self.decoder.get(index, self.unk_token)
207 |
208 | def convert_tokens_to_string(self, tokens: List[str]) -> str:
209 |         """Converts a sequence of tokens (strings for sub-words) into a single string."""
210 | return self.sp_model.decode(tokens)
211 |
212 | def get_special_tokens_mask(
213 | self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
214 | ) -> List[int]:
215 | """
216 | Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
217 | special tokens using the tokenizer `prepare_for_model` method.
218 | Args:
219 | token_ids_0 (`List[int]`):
220 | List of IDs.
221 | token_ids_1 (`List[int]`, *optional*):
222 | Optional second list of IDs for sequence pairs.
223 | already_has_special_tokens (`bool`, *optional*, defaults to `False`):
224 | Whether or not the token list is already formatted with special tokens for the model.
225 | Returns:
226 | `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
227 | """
228 |
229 | if already_has_special_tokens:
230 | return super().get_special_tokens_mask(
231 | token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
232 | )
233 |
234 | prefix_ones = [1] * len(self.prefix_tokens)
235 | suffix_ones = [1] * len(self.suffix_tokens)
236 | if token_ids_1 is None:
237 | return prefix_ones + ([0] * len(token_ids_0)) + suffix_ones
238 | return prefix_ones + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + suffix_ones
239 |
240 | def build_inputs_with_special_tokens(
241 | self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
242 | ) -> List[int]:
243 | """
244 |         Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
245 |         adding special tokens. A SMALL100 sequence has the following format, where `X` represents the sequence:
246 |         - `input_ids` (for encoder) `[tgt_lang_code] X [eos]`
247 |         - `decoder_input_ids`: (for decoder) `X [eos]`
248 | BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
249 | separator.
250 | Args:
251 | token_ids_0 (`List[int]`):
252 | List of IDs to which the special tokens will be added.
253 | token_ids_1 (`List[int]`, *optional*):
254 | Optional second list of IDs for sequence pairs.
255 | Returns:
256 | `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
257 | """
258 | if token_ids_1 is None:
259 | if self.prefix_tokens is None:
260 | return token_ids_0 + self.suffix_tokens
261 | else:
262 | return self.prefix_tokens + token_ids_0 + self.suffix_tokens
263 | # We don't expect to process pairs, but leave the pair logic for API consistency
264 | if self.prefix_tokens is None:
265 | return token_ids_0 + token_ids_1 + self.suffix_tokens
266 | else:
267 | return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens
268 |
269 | def get_vocab(self) -> Dict:
270 | vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
271 | vocab.update(self.added_tokens_encoder)
272 | return vocab
273 |
274 | def __getstate__(self) -> Dict:
275 | state = self.__dict__.copy()
276 | state["sp_model"] = None
277 | return state
278 |
279 | def __setstate__(self, d: Dict) -> None:
280 | self.__dict__ = d
281 |
282 | # for backward compatibility
283 | if not hasattr(self, "sp_model_kwargs"):
284 | self.sp_model_kwargs = {}
285 |
286 | self.sp_model = load_spm(self.spm_file, self.sp_model_kwargs)
287 |
288 | def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
289 | save_dir = Path(save_directory)
290 | if not save_dir.is_dir():
291 | raise OSError(f"{save_directory} should be a directory")
292 | vocab_save_path = save_dir / (
293 | (filename_prefix + "-" if filename_prefix else "") + self.vocab_files_names["vocab_file"]
294 | )
295 | spm_save_path = save_dir / (
296 | (filename_prefix + "-" if filename_prefix else "") + self.vocab_files_names["spm_file"]
297 | )
298 |
299 | save_json(self.encoder, vocab_save_path)
300 |
301 | if os.path.abspath(self.spm_file) != os.path.abspath(spm_save_path) and os.path.isfile(self.spm_file):
302 | copyfile(self.spm_file, spm_save_path)
303 | elif not os.path.isfile(self.spm_file):
304 | with open(spm_save_path, "wb") as fi:
305 | content_spiece_model = self.sp_model.serialized_model_proto()
306 | fi.write(content_spiece_model)
307 |
308 | return (str(vocab_save_path), str(spm_save_path))
309 |
310 | def prepare_seq2seq_batch(
311 | self,
312 | src_texts: List[str],
313 | tgt_texts: Optional[List[str]] = None,
314 | tgt_lang: str = "ro",
315 | **kwargs,
316 | ) -> BatchEncoding:
317 | self.tgt_lang = tgt_lang
318 | self.set_lang_special_tokens(self.tgt_lang)
319 | return super().prepare_seq2seq_batch(src_texts, tgt_texts, **kwargs)
320 |
321 | def _build_translation_inputs(self, raw_inputs, tgt_lang: Optional[str], **extra_kwargs):
322 | """Used by translation pipeline, to prepare inputs for the generate function"""
323 | if tgt_lang is None:
324 | raise ValueError("Translation requires a `tgt_lang` for this model")
325 | self.tgt_lang = tgt_lang
326 | inputs = self(raw_inputs, add_special_tokens=True, **extra_kwargs)
327 | return inputs
328 |
329 | def _switch_to_input_mode(self):
330 | self.set_lang_special_tokens(self.tgt_lang)
331 |
332 | def _switch_to_target_mode(self):
333 | self.prefix_tokens = None
334 | self.suffix_tokens = [self.eos_token_id]
335 |
336 | def set_lang_special_tokens(self, src_lang: str) -> None:
337 |         """Reset the special tokens to the target language setting. Prefix=[tgt_lang_code] and suffix=[eos]."""
338 | lang_token = self.get_lang_token(src_lang)
339 | self.cur_lang_id = self.lang_token_to_id[lang_token]
340 | self.prefix_tokens = [self.cur_lang_id]
341 | self.suffix_tokens = [self.eos_token_id]
342 |
343 | def get_lang_token(self, lang: str) -> str:
344 | return self.lang_code_to_token[lang]
345 |
346 | def get_lang_id(self, lang: str) -> int:
347 | lang_token = self.get_lang_token(lang)
348 | return self.lang_token_to_id[lang_token]
349 |
350 |
351 | def load_spm(path: str, sp_model_kwargs: Dict[str, Any]) -> sentencepiece.SentencePieceProcessor:
352 | spm = sentencepiece.SentencePieceProcessor(**sp_model_kwargs)
353 | spm.Load(str(path))
354 | return spm
355 |
356 |
357 | def load_json(path: str) -> Union[Dict, List]:
358 | with open(path, "r") as f:
359 | return json.load(f)
360 |
361 |
362 | def save_json(data, path: str) -> None:
363 | with open(path, "w") as f:
364 | json.dump(data, f, indent=2)
365 |
--------------------------------------------------------------------------------
/translation_models/utils.py:
--------------------------------------------------------------------------------
1 | from typing import List
2 |
3 |
4 | def batch(input: List, batch_size: int):
5 | l = len(input)
6 | for ndx in range(0, l, batch_size):
7 | yield input[ndx:min(ndx + batch_size, l)]
8 |
--------------------------------------------------------------------------------
/translation_models/utils_llama.py:
--------------------------------------------------------------------------------
1 | language_names = {'eng': 'English', 'afr': 'Afrikaans', 'amh': 'Amharic', 'ara': 'Arabic', 'hye': 'Armenian',
2 | 'asm': 'Assamese', 'ast': 'Asturian', 'azj': 'Azerbaijani', 'bel': 'Belarusian', 'ben': 'Bengali', 'bos': 'Bosnian',
3 | 'bul': 'Bulgarian', 'mya': 'Burmese', 'cat': 'Catalan', 'ceb': 'Cebuano', 'zho_simpl': 'Chinese (Simplified)',
4 | 'zho_trad': 'Chinese (Traditional)', 'hrv': 'Croatian', 'ces': 'Czech', 'dan': 'Danish', 'nld': 'Dutch',
5 | 'est': 'Estonian', 'tgl': 'Filipino (Tagalog)', 'fin': 'Finnish', 'fra': 'French', 'ful': 'Fula', 'glg': 'Galician',
6 | 'lug': 'Ganda', 'kat': 'Georgian', 'deu': 'German', 'ell': 'Greek', 'guj': 'Gujarati', 'hau': 'Hausa',
7 | 'heb': 'Hebrew', 'hin': 'Hindi', 'hun': 'Hungarian', 'isl': 'Icelandic', 'ibo': 'Igbo', 'ind': 'Indonesian',
8 | 'gle': 'Irish', 'ita': 'Italian', 'jpn': 'Japanese', 'jav': 'Javanese', 'kea': 'Kabuverdianu', 'kam': 'Kamba',
9 | 'kan': 'Kannada', 'kaz': 'Kazakh', 'khm': 'Khmer', 'kor': 'Korean', 'kir': 'Kyrgyz', 'lao': 'Lao', 'lav': 'Latvian',
10 | 'lin': 'Lingala', 'lit': 'Lithuanian', 'luo': 'Luo', 'ltz': 'Luxembourgish', 'mkd': 'Macedonian', 'msa': 'Malay',
11 | 'mal': 'Malayalam', 'mlt': 'Maltese', 'mri': 'Māori', 'mar': 'Marathi', 'mon': 'Mongolian', 'npi': 'Nepali',
12 | 'nso': 'Northern Sotho', 'nob': 'Norwegian', 'nya': 'Nyanja', 'oci': 'Occitan', 'ory': 'Oriya', 'orm': 'Oromo',
13 | 'pus': 'Pashto', 'fas': 'Persian', 'pol': 'Polish', 'por': 'Portuguese (Brazil)', 'pan': 'Punjabi',
14 | 'ron': 'Romanian', 'rus': 'Russian', 'srp': 'Serbian', 'sna': 'Shona', 'snd': 'Sindhi', 'slk': 'Slovak',
15 | 'slv': 'Slovenian', 'som': 'Somali', 'ckb': 'Sorani Kurdish', 'spa': 'Spanish (Latin America)', 'swh': 'Swahili',
16 | 'swe': 'Swedish', 'tgk': 'Tajik', 'tam': 'Tamil', 'tel': 'Telugu', 'tha': 'Thai', 'tur': 'Turkish',
17 | 'ukr': 'Ukrainian', 'umb': 'Umbundu', 'urd': 'Urdu', 'uzb': 'Uzbek', 'vie': 'Vietnamese', 'cym': 'Welsh',
18 | 'wol': 'Wolof', 'xho': 'Xhosa', 'yor': 'Yoruba', 'zul': 'Zulu', }
19 |
20 | # First sentence from "dev" split
21 | one_shot_sentences = {
22 | 'afr': 'Op Maandag het wetenskaplikes vanaf die Stanford Universiteit Medieseskool aangekondig dat die uitvinding van ‘n diagnostiese hulpmiddel wat selle kan rangskik volgens tipe: ‘n klein drukbare skyfie wat vervaardig kan word deur om standard inkjetprinters te gebruik vir ‘n moontlike een VS sent elk.',
23 | 'amh': 'ሰኞ እለት፣ በስታንፎርድ ዩኒቨርሲቲ የህክምና ትምህርት ቤት ህዋሶችን በአይነት የሚያስቀምጥ አዲስ የምርመራ መሳሪያ እንደተፈጠረ አስታውቋል፡ እያንዳንዱን በአንደ የዩ.ኤስ ሳንቲም የሚሆን መደበኛ የኢንክጄት አታሚዎችን በመጠቀም ሊፈበረክ የሚችል ትንሽ መታተም የሚችል ቺፕ።',
24 | 'ara': 'في يوم الاثنين، أعلن علماء من كلية الطب بجامعة ستانفورد عن اختراع أداة تشخيصية جديدة يمكنها تصنيف الخلايا حسب النوع: شريحة صغيرة قابلة للطباعة يمكن تصنيعها باستخدام طابعات نفاثة للحبر قياسية مقابل حوالي سنت أمريكي واحد لكل منها.',
25 | 'hye': 'Երկուշաբթի օրը Բժշկության Ստենֆորդի համալսարանի դպրոցի գիտնականները հայտարարեց նոր ախտորոշման գործիքի ստեղծման մասին, որը կարող է տեսակավորել բջիջներն ըստ տեսակի՝ փոքրիկ տպման հնարավորությամբ չիպ, որը կարող է արտադրվել՝ օգտագործելով ստանդարտ թանաքային տպիչներ յուրաքանչյուրը հավանաբար մոտ մեկ ԱՄՆ ցենտով։',
26 | 'asm': "সোমবাৰে, ষ্টেনফ'ৰ্ড ইউনিভাৰচিটি স্কুল অৱ মেডিচিনৰ বিজ্ঞানীসকলে ঘোষণা কৰিছিল যে তেওঁলোকে এটা নতুন ৰোগ চিনাক্তকৰণ সঁজুলি আৱিষ্কাৰ কৰিছিল যিয়ে ধৰণ অনুসৰি কোষসমূহ ক্ৰম কৰে: এটা ক্ষুদ্ৰ মুদ্ৰণযোগ্য চিপ যিটো প্ৰত্যেকটোতে সম্ভৱপৰ প্ৰায় 1 U.S. চেণ্ট খৰচ হোৱাকৈ ষ্টেণ্ডাৰ্ড ইনজেক্ট প্ৰিণ্টাৰ ব্যৱহাৰ কৰি প্ৰস্তুত কৰা হৈছে।",
27 | 'ast': 'El llunes, científicos de la Escuela Universitaria de Medicina de la Universidad de Stanford, ficieron anuncia de la invención d’una ferramienta de diagnósticu que puede dixebrar les célules acordies col tipu: un chip imprimible minúsculu que se puede manufacturar usando impresores de tinta estándar pol preciu d’aproximadamente un centavu de dólar per unidá.',
28 | 'azj': 'Bazar ertəsi, Stanford Universitetinin Tibb Fakültəsinin professorları, hüceyrələri növlərinə görə kateqoriyalara bölən yeni bir diaqnoz cihazının ixtira edildiyini bildirib. Bu kiçik çap edilə bilən çip, hər biri təxminən 1 ABŞ sentinə başa gələn standart mürəkkəb kartricli printerlər vasitəsilə istehsal edilə bilər.',
29 | 'bel': 'Навукоўцы медыцынскага факультэту Стэнфардскага ўніверсітэту абвясцілі ў панядзелак аб адкрыцці новага дыягнастычнага сродку, які дазваляе сартаваць клеткі па тыпах. Гаворка ідзе пра маленькі друкаваны чып, які можна надрукаваць на звычайным струменным прынтары. Кошт такога чыпу — каля аднаго цэнту ЗША за адзінку.',
30 | 'ben': 'সোমবার, স্ট্যানফোর্ড বিশ্ববিদ্যালয়ের স্কুল অফ মেডিসিনের বিজ্ঞানীরা একটি নতুন রোগ নির্ণয়ের যন্ত্র আবিষ্কারের ঘোষণা দিয়েছেন যেটি কোষগুলি প্রকারভেদে আলাদা করতে পারে: একটি ছোট মুদ্রণযোগ্য চিপ যার সম্ভাব্য মূল্য প্রায় মার্কিন এক সেন্ট মানসম্পন্ন ইঙ্কজেট প্রিন্টার ব্যবহার করে তৈরি করা যেতে পারে।',
31 | 'bos': 'Naučnici sa Medicinskog fakulteta Univerziteta u Stanfordu su u ponedjeljak najavili izum novog dijagnostičkog alata koji može da sortira stanice prema tipu: maleni čip koji se može štampati i proizvoditi korištenjem standardnih tintnih štampača za oko jedan američki cent.',
32 | 'bul': 'Учените от медицинското училище в университета в Станфод обявиха в понеделник изобретяването на нов диагностичен инструмент, който може да сортира клетките по тип: малък печатен чип, който може да бъде произведен с помощта на стандартни мастилено-струйни принтери за вероятно около един американски цент всеки.',
33 | 'mya': 'တနင်္လာနေ့တွင် စတန်းဖို့ဒ်တက္ကသိုလ် ဆေးကျောင်းမှ သိပ္ပံပညာရှင်များသည် ဆဲလ်များကို အမျိုးအစားအလိုက် စီစဉ်နိုင်သော ရောဂါခွဲခြားစစ်ဆေးမှု ကိရိယာအသစ် တီထွင်မှုအကြောင်းကို ကြေညာခဲ့သည်- ၎င်းသည် စံနှုန်းမီ မင်စက်ကလေးများဖြင့် ပုံဖော်သည့် ပရင်တာများကို သုံးကာ ထုတ်လုပ်နိုင်သော အလွန်သေးငယ်သည့် တစ်ခုလျှင် U.S. ဆင့်တစ်ပြားသာသာရှိသော ပရင့်လုပ်၍ရသော အပြားလေးဖြစ်ပါသည်။',
34 | 'cat': "Dilluns, científics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenció d'una nova eina per fer diagnòstics que pot separar les cèl·lules per tipus: un petit xip que pot ser imprès i fabricat usant impressores d'injecció estàndards per probablement un cèntim americà cadascun.",
35 | 'ceb': 'Kaniadtong Lunes, gipahibalo sa mga siyentista gikan sa Stanford University School of Medicine ang pag-imbento sa usa ka bag-ong himan sa pagdayagnos nga makahimo sa paghan-ay sa mga selyula pinaagi sa lahi: usa ka gamay nga maimprinta nga tipik nga mahimong gamit ang sagad nga mga inkjet printer alang sa tingali mga usa ka sentimo sa Estados Unidos.',
36 | 'zho_simpl': '周一,斯坦福大学医学院的科学家宣布,他们发明了一种可以将细胞按类型分类的新型诊断工具:一种可打印的微型芯片。这种芯片可以使用标准喷墨打印机制造,每片价格可能在一美分左右。',
37 | 'hrv': 'U ponedjeljak su znanstvenici s Medicinskog fakulteta Sveučilišta u Stanfordu najavili izum novog dijagnostičkog alata koji omogućuje sortiranje stanica prema tipu: to je sićušni čip koji je moguće isprintati s pomoću običnog tintnog pisača za cijenu od oko jedan američki cent po komadu.',
38 | 'ces': 'V pondělí vědci z Lékařské fakulty Stanfordovy univerzity oznámili vynález nového diagnostického nástroje, který dokáže třídit buňky podle typu: malý vytisknutelný čip, který lze vyrobit pomocí standardních inkoustových tiskáren za cenu přibližně jednoho amerického centu za kus.',
39 | 'dan': 'Mandag offentliggjorde forskere fra Stanford Universitys School of Medicine opfindelsen af et nyt diagnoseværktøj, som kan sortere celler efter type: en ganske lille printbar chip, som kan fremstilles med almindelige inkjetprintere, muligvis for omkring én US-cent stykket.',
40 | 'nld': 'Op maandag kondigden wetenschappers van de Stanford University School of Medicine aan dat een nieuw diagnostisch hulpmiddel is uitgevonden dat cellen op type kan ordenen: een piepkleine af te drukken chip die voor waarschijnlijk ongeveer één dollarcent per stuk kan worden geproduceerd met behulp van standaard inkjetprinters.',
41 | 'eng': 'On Monday, scientists from the Stanford University School of Medicine announced the invention of a new diagnostic tool that can sort cells by type: a tiny printable chip that can be manufactured using standard inkjet printers for possibly about one U.S. cent each.',
42 | 'est': 'Esmaspäeval teatasid Stanfordi ülikooli meditsiiniteaduskonna teadlased, et leiutasid uue diagnostilise tööriista, millega saab rakke tüübi järgi sortida; see on väike prinditav kiip, mida saab toota standardsete tindiprinteritega ja ühe hind võib olla vaid 1USA sent.',
43 | 'tgl': 'Noong Lunes, inanunsiyo ng mga siyentipiko mula sa Stanford University School of Medicine ang imbensyon ng panibagong kagamitan sa pag-diagnose na makakauri sa mga cell ayon sa uri: isang maliit na chip na maaaring maprint na maaaring magawa gamit ang standard inkjet na mga printer at posibleng nasa isang U.S. sentimo kada isa.',
44 | 'fin': 'Stanfordin yliopiston lääketieteen laitoksen tutkijat ilmoittivat maanantaina uuden diagnostiikkatyökalun keksimisestä: solut tyypin mukaan lajitteleva pienenpieni tulostettava siru, joka voidaan valmistaa normaaleilla mustesuihkutulostimilla mahdollisesti noin yhden Yhdysvaltain sentin kappalehintaan.',
45 | 'fra': "Des scientifiques de l’école de médecine de l’université de Stanford ont annoncé ce lundi la création d'un nouvel outil de diagnostic, qui permettrait de différencier les cellules en fonction de leur type. Il s'agit d'une petit puce imprimable, qui peut être produite au moyen d'une imprimante à jet d'encre standard, pour un coût d'environ un cent de dollar pièce.",
46 | 'ful': 'Nyande Altine, himɓe ekkititta hala tsari duniya je huwata ha Makaranta dokotoro en je Stanfod wi kuje kesum man je heftata iri kujeji ɓandu kuje mari yonki wawan maha ɗum be kujeji noone printa ɗum nanga sisi 1.',
47 | 'glg': 'O luns, científicos da Escola de Medicina da Universidade de Stanford anunciaron a creación dunha nova ferramenta diagnóstica que pode clasificar as células polo seu tipo: un minúsculo chip imprimible que se pode fabricar empregando impresoras de inxección de tinta normais cun custo de probablemente un céntimo de dólar estadounidense cada un.',
48 | 'lug': "Ku balaza, Banasayansi okuva mu setendekero ya Stanford ku somero ly'ebyedagala balangirira okuvumbulwa kwa akuuma akakebera nga kasobola okusengeka obutafaali nga kasinzira kukika kyabwo: Akuuma katono akasobola okufulumizibwa ku lupapula akasobola okukolebwa ne Printa enungi ku sente entono nga emu eya US buli kamu.",
49 | 'kat': 'ორშაბათს, სტენფორდის უნივერსიტეტის სამედიცინო კათედრის მეცნიერებმა გააკეთეს განცხადება ახალი დიაგნოსტიკური ხელსაწყოს გამოგონების შესახებ, რომელსაც უჯრედის ტიპებად დაყოფა შეუძლია: წვრილი, დასაბეჭდი ჩიპი, რომლის წარმოებაც შესაძლებელია ჩვეულებრივი ჭავლური პრინტერის საშუალებით. ერთი ცალის სავარაუდო ღირებულება აშშ ცენტს შეადგენს.',
50 | 'deu': 'Am Montag haben die Wisenschaftler der Stanford University School of Medicine die Erfindung eines neuen Diagnosetools bekanntgegeben, mit dem Zellen nach ihrem Typ sortiert werden können: ein winziger, ausdruckbarer Chip, der für jeweils etwa einen US-Cent mit Standard-Tintenstrahldruckern hergestellt werden kann.',
51 | 'ell': 'Τη Δευτέρα, επιστήμονες από την Ιατρική Σχολή του Πανεπιστημίου του Στάνφορντ ανακοίνωσαν την εφεύρεση ενός νέου εργαλείου διάγνωσης με δυνατότητα ομαδοποίησης των κυττάρων ανά τύπο: ένα μικροσκοπικό εκτυπώσιμο τσιπ που μπορεί να κατασκευαστεί με απλούς εκτυπωτές ψεκασμού μελάνης με κόστος περίπου ένα σεντ του αμερικανικού δολαρίου το καθένα.',
52 | 'guj': 'સોમવારે, સ્ટેનફોર્ડ યુનિવર્સિટી સ્કૂલ ઓફ મેડિસિનના વૈજ્ઞાનિકોએ એક નવા નિદાન સાધનની શોધની જાહેરાત કરી હતી જે કોષોને પ્રકાર દ્વારા ક્રમમાં ગોઠવી શકે છેઃ એક નાનકડી પ્રિન્ટેબલ ચિપ કે જે સંભવતઃ એક અમેરિકન સેન્ટ માટે સ્ટાન્ડર્ડ ઇન્કજેટ પ્રિન્ટર્સનો ઉપયોગ કરીને બનાવી શકાય.',
53 | 'hau': "A ranar Litinin, masana kimiyya daga Kwalejin Kimiyya ta Jami'ar Stanford suka sanar da kirkirar sabon kayan aikin bincike wanda zai iya rarrabe kwayoyin hallita ta nau'i: ƙaramin cip mai buguwa wanda za'a iya kera ta yin amfani da daidaitattun firinta na inkjet don yiwuwar kusan Amurka ɗaya kowannensu.",
54 | 'heb': 'ביום שני, מדענים מבית הספר לרפואה באוניברסיטת סטנפורד הכריזו על המצאת אמצעי אבחון חדש אשר מסוגל למיין תאים לפי סוג: שבב זעיר הניתן להדפסה שאפשר לייצר באמצעות מדפסות הזרקת דיו סטנדרטיות בעלות של בערך סנט אמריקאי אחד לכל פריט.',
55 | 'hin': 'सोमवार को, स्टैनफ़ोर्ड यूनिवर्सिटी स्कूल ऑफ़ मेडिसिन के वैज्ञानिकों ने एक नए डायग्नोस्टिक उपकरण के आविष्कार की घोषणा की जो कोशिकाओं को उनके प्रकार के आधार पर छाँट सकता है: एक छोटी प्रिंट करने योग्य चिप जिसे स्टैण्डर्ड इंकजेट प्रिंटर का उपयोग करके लगभग एक अमेरिकी सेंट के लिए निर्मित किया जा सकता है.',
56 | 'hun': 'Hétfőn a Stanford Egyetem Orvostudományi Kara bejelentette egy új diagnosztikai eszköz feltalálását, amely képes típus szerint rendszerezni a sejteket: ez egy apró nyomtatható csip, amelyet szabványos tintasugaras nyomtatóval lehet előállítani, akár darabját egy amerikai centért.',
57 | 'isl': 'Á mánudag tilkynntu vísindamenn frá læknadeild Stanford-háskóla uppfinningu á nýju greiningartæki sem getur flokkað frumur eftir tegund: örlítill prentanleg flaga sem hægt er að framleiða með venjulegum bleksprautuprentara fyrir mögulega um eitt bandarískt sent stykkið.',
58 | 'ibo': 'Na Monde, ndi oka mmuta sayensi sitere na Mahadum Stanford Ulo-akwukwo nke Ogwu kwuputara nchoputa nke ngwa-oru nyochaputa ohuu nwere it hoputa mkpuru-ahu site udi: otu mpekere ngwa nka nke enwere ike iruputa site na iji inkjeti igwe-prita din mma nke enwere-ike imeputa na otu senti mba US n’otu.',
59 | 'ind': 'Ilmuwan dari Stanford University School of Medicine pada hari Senin mengumumkan penemuan alat diagnostik baru yang bisa mengurutkan sel berdasarkan tipe: cip kecil dapat dicetak yang bisa diproduksi menggunakan printer inkjet standar dengan biaya sekitar satu sen AS per cip.',
60 | 'gle': "Ar an Luan, d’fhógair eolaithe ó Scoil Leighis Ollscoil Stanford go rabhthas tar éis uirlis nua dhiagnóiseach a dhéanamh ar féidir léi cealla a shórtáil de réir cineáil: sliseanna beag inphriontáilte is féidir a mhonarú le scairdphrintéirí ar thart ar ceint amháin S.A.. b'fhéidir.",
61 | 'ita': "Nella giornata di lunedì, alcuni scienziati della Scuola di Medicina dell'Università di Stanford hanno annunciato l'invenzione di un nuovo strumento diagnostico capace di ordinare le cellule in base al tipo: un chip minuscolo che può essere stampato utilizzando stampanti a getto di inchiostro al costo di circa 1 centesimo di dollaro l'uno.",
62 | 'jpn': '月曜日にスタンフォード大学医学部の科学者たちは、細胞を種類別に分類できる新しい診断ツールを発明したと発表しました。それは標準的なインクジェットプリンタで印刷して製造できる小型チップであり、原価は1枚あたり1円ほどす。',
63 | 'jav': 'Ing dina Senin, ilmuwan saka Fakultas Kedokteran Universitas Stanford ngumumake penemuan piranti diagnosa sing isa misah-misahake sel adhedhasar jenis: kepingan cilik sing isa dicetak sing isa diasilake kanthi nggunakake printer inkjet standar kanggo mungkin saben-saben watara saji sen AS.',
64 | 'kea': 'Na sigunda-fera, sientistas di Skóla di Midisina di Universidadi di Stanford anúnsia invenson di un novu faraménta di diagnótiku ki pode klasifika sélulas pa tipu: un xiipi pikinoti ki ta inprimi ki pode ser fabrikadu ku uzu di inprisoras di jatu di tinta padron pa pusivelmenti serka di un séntimu merkanu kada.',
65 | 'kam': 'wakwambiliilya, anasayanzi kuma sukulu wa stanford wa dawa matangaasie kuseuvya kwa muio mweu wa kuthima ula utonya kuvathukanya cell na muvae : kachip kaini katonya uprinitiwa kala katonya usoovwa kutumia printa ya inkjet standard kwa centi imwe ya U.S.',
66 | 'kan': 'ಅಂದಾಜು ತಲಾ ಒಂದು ಯು.ಎಸ್\u200c ಸೆಂಟ್\u200cನಿಂದ ಪ್ರಮಾಣಿತ ಇಂಕ್\u200cಜೆಟ್ ಪ್ರಿಂಟರುಗಳನ್ನು ಬಳಸಿ ಉತ್ಪಾದಿಸಬಹುದಾದ ಸಣ್ಣ ಪ್ರಿಂಟ್ ಮಾಡಬಹುದಾದ ಚಿಪ್\u200c ಆಗಿರುವ ಸೆಲ್\u200cಗಳ ವಿಧವನ್ನು ಆಯೋಜಿಸುವ ಹೊಸ ಪತ್ತೆ ಪರಿಕರದ ಅನ್ವೇಷಣೆಯ ಘೋಷಣೆಯನ್ನು ಸ್ಟಾನ್\u200cಫೋರ್ಡ್\u200c ವಿಶ್ವವಿದ್ಯಾಲಯದ ಔಷಧ ಶಾಲೆಯ ವಿದ್ಯಾರ್ಥಿಗಳು ಸೋಮವಾರ ಮಾಡಿದ್ದಾರೆ.',
67 | 'kaz': 'Дүйсенбі күні Стэнфорд университетінің медицина факультетінің ғалымдары жасушаларды түрі бойынша сұрыптай алатын жаңа диагностикалық құралды ойлап тапқанын жариялады: бұл әрбірін шамамен бір АҚШ центке стандартты сиялы принтерлерді пайдаланып өндіруге болатын өте кішкентай басып шығаруға болатын чип.',
68 | 'khm': 'កាល\u200bពី\u200bថ្ងៃ\u200bច័ន្ទ អ្នកវិទ្យាសាស្ត្រ\u200bមក\u200bពី\u200bសាលា\u200bវេជ្ជសាស្ត្រ\u200bនៃ\u200bសាកល\u200bវិទ្យាល័យ ស្តែន\u200bហ្វ័ឌ (Stanford University School of Medicine) បាន\u200bប្រកាស\u200bពី\u200bការបង្កើត\u200bឧបករណ៍\u200bធ្វើ\u200bរោគ\u200bវិនិច្ឆ័យ\u200bថ្មី\u200bមួយ ដែល\u200bអាច\u200bតម្រៀប\u200bកោសិកា\u200bតាម\u200bប្រភេទ៖ បន្ទះ\u200bឈីប\u200bដែល\u200bអាច\u200bព្រីន\u200bបាន\u200bដ៏\u200bតូច\u200bមួយ ដែល\u200bអាច\u200bផលិត\u200bដោយ\u200bប្រើ\u200bម៉ាស៊ីន\u200bព្រីន\u200bប្រភេទinkjectស្តង់ដារ ដែល\u200bមាន\u200bតម្លៃ\u200bប្រហែល\u200bមួយ\u200bសេន\u200bសហរដ្ឋអាមេរិក។',
69 | 'kor': '스탠포드 의과대학 연구진은 지난 월요일 세포를 유형별로 분류할 수 있는 새로운 진단도구를 개발했다고 밝혔다. 이는 아주 작은 크기의 인쇄가 가능한 칩으로, 일반적인 잉크젯 프린터를 이용해 개 당 미화 약 1센트로 생산이 가능할 것으로 예상된다.',
70 | 'kir': 'Стэнфорд жогорку окуу жайынын Медицина мектебинин кызматкерлери дүйшөмбүдө клеткаларды параметрлери боюнча иреттей турган жаңы диагностикалык ыкманы ойлоп табышканын айтышты: ал болгону бир АКШ центине даярдала турган чачма принтерлер менен басылып чыгуучу чакан чип.',
71 | 'lao': 'ໃນວັນຈັນນັກວິທະຍາສາດຈາກໂຮງຮຽນແພດສາດມະຫາວິທະຍາໄລສະແຕນຝອດໄດ້ເປີດເຜີຍເຖິງການປະດິດເຄື່ອງມືກວດພະຍາດແບບໃໝ່ທີ່ສາມາດຈັດແບ່ງຈຸລັງຕາມປະເພດ: ແຜ່ນຊິບຂະໜາດນ້ອຍໆທີ່ສາມາດຜະລິດໄດ້ໂດໃຊ້ເຄື່ອງພິມໝຶກມາດຕະຖານເຊິ່ງລາຄາອາດຈະປະມານໜຶ່ງເຊັນສະຫະລັດຕໍ່ອັນ.',
72 | 'lav': 'Stenforda Universitātes Medicīnas skolas zinātnieki pirmdien paziņoja, ka izgudrojuši jaunu diagnostikas rīku, ar kuru var šūnas kārtot pēc to veidiem: tas ir sīks drukājams čips, kuru var izgatavot, izmantojot standarta strūklprinteri, un kura iespējamās izmaksas ir aptuveni viens ASV\xa0cents gabalā.',
73 | 'lin': 'Na mokolo ya yambo, bato ya siansi ya universite ya kimonganga ya Stanford balobaki basali esaleli ya sika oyo ekoki kosalisa na kokabola baselile na kolanda lolenge na yango: mwa eloko moko oyo bakoki kosala na lisalisi ya ba imprimantes à jet ya encre po ezala na talo ya cent moko ya Etats-Unis.',
74 | 'lit': 'Pirmadienį Stanfordo universiteto medicinos mokyklos mokslininkai paskelbė apie naujo diagnostikos įrankio, galinčio rūšiuoti ląsteles pagal tipą, išradimą: mažas spausdintinis lustas, kurį galima gaminti naudojant standartinius rašalinius spausdintuvus. Vieno kaina maždaug vienas JAV dolerio centas.',
75 | 'luo': "Chieng' Wuoktich, josayans mawuok e Mbalariany mar Stanford e Skul mar Thieth nolando ni negifwenyo gimanyien mitiyogo e nono tuoche ma nyalo pogo ng'injo mag del kaluwore kod kitgi: en chip moro matin ma inyalo go chapa gi printa ma bende inyalo losi kitiyo kod printa mapile mag inkjet kwom manyalo romo otonglo achiel mar Amerka e moro ka moro.",
76 | 'ltz': "E Méindeg hu Wëssenschaftler vun der Stanford University School of Medicine d'Erfindung vun engem neien Diagnosgeschier ugekënnegt, dat Zellen no Typ sortéiere kann: e klengen dréckbaren Chip, dee mat Hëllef vun engem Standardtëntestraldrécker fir méiglecherweis ongeféier een US-Cent d'Stéck hiergestallt ka ginn.",
77 | 'mkd': 'Во понеделникот научниците од медицинскиот факултет на Универзитетот Стенфорд објавија дека измислиле нова алатка за дијагностицирање со којашто се подредуваат клетките според типови: се работи за мал чип за печатење кој може да се произведе со помош на стандардни инк-џет печатачи, и тоа по цена од околу еден американски цент.',
78 | 'msa': 'Pada hari Isnin, Saintis daripada Sekolah Perubatan Universiti Stamford mengumumkan penemuan alat diagnostik baru yang boleh mengasingkan sel-sel mengikut jenis: cip kecil yang boleh dicetak yang boleh dihasilakn menggunakan pencetak standard inkjet untuk kira-kira satu sen A.S setiap satu.',
79 | 'mal': 'തിങ്കളാഴ്ച്ച, സ്റ്റാൻഫോർഡ് യൂണിവേഴ്\u200cസിറ്റി സ്\u200cകൂൾ ഓഫ് മെഡിസിനിലെ ശാസ്ത്രജ്ഞന്മാർ കോശങ്ങളെ അവയുടെ ഇനം അനുസരിച്ച് തരംതിരിക്കാൻ കഴിയുന്ന ഒരു പുതിയ രോഗനിർണയ ഉപകരണം കണ്ടുപിടിച്ചതായി പ്രഖ്യാപിച്ചു: സ്റ്റാൻഡേർഡ് ഇങ്ക്\u200cജെറ്റ് പ്രിന്റ്ററുകൾ ഉപയോഗിച്ച് നിർമ്മിക്കാൻ സാധിക്കുന്ന ഏകദേശം ഒരു യു.എസ് സെന്റ് ഓരോന്നിനും വേണ്ടിവരുന്ന പ്രിന്റ് ചെയ്യാൻ കഴിയുന്ന ഒരു ചെറിയ ചിപ്പ്.',
80 | 'mlt': 'Nhar it-Tnejn, xjentisti mill-Iskola tal-Mediċina tal-Università ta’ Stanford ħabbru l-invenzjoni ta’ għodda dijanjostika ġdida li tista’ tirranġa ċ-ċelloli skont it-tip: ċippa ċkejkna li tista’ tiġi stampata li tista’ tiġi manifatturata bl-użu ta’ inkjet printers standard għal possibilment madwar ċenteżmu tal-Istati Uniti kull waħda.',
81 | 'mri': 'I te Mane, i kī ake ngā kaipūtaiao nō Stanford University School of Medicine mō te hanganga o tētahi taputapu whakatau e āhei ai te wewete i ngā pūtau ki ana momo: mō te 1 hēneti U.S pea, he rehu-mihini tā iti nei e taea ana te hanga mā ngā mihini tā inkjet noa.',
82 | 'mar': 'स्टॅनफोर्ड युनिव्हर्सिटी स्कूल ऑफ मेडिसीनमधील शास्त्रज्ञांनी पेशींचे प्रकारानुसार विभाजन करणाऱ्या निदान साधनाचा शोध लावल्याचे सोमवारी जाहीर केलेः सामान्य इंकजेट प्रिंटरचा वापर करून एका लहान प्रिंट करता येणाऱ्या चीपचे उत्पादन अंदाजे एक U.S. प्रति सेंट खर्चून केले जाऊ शकते.',
83 | 'mon': 'Даваа гарагт Стэнфордын Их Сургуулийн Анагаахын сургуулийн эрдэмтэд эсийг төрлөөр нь эрэмбэлж чаддаг шинэ оношилгооны багаж бүтээснээ зарлалаа: уг багаж нь тухай бүрийг нь нэг цент ам.долларын өртгөөр стандарт бэхэн хэвлэгч ашиглан үйлдвэрлэж болдог жижигхэн хэвлэж болдог чип юм.',
84 | 'npi': 'सोमबारका दिन, स्ट्यानफोर्ड युनिभर्सिटी स्कुल अफ मेडिसिनका वैज्ञानिकहरूले एक नयाँ डायग्नोस्टिक उपकरणको आविष्कारको घोषणा गरे जसले कोषहरूलाई प्रकारका आधारमा क्रमबद्ध गर्न सक्दछः एउटा सानो प्रिन्ट गर्न सकिने चिप जुन मानक ईंकजेट प्रिन्टरहरू प्रयोग गरेर सम्भवतः लगभग एक अमेरिकी सेन्टको लागतमा निर्माण गर्न सकिन्छ।',
85 | 'nso': 'Ka Mošupulogo, boramahlale ba go tšwa Sekolong sa Yunibesithi ya Stanford sa Medicine ba tsebišitše ka go dirwa ga sedirišwa se seswa sa tekolo seo se ka beakanyago disele ka mehuta: chip yeo e gatišegago ye nnyane yeo e ka dirwago go šomišwa di printer tša inkjet tša sente ye tee go U.S.',
86 | 'nob': 'Forskere fra Stanford University School of Medicine gjorde kunngjørelsen på mandag om oppfinnelsen av et nytt diagnoseverktøy som sorterer celler etter type: en ørliten utskrivbar chip kan produseres ved hjelp av standard inkjettskrivere for muligens ca. én amerikansk cent hver.',
87 | 'nya': "Tsiku Lolemba, asayansi ochokera ku Stanford University School of Medicine analengeza zakupangidwa kwa chida chatsopano chomwe chitha kusiyanitsa ma cell ndi mtundu: kachipangizo kakang'ono kosindikizidwaka kakhoza kupangidwa pogwiritsa ntchito ma inkjet osindikizika apamwamba pafupifupi cent imodzi ya U.S iliyonse.",
88 | 'oci': "Diluns, de scientifics de l'escòla de medecina de l'Universitat Stanford anoncièron l'invencion d'un novèl esplech de diagnostic que pòt triar las cellulas per tipe: una nièra minuscula imprimibla que pòt èsser manufacturada en utilizar d'estampadoiras de get de tencha per un còst possiblament a l'entorn d'un cent american caduna.",
89 | 'ory': 'ସୋମବାର ଦିନ, ଷ୍ଟାନଫୋର୍ଡ ୟୁନିଭରସିଟି ସ୍କୁଲ୍ ଅଫ୍ ମେଡିସିନ୍ ର ବୈଜ୍ଞାନିକମାନେ ଏକ ନୂତନ ନିଦାନ ଉପକରଣର ଉଦ୍ଭାବନ ବିଷୟରେ ଘୋଷଣା କରିଛନ୍ତି ଯାହାର କୋଷଗୁଡିକ ପ୍ରକାର ଅନୁଯାୟୀ କ୍ରମବଦ୍ଧ କରାଯାଇପାରିବ: ଏକ କ୍ଷୁଦ୍ର ମୁଦ୍ରଣଯୋଗ୍ୟ ଚିପ୍ ଯାହାକି ସମ୍ଭବତଃ ପାଖାପାଖି 1 ପ୍ରତି ଆମେରିକୀୟ ସେଣ୍ଟରେ ଷ୍ଟାଣ୍ଡାର୍ଡ ଇଙ୍କଜେଟ୍ ମୁଦ୍ରକ ବ୍ୟବହାର କରି ଉତ୍ପାଦିତ ହୋଇପାରିବ।',
90 | 'orm': 'Wixata, saayintistootni yuunivasitii Stanfordii kan kutaa barnoota fayyaa meeshaa haaraa yaaliif ta’uu kan seelii gosa gosaan adda addaa baasu beeksisan: waan xiqqaa maxxanfamuu danda’uu kan maxxansituu inkjeeti fayyadamuun maxxaansu kan tokkon tokko isaanii saantimaa U.S tokko ta’uu malaa.',
91 | 'pus': 'د Stanford پوهنتون د طب د ځانګی ساینسپوهانو د دوه شنبی په ورځ د یوې داسې تشخیصی آلی د ایجاد اعلان وکړ چې کولای شي حجرات د نوعی له مخی وويشي ، یو کوچنی چپ چې چاپ کول یې په معیاری Inkject Printer سره په د یوه امریکایې ډالر د یو سنټ په بیه ممکن دي',
92 | 'fas': 'روز دوشنبه، دانشمندان دانشکده پزشکی دانشگاه استنفورد از اختراع دستگاه تشخیصی جدیدی سخن گفتند که می\u200cتواند سلول\u200cها را براساس نوعشان مرتب\u200cسازی کند: نوعی تراشه قابل چاپ که بااستفاده از چاپگرهای استاندارد جوهرافشان قابل تولید است و هزینه هرکدام از آنها احتمالاً حدود یک سنت آمریکا در می\u200cآید.',
93 | 'pol': 'W poniedziałek naukowcy ze Szkoły Medycznej Uniwersytetu Stanforda obwieścili wynalezienie nowego narzędzia diagnostycznego, sortującego komórki według rodzaju: jest to miniaturowy chip drukowany, który można wyprodukować za pomocą standardowych drukarek atramentowych przy koszcie szacowanym na jednego centa amerykańskiego za sztukę.',
94 | 'por': 'Na segunda-feira, cientistas da Escola de Medicina da Universidade de Stanford anunciaram a invenção de uma nova ferramenta de diagnóstico que pode classificar células por tipo: um minúsculo chip imprimível que pode ser fabricado usando impressoras jato de tinta padrão por possivelmente cerca de um centavo de dólar cada.',
95 | 'pan': 'ਸੋਮਵਾਰ ਨੂੰ, ਸਟੈਨਫੋਰਡ ਯੂਨੀਵਰਸਿਟੀ ਸਕੂਲ ਆਫ਼ ਮੈਡੀਸਿਨ ਦੇ ਵਿਗਿਆਨੀਆਂ ਨੇ ਇੱਕ ਨਵੇਂ ਡਾਇਗਨੋਸਟਿਕ ਟੂਲ ਦੇ ਅਵਿਸ਼ਕਾਰ ਬਾਰੇ ਘੋਸ਼ਣਾ ਕੀਤੀ ਜੋ ਕਿਸਮ ਮੁਤਾਬਕ ਸੈੱਲਾਂ ਦੀ ਛਾਂਟੀ ਕਰ ਸਕਦਾ ਹੈ: ਇੱਕ ਬਹੁਤ ਛੋਟੀ ਪ੍ਰਿੰਟ ਕਰਨਯੋਗ ਚਿੱਪ ਜਿਸਦਾ ਨਿਰਮਾਣ ਸਟੈਂਡਰਡ ਇੰਕਜੈੱਟ ਪ੍ਰਿੰਟਰਾਂ ਦੀ ਵਰਤੋਂ ਕਰਕੇ ਸੰਭਾਵੀ ਤੌਰ ‘ਤੇ ਲਗਭਗ ਇੱਕ ਯੂ.ਐਸ. ਸੈਂਟ ਦੇ ਖ਼ਰਚੇ ਵਿੱਚ ਕੀਤਾ ਜਾ ਸਕਦਾ ਹੈ।',
96 | 'ron': 'Luni, oameni de știință de la Facultatea de Medicină a Universității Stanford au anunțat inventarea unui nou instrument de diagnosticare, care poate sorta celulele în funcție de tipul lor: un cip minuscul, printabil, care poate fi produs folosind imprimante obișnuite cu jet de cerneală, pentru aproximativ un cent american bucata.',
97 | 'rus': 'В понедельник ученые из Медицинской школы Стэнфордского университета объявили об изобретении нового диагностического инструмента, который может сортировать клетки по их типу; это маленький чип, который можно напечатать, используя стандартный струйный принтер примерно за 1 цент США.',
98 | 'srp': 'У понедељак су научници са медицинског факултета Универзитета Станфорд представили изум новог алата за дијагностику који може да поређа ћелије по типу: малени чип који се може штампати и који може да се произведе коришћењем стандардних инкџет штампача, и да кошта отприлике један амерички цент по комаду.',
99 | 'sna': 'NeMuvhuru, mascientists vekuChikoro cheMishonga che yunivhesiti yekuStanford vakazivisa kugadzirwa kwe muchina mutsva wekunzvera uyo unogona kuronga macell nerudzi arwo: chi chip chinotsikiswa chinogadzirwa uchishandisa zvitsikiso zveinkhet awo anongangoita sendi rimwe chete remadhora.',
100 | 'snd': 'سومر جي ڏينهن، اسٽينفورڊ يونيورسٽي اسڪول آف ميڊيسن جي سائنسدانن هڪ نئون تشخيصي ٽول ايجاد ڪرڻ جو اعلان ڪيو جيڪو جيو گهرڙن جي قسمن جي لحاظ سان ترتيب ڏيئي سگهي ٿو: هڪ ننڍي ڇپائيءِ جوڳي چپ جيڪو معياري انڪ جيٽ پرنٽرز جو استعمال ڪندي هر هڪ لاءِ لڳ ڀڳ هڪ U.S. سينٽ ۾ تيار ڪري سگهجي ٿي.',
101 | 'slk': 'V pondelok vedci z lekárskej fakulty Stanfordskej univerzity informovali o objavení nového diagnostického prístroja, ktorý je schopný triediť bunky podľa druhu: malý čip, ktorý sa dá vytlačiť a je možné vyrobiť ho použitím štandardných atramentových tlačiarní, z ktorých každý môže mať hodnotu zhruba jeden cent.',
102 | 'slv': 'Znanstveniki Medicinske fakultete Univerze v Stanfordu so v ponedeljek razglasili iznajdbo novega diagnostičnega orodja, ki je sposobno razvrščanja celic glede na vrsto: drobceno tiskano vezje, ki ga je mogoče izdelati z uporabo standardnih brizgalnih tiskalnikov za okoli en ameriški cent na kos.',
103 | 'som': 'Maalinta Isniinta, saynis yahano ka socda Jamacada Stanford ee Iskuul Caafimadka waxay ku dhawaaqeen sameynta qalab daweyn cusub lagu habeyn karo unugyada : kuwa yar yar ee lagu soo saari karo muditan waxaana macquul ah in lagu soo saro Mareykan ka mid walba.',
104 | 'ckb': 'لە دوو شەمەدا زانایانی سکوڵی پزیشكی لە زانکۆی ستانفۆرد رایانگەیاند کە داهێنانیان کردووە بۆ ئامرازێکی نوێی پشکنین کە دەتوانێت خانەکان پۆلێن بکات بەپێی جۆرەکان: ئامێرەکە پارچەیەکی ئەلیکترۆنیی بچوکە کە دەکرێت بە چاپکردن دروست بکرێت بە بەکارهێنانی چاپکەری مەرەکەب فڕێدەر بە تەنها یەك سەنتی ئەمریکی.',
105 | 'spa': 'El lunes, los científicos de la facultad de medicina de la Universidad de Stanford anunciaron el invento de una nueva herramienta de diagnóstico que puede catalogar las células según su tipo: un pequeñísimo chip que se puede imprimir y fabricar con impresoras de inyección de uso corriente, por un posible costo de, aproximadamente, un centavo de dólar por cada uno.',
106 | 'swh': 'Mnamo Jumatatu, wanasayansi kutoka Shule ya Tiba ya Chuo Kikuu cha Stanford walitangaza uvumbuzi wa kifaa kipya cha utambuzi ambacho kinaweza kupanga seli kwa aina: kidude kidogo kinachoweza kuchapwa, na ambacho kinaweza kutengenezwa kwa kutumia printa ya kawaida ya kupuliza rangi, yawezekana kwa takribani senti moja ya Marekani kwa kila moja.',
107 | 'swe': 'I måndags meddelade forskare från Stanford University School of Medicine att man tagit fram ett nytt diagnostiskt verktyg som kan ordna celler efter typ: ett litet utskrivbart chip som kan tillverkas med vanliga bläckstråleskrivare för eventuellt ungefär en amerikansk cent vardera.',
108 | 'tgk': 'Олимони мактаби тибби донишгоҳи Стенфорд рӯзи душанбе дар бораи ихтирои асбоби нави ташхисие, ки метавонад ҳуҷайраҳоро аз рӯи навъ мураттаб кунад, эълон карданд: чипи хурди чопшавандае, ки бо истифода аз принтерҳои фавракдами муқаррарӣ истеҳсол карда мешавад ва эҳтимолан ҳар як кадомаш тақрибан 1 сенти ИМА арзиш дорад.',
109 | 'tam': 'திங்களன்று, ஸ்டான்போர்ட் யுனிவர்சிட்டி ஸ்கூல் ஆஃப் மெடிசின் விஞ்ஞானிகள் ஒரு புதிய நோயறிதல் கருவியின் கண்டுபிடிப்பை அறிவித்தனர், இது செல்களை வகைப்படி வரிசைப்படுத்தலாம்: ஒரு சிறிய அச்சிடக்கூடிய சில்லு, தரமான இன்க்ஜெட் அச்சுப்பொறிகளைப் பயன்படுத்தி ஒவ்வொன்றும் 1 யு.எஸ்.',
110 | 'tel': 'సోమవారం, స్టాన్ఫోర్డ్ యూనివర్శిటీ స్కూల్ ఆఫ్ మెడిసిన్ శాస్త్రవేత్తలు కణాల రకాన్ని క్రమబద్ధీకరించగల కొత్త రోగనిర్ధారణ సాధనం యొక్క ఆవిష్కరణను ప్రకటించారు: ప్రామాణిక ఇంక్జెట్ ప్రింటర్లను ఉపయోగించి 1 యు.ఎస్.',
111 | 'tha': 'เมื่อวันจันทร์ที่ผ่านมา นักวิทยาศาสตร์จากโรงเรียนการแพทย์แห่งมหาวิทยาลัยสแตนฟอร์ดได้ประกาศถึงการประดิษฐ์เครื่องมือวินิจฉัยใหม่ที่สามารถจัดเรียงเซลล์ตามประเภทได้ ซึ่งก็คือชิปขนาดเล็กซึ่งสามารถผลิตได้โดยใช้เครื่องพิมพ์อิงค์เจ็ตแบบมาตรฐานในราคาประมาณชิ้นละหนึ่งเซ็นต์ดอลลาร์สหรัฐ',
112 | 'tur': 'Stanford Üniversitesi Tıp Fakültesi’nde görev alan bilim insanları, pazartesi günü hücreleri cinslerine göre sıraya koyabilen yeni bir teşhis aracının bulunduğunu duyurdu. Bu araç, standart mürekkep püskürtmeli yazıcılar kullanılarak her biri tahminen yaklaşık bir sente imal edilebilen küçük bir yazdırılabilir çipti.',
113 | 'ukr': 'У понеділок, науковці зі Школи медицини Стенфордського університету оголосили про винайдення нового діагностичного інструменту, що може сортувати клітини за їх видами: це малесенький друкований чіп, який можна виготовити за допомогою стандартних променевих принтерів десь по одному центу США за штуку.',
114 | 'umb': 'K’eteke lyatete lyosemana, ulongisi wosikola wesaku isangiwa vo Stanford vakonomwisa etungo lyocina cimwe cikwatisa okusanga oselula kwenda itava okuyitanga: okacina kaco katito calwa kwenda citava okuyupa l’okuyipanga l’onjanga yalwa kwenda yisukilañgo ovava l’osentimu Amerikanu imosi.',
115 | 'urd': 'پیر کے روز، سٹینفورڈ اسکول آف میڈیسن کے سائنسدانوں نے ایک جدید تشخیصی آلہ دریافت کرنے کا اعلان کیا ہے جو خلیوں کو اس کی اقسام کے لحاظ سے ترتیب دے سکتا ہے: یہ ایک چھوٹی سی پرنٹیبل چپ ہے جو غالباً ایک امریکی سنٹ میں معیاری انک جیٹ پرنٹرز کا استعمال کر کے تیار کی جا سکتی ہے-',
116 | 'uzb': "Dushanba kuni Stenford Universitetining Tibbiyot maktabi olimlari hujayralarni turlariga qarab saralay oladigan yangi tashxis vositasi ixtirosini e'lon qildi: har biri taxminan bir AQSH senti atrofida bo'lgan standart rangli printerlardan foydalangan holda ishlab chiqarish mumkin bo'lgan ingichka bosma chip.",
117 | 'vie': 'Vào hôm thứ Hai, các nhà khoa học thuộc Trường Y Đại học Stanford đã công bố phát minh một dụng cụ chẩn đoán mới có thể phân loại tế bào: một con chíp nhỏ có thể sản xuất bằng máy in phun tiêu chuẩn với giá khoảng một xu Mỹ mỗi chiếc.',
118 | 'cym': "Ar ddydd Llun, datganodd gwyddonwyr o Ysgol Feddygaeth Stanford eu bod wedi dyfeisio teclyn diagnostig newydd sy'n gallu didoli celloedd yn ôl math: sglodyn bach argraffadwy y gellir ei gynhyrchu gan ddefnyddio chwistrell-argraffwyr safonol am tua un sent yr U.D. yr un.",
119 | 'wol': 'Ci Altine gi, xeltukat yi nekk Daara ju kawe ji di ajju ci wàllu Paj bu Stanford yëgle nañu seetub jumtukayu jagnostik bu bées bu mëna xaajale xeeti selul yi: pus bu ndaw bu nu imprme bu nu mëna defar jëfandikoo daa bu imprimant yi ngir lu toll ci benn dërëmu amerig bu nekk.',
120 | 'xho': 'NgoMvulo, iinzululwazi ezivela eStandford University School of Medicine zazise ngesixhobo sokuxilonga esitsha esinoku sota iiseli ngohlobo: itship encinci eprintwayo enokwenziwa ngokusebenzisa iiprinta ze inki yaye zinokuxabisa malunga nesenti enye yase U.S..',
121 | 'yor': 'Àwọn onímọ̀ ìjìnlẹ̀ sáyẹ̀nsì láti ilé ìkẹ́ẹ̀kọ́ gíga ti ìsègun Stanford lọ́jọ́ ajé ti kéde ìdásílẹ̀ irinṣẹ́ ìwádìí tuntun tí ó le tó nǹkan lẹ́sẹẹsẹ pẹ̀lú bí wọ́n bá se rí: irinṣẹ́ kéreké tí a lè tẹ̀ jáde pẹ̀lú lílo irinṣẹ́ ìtẹ̀wé ìgbàlódé pẹ̀lú owó ẹyọ fún ọ̀kọ̀ọ̀kan.',
122 | 'zul': 'NgoMsombuluko, usosayensi wase-Stanford University School of Medicine umemezele ithuluzi Elisha elikwazi ukuhlukanisa amagqamuzane ngezinhlobo zawo: ucezwana oluncane olunyathelisekayo olungakhandwa ngokusebenzisa imishini evamile yokunyathelisa ngokusesilinganisweni ngesenti elilodwa laseMelika.',
123 | }
124 |
--------------------------------------------------------------------------------