├── .gitignore ├── README.md ├── example.py ├── requirements.txt ├── setup.py └── src ├── RuTranscript.py ├── __init__.py ├── data ├── alphabet.txt ├── error_words_stresses_default.txt ├── irregular_exceptions.xlsx ├── paired_consonants.txt └── sorted_allophones.txt ├── tests ├── test_consonants.py ├── test_modules.py ├── test_phrases.py └── test_vowels.py └── tools ├── __init__.py ├── allophones_tools.py ├── main_tools.py ├── sounds.py ├── stress_tools.py └── syntax_tree.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.egg-info 2 | __pycache__/ 3 | .vscode/ 4 | .idea 5 | 6 | dist/ 7 | build/ 8 | temp/ 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RuTranscript 2 | 3 | This package produces phonetic transcriptions of Russian text. 4 | The library is based on the literary norm of phonetic transcription for the Russian language and uses symbols 5 | of the International Phonetic Alphabet. The transcription distinguishes allophones. 6 | The resulting library can be used in automatic speech recognition and synthesis tasks. 7 | 8 | At the moment, the framework does not divide words into syllables, because syllable division is variable. 9 | Therefore, allophones that depend on the position in the syllable 10 | (for example, *j* at the beginning of a syllable becomes *ʝ*) are marked only in cases where the beginning of 11 | the syllable coincides with the beginning of the word or the end of the syllable coincides with the end of the word. 
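The word-boundary restriction described above can be illustrated with a minimal sketch. It is not the library's own `first_jot` implementation: the function name `mark_word_initial_jot` and the simplified phoneme list are assumptions made for illustration. The only detail taken from the framework is that words inside a phoneme list are separated by `'_'`.

```python
# Hypothetical sketch: NOT the library's `first_jot`; the name and the sample data are illustrative.
def mark_word_initial_jot(phonemes, boundary='_'):
    """Replace 'j' with 'ʝ' only where the syllable start is known, i.e. at the start of a word."""
    result = []
    at_word_start = True
    for sound in phonemes:
        if sound == boundary:
            # A word boundary: the next sound begins a new word (and therefore a new syllable).
            result.append(sound)
            at_word_start = True
            continue
        result.append('ʝ' if at_word_start and sound == 'j' else sound)
        at_word_start = False
    return result


# Simplified phonemes for two words: the first starts with 'j', the second has 'j' mid-word.
sample = ['j', 'a', '_', 'm', 'o', 'j', 'a']
print(mark_word_initial_jot(sample))
```

The second *j* in the sample also opens a syllable, but without syllable division the sketch, like the framework, leaves it unchanged.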
12 | 13 | For a more detailed description of how the framework works, see the article: https://www.dialog-21.ru/media/5722/badasyana137.pdf 14 | 15 | # Installation 16 | 17 | ``` 18 | pip install git+https://github.com/suralmasha/RuTranscript 19 | pip install -r requirements.txt 20 | ``` 21 | 22 | # Usage 23 | 24 | Put your text in the appropriate variable (in the example - `text`). 25 | Pass it to `RuTranscript()` and call the `transcribe()` method. 26 | 27 | ``` 28 | from ru_transcript import RuTranscript 29 | 30 | text = 'Как получить транскрипцию?' 31 | ru_transcript = RuTranscript(text) 32 | ru_transcript.transcribe() 33 | ``` 34 | 35 | You may define stresses for a single word or for all words in the text. 36 | To do this, put a stress symbol (preferably '+') before or after the stressed vowel 37 | and put the stressed text in an additional variable (in the example - `stressed_text_if_have`). 38 | To specify where you put the stress mark, use the parameter `stress_place` (possible values: `'after'`, the default, or `'before'`). 39 | 40 | **Important!** The number of words in these two texts must match. 41 | 42 | ``` 43 | text = 'Как получить транскрипцию?' 44 | stressed_text_if_have = 'Как получи+ть транскрипцию?' 45 | ru_transcript = RuTranscript(text, stressed_text_if_have) 46 | ru_transcript.transcribe() 47 | ``` 48 | 49 | or 50 | 51 | ``` 52 | text = 'Как получить транскрипцию?' 53 | stressed_text_if_have = 'Как получ+ить транскрипцию?' 54 | ru_transcript = RuTranscript(text, stressed_text_if_have, stress_place='before') 55 | ru_transcript.transcribe() 56 | ``` 57 | 58 | Pauses are arranged according to punctuation: the end of a sentence is indicated by a long pause (`'||'`), 59 | punctuation marks inside a sentence are indicated by short pauses (`'|'`). 60 | 61 | You can get a list of **allophones** by using the `get_allophones()` method. 
62 | 63 | ``` 64 | print(ru_transcript.get_allophones()) 65 | ``` 66 | 67 | Output: 68 | ``` 69 | ['k', 'a', 'k', 'p', 'ə', 'lʷ', 'ʊ', 't͡ɕ', 'i', 'tʲ', 't', 'r', 'ɐ', 'n', 's', 'k', 'rʲ', 'i', 'p', 't͡sˠ', 'ɨ', 'jᶣ', 'ᵿ'] 70 | ``` 71 | 72 | You can get a list of **phonemes (main allophones)** by using the `get_phonemes()` method - 73 | this is a less detailed kind of transcription. 74 | 75 | ``` 76 | print(ru_transcript.get_phonemes()) 77 | ``` 78 | 79 | Output: 80 | ``` 81 | ['k', 'a', 'k', 'p', 'o', 'l', 'u', 't͡ɕ', 'i', 'tʲ', 't', 'r', 'a', 'n', 's', 'k', 'rʲ', 'i', 'p', 't͡s', 'i', 'j', 'u'] 82 | ``` 83 | 84 | You can see **how stresses were placed** by using the `get_stressed_text()` method. 85 | 86 | ``` 87 | print(ru_transcript.get_stressed_text()) 88 | ``` 89 | 90 | Output: 91 | ``` 92 | 'ка+к получи+ть транскри+пцию' 93 | ``` 94 | 95 | You can also find an example of using the framework in `example.py`. 96 | -------------------------------------------------------------------------------- /example.py: -------------------------------------------------------------------------------- 1 | from src import RuTranscript 2 | 3 | if __name__ == "__main__": 4 | text = 'Как получить транскрипцию?' 
5 | ru_transcript = RuTranscript(text) 6 | ru_transcript.transcribe() 7 | 8 | print("{:<15} {:}".format('Original text:', text)) 9 | print("{:<15} {:}".format('Stressed text:', ru_transcript.get_stressed_text( 10 | stress_place='before', 11 | stress_symbol='+' 12 | ))) 13 | 14 | print('------------------------------------------------------') 15 | print('Transcription (allophones):') 16 | print(' '.join(ru_transcript.get_allophones())) 17 | print('Transcription (allophones) with spaces, pauses and stresses:') 18 | print(' '.join(ru_transcript.get_allophones( 19 | stress_place='before', 20 | save_stresses=True, 21 | save_spaces=True, 22 | save_pauses=True, 23 | stress_symbol='+' 24 | ))) 25 | 26 | print('------------------------------------------------------') 27 | print('Transcription (phonemes):') 28 | print(' '.join(ru_transcript.get_phonemes())) 29 | print('Transcription (phonemes) with spaces, pauses and stresses:') 30 | print(' '.join(ru_transcript.get_phonemes( 31 | stress_place='before', 32 | save_stresses=True, 33 | save_spaces=True, 34 | save_pauses=True, 35 | stress_symbol='+' 36 | ))) 37 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | spacy==3.4.4 2 | ru_core_news_sm @ https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.4.0/ru_core_news_sm-3.4.0-py3-none-any.whl 3 | numpy==1.23.3 4 | nltk==3.7 5 | epitran==1.24 6 | openpyxl==3.1.1 7 | -e git+https://github.com/sovaai/sova-tts-tps@v1.2.0#egg=tps 8 | -e git+https://github.com/Desklop/StressRNN#egg=stressrnn 9 | -e git+https://github.com/seriyps/ru_number_to_text#egg=num2t4ru 10 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # -*- coding: utf-8 -*- 3 | # Author: suralmasha - Badasyan Alexandra 
4 | 5 | import os 6 | from setuptools import setup, find_packages 7 | from shutil import copytree, copy, rmtree, ignore_patterns 8 | from os.path import join 9 | 10 | 11 | if __name__ == "__main__": 12 | # Package constants 13 | PACKAGE_NAME = 'ru_transcript' 14 | PACKAGE_VERSION = '1.0' 15 | PACKAGE_DESCRIPTION = 'Phonetic transcription in russian' 16 | PACKAGE_SOURCES_URL = 'https://github.com/suralmasha/RuTranscript' 17 | 18 | # Variables 19 | sources_dir = './src' 20 | temp_dir = 'temp' 21 | excluded_files = ignore_patterns('setup.py', '.git', 'dist', 'tests', 'example.py', 'jpt_example.ipynb') 22 | 23 | # Prepare temp folders 24 | rmtree(temp_dir, ignore_errors=True) 25 | copytree(sources_dir, join(temp_dir, 'ru_transcript'), copy_function=copy, ignore=excluded_files) 26 | 27 | # Read long_description 28 | with open('README.md', encoding='utf8') as f: 29 | long_description = f.read().splitlines() 30 | long_description = '\n'.join(long_description) 31 | 32 | # Read requirements from file, one requirement per line 33 | with open('requirements.txt', encoding='utf8') as f: 34 | install_requires = f.read().splitlines() 35 | 36 | # Prepare data files 37 | data_files = [join('data', '*.txt'), join('data', '*.xlsx')] 38 | 39 | # Classifiers 40 | classifiers = [ 41 | 'Natural Language :: Russian', 42 | 'Programming Language :: Python :: 3.8', 43 | 'Topic :: Text Processing :: Linguistic :: NLP' 44 | ] 45 | 46 | # Build package 47 | setup( 48 | name=PACKAGE_NAME, # package name 49 | version=PACKAGE_VERSION, # version 50 | description=PACKAGE_DESCRIPTION, # short description 51 | long_description=long_description, 52 | url=PACKAGE_SOURCES_URL, # package URL 53 | author='Badasyan Alexandra', 54 | author_email='sashabadasyan@icloud.com', 55 | classifiers=classifiers, 56 | keywords='nlp russian transcription phonetics linguistic', 57 | install_requires=install_requires, # list of packages this package depends on 58 | packages=find_packages(temp_dir), # return a list of str 
representing the packages it could find in source dir 59 | package_dir={'': temp_dir}, # set up sources dir 60 | package_data={'': data_files}, # append all external files to package 61 | include_package_data=True, 62 | zip_safe=False 63 | ) 64 | -------------------------------------------------------------------------------- /src/RuTranscript.py: -------------------------------------------------------------------------------- 1 | import warnings 2 | from os.path import join, dirname, abspath 3 | 4 | import spacy 5 | import epitran 6 | from openpyxl import load_workbook 7 | from nltk.stem.snowball import SnowballStemmer 8 | from tps import find, download 9 | from tps import modules as md 10 | 11 | from .tools.main_tools import get_punctuation_dict, text_norm_tok, find_clitics, extract_phrasal_words, \ 12 | apply_differences 13 | from .tools.stress_tools import put_stresses, remove_extra_stresses, replace_stress_before 14 | from .tools.allophones_tools import nasal_m_n, silent_r, voiced_ts, shch, long_ge, fix_jotised, long_consonants, \ 15 | vowels, labia_velar, stunning, assimilative_palatalization, first_jot 16 | from .tools.sounds import epi_starterpack, allophones 17 | from .tools.syntax_tree import SyntaxTree 18 | 19 | snowball = SnowballStemmer('russian') 20 | nlp = spacy.load('ru_core_news_sm', disable=["tagger", "morphologizer", "attribute_ruler"]) 21 | 22 | ROOT_DIR = dirname(abspath(__file__)) 23 | wb = load_workbook(join(ROOT_DIR, 'data/irregular_exceptions.xlsx')) 24 | sheet = wb.active 25 | irregular_exceptions = {sheet[f'A{i}'].value: sheet[f'B{i}'].value for i in range(2, sheet.max_row + 1)} 26 | irregular_exceptions_stems = {snowball.stem(ex): pron for ex, pron in irregular_exceptions.items()} 27 | 28 | epi = epitran.Epitran('rus-Cyrl') 29 | second_silent = 'стн стл здн рдн нтск ндск лвств'.split() 30 | first_silent = 'лнц дц вств'.split() 31 | hissing_rd = {'сш': 'шш', 'зш': 'шш', 'сж': 'жж', 'сч': 'щ'} 32 | non_ipa_symbols = {'t͡ɕʲ': 't͡ɕ', 'ʂʲː': 
'ʂ', 'ɕːʲ': 'ɕː', 'ʒ': 'ʐ', 'd͡ʐ': 'd͡ʒ'} 33 | 34 | try: 35 | yo_dict = find("yo.dict", raise_exception=True) 36 | except FileNotFoundError: 37 | yo_dict = download("yo.dict") 38 | 39 | try: 40 | e_dict = find("e.dict", raise_exception=True) 41 | except FileNotFoundError: 42 | e_dict = download("e.dict") 43 | 44 | e_replacer = md.Replacer([e_dict, "plane"]) 45 | yo_replacer = md.Replacer([yo_dict, "plane"]) 46 | syntax_tree = SyntaxTree() 47 | 48 | 49 | class RuTranscript: 50 | def __init__(self, text: str, stressed_text: str = None, stress_place: str = 'after', replacement_dict: dict = None, 51 | stress_accuracy_threshold: float = 0.86): 52 | """ 53 | Makes a phonetic transcription in russian using IPA. 54 | 55 | :param text: A text to transcribe. 56 | :param stressed_text: The same (!) text with stresses. 57 | You may define stresses both for one word and for all words in the text. 58 | To do this, put a stress symbol (preferably '+') before or after the stressed vowel. 59 | :param stress_place: 'after' - if the stress symbol is after the stressed vowel, 60 | 'before' - if the stress symbol is before the stressed vowel. 61 | :param replacement_dict: Custom dictionary for replacing words (for example, {'tts': 'синтез речи'}). 62 | :param stress_accuracy_threshold: A threshold for the accuracy of stress placement for StressRNN. 
63 | """ 64 | text, stressed_text = self._get_text_and_stressed_text(text, stressed_text, replacement_dict) 65 | self._pause_dict = get_punctuation_dict(text) 66 | self._tokens = text_norm_tok(text) 67 | self._sections_len = len(self._tokens) 68 | self._stressed_tokens = text_norm_tok(stressed_text) 69 | 70 | self._stress_accuracy_threshold = stress_accuracy_threshold 71 | self._stress_place = stress_place 72 | 73 | self._phrasal_words_indexes = [] 74 | self._letters_list = [] 75 | self._phonemes_list = [] 76 | self._allophones_list = [[]] * self._sections_len 77 | self._transliterated_tokens = [[]] * self._sections_len 78 | self._phrasal_words = [[]] * self._sections_len 79 | self._stressed_text = [[]] * self._sections_len 80 | 81 | def _get_text_and_stressed_text(self, text, stressed_text, replacement_dict): 82 | text = ' '.join(['—' if word == '-' else word for word in text.replace('\n', ' ').lower().split()]) 83 | stressed_text = ' '.join(['—' if word == '-' else word 84 | for word in stressed_text.replace('\n', ' ').lower().split()]) \ 85 | if stressed_text is not None else text 86 | 87 | if replacement_dict is not None: 88 | user_replacer = md.Replacer([replacement_dict, "plane"]) 89 | text = user_replacer(text) 90 | stressed_text = user_replacer(stressed_text) 91 | 92 | return text, stressed_text 93 | 94 | def _remove_dashes(self, section_num): 95 | section = self._tokens[section_num] 96 | a_section = self._stressed_tokens[section_num] 97 | self._tokens[section_num] = [token.replace('-', '') for token in section] 98 | self._stressed_tokens[section_num] = [token.replace('-', '') if token.count('+') == 1 99 | else remove_extra_stresses(token).replace('-', '') 100 | for token in a_section] 101 | 102 | def _tps(self, section_num): 103 | """ 104 | Makes replaces 'е - э' and 'е - ё' 105 | """ 106 | default_section = self._tokens[section_num] 107 | self._tokens[section_num] = [e_replacer(token.replace('+', '')) for token in self._tokens[section_num]] 108 | 
self._tokens[section_num] = [yo_replacer(token.replace('+', '')) for token in self._tokens[section_num]] 109 | 110 | if self._tokens[section_num] != [token.replace('+', '') for token in default_section]: 111 | self._stressed_tokens[section_num] = [ 112 | apply_differences([default_section[i], self._tokens[section_num][i]]) 113 | for i in range(len(default_section)) 114 | ] 115 | 116 | def _join_phonemes(self, transliterated_tokens, limit=10000): 117 | section_phonemes_list = [] 118 | joined_tokens = '_'.join(transliterated_tokens) 119 | joined_tokens = joined_tokens.replace('‑', '-') 120 | i = 0 121 | counter = 0 122 | default_len = len(joined_tokens) 123 | while i < default_len: 124 | if joined_tokens[i] not in ['+', '-']: 125 | n = 4 126 | if i != default_len - 1: 127 | while (joined_tokens[i: i + n] not in epi_starterpack + ['_', '|', '||', 'γ', 'ʐ']) and (n > 0): 128 | counter += 1 129 | if counter > limit: 130 | raise IndexError('Endless loop') 131 | n -= 1 132 | section_phonemes_list.append(joined_tokens[i: i + n]) 133 | elif joined_tokens[i] in epi_starterpack + ['||', 'γ']: 134 | section_phonemes_list.append(joined_tokens[i]) 135 | i += n 136 | else: 137 | section_phonemes_list.append(joined_tokens[i]) 138 | i += 1 139 | 140 | section_phonemes_list = [x for x in section_phonemes_list if x not in ['', 'ʲ']] 141 | 142 | n = 0 143 | for allophone_index in range(len(section_phonemes_list) - 1): 144 | allophone = section_phonemes_list[allophone_index + n] 145 | next_allophone = section_phonemes_list[allophone_index + n + 1] 146 | if (allophone == 't͡s' and next_allophone == 's') or (allophone == 'd͡ʒ' and next_allophone == 'ʐ'): 147 | del section_phonemes_list[allophone_index + n + 1] 148 | n -= 1 149 | 150 | # print(section_phonemes_list) 151 | return section_phonemes_list 152 | 153 | @staticmethod 154 | def add_prestressed_syllable_sign(section: list): 155 | section_result = section[:] 156 | n = 0 157 | for symb_i, symb in enumerate(section): 158 | if symb == 
'+': 159 | preavi = [phon_i for phon_i, phon in enumerate(section[:symb_i - 1]) if 160 | allophones[phon]['phon'] == 'V' and '_' not in section[phon_i + n:symb_i]] 161 | if preavi: 162 | section_result.insert(preavi[-1] + n + 1, '-') 163 | n += 1 164 | 165 | return section_result 166 | 167 | def _lpt_1(self, section_num): 168 | """ 169 | Letter-phoneme transformation by B.M. Lobanov. Part 1 - Irregular exceptions. 170 | """ 171 | for i, token in enumerate(self._tokens[section_num]): 172 | stem = snowball.stem(token) 173 | if stem in irregular_exceptions_stems: 174 | try: 175 | new_token = irregular_exceptions[token] 176 | except KeyError: 177 | ending = token[len(stem):] 178 | dif = - (len(token) - len(stem)) 179 | new_token = irregular_exceptions_stems[stem][:dif] + ending 180 | 181 | self._tokens[section_num][i] = new_token 182 | accent_index = self._stressed_tokens[section_num][i].index('+') 183 | self._stressed_tokens[section_num][i] = new_token[:accent_index] + '+' + new_token[accent_index:] 184 | 185 | def _lpt_2(self, section_num): 186 | """ 187 | Letter-phoneme transformation by B.M. Lobanov. Part 2 - Regular exceptions. 
188 | """ 189 | for i, token in enumerate(self._stressed_tokens[section_num]): 190 | # adjective endings 'ого его' 191 | if token != 'ого+' and (token.replace('+', '').startswith('какого') 192 | or token.replace('+', '').endswith('ого') 193 | or token.replace('+', '').endswith('его')): 194 | accent_index = token.index('+') 195 | token = token.replace('+', '').replace('ого', 'ово').replace('его', 'ево') 196 | self._stressed_tokens[section_num][i] = token[:accent_index] + '+' + token[accent_index:] 197 | 198 | # 'что' --> 'што' 199 | if 'что' in self._stressed_tokens[section_num][i]: 200 | self._stressed_tokens[section_num][i] = token.replace('что', 'што') 201 | 202 | # verb endings 'тся ться' 203 | if token not in {'заботься', 'отметься'}: 204 | if token[-3:] == 'тся': 205 | self._stressed_tokens[section_num][i] = token[:-3] + 'ца' 206 | elif token[-4:] == 'ться': 207 | self._stressed_tokens[section_num][i] = token[:-4] + 'ца' 208 | 209 | # noun endings 'ия ие ию' 210 | if (token[-2:] in {'ия', 'ие', 'ию'}) and (token[-3] not in {'ц', 'щ'}): 211 | if token[-3] not in {'ж', 'ш'}: 212 | self._stressed_tokens[section_num][i] = token[:-2] + 'ь' + token[-1] 213 | else: 214 | self._stressed_tokens[section_num][i] = token[:-2] + 'й' + token[-1] 215 | 216 | # unpronounceable consonants 217 | for sub in first_silent + second_silent: 218 | if sub in token: 219 | new_sub = sub.translate(str.maketrans('', '', 'ьъ')) 220 | self._stressed_tokens[section_num][i] = token.translate(str.maketrans(sub, new_sub)) 221 | 222 | # combinations with hissing consonants 223 | stem = snowball.stem(token) 224 | if ('зч' in token or 'тч' in token or 'дч' in token) and (stem[-3:] == 'чик' or stem[-3:] == 'чиц'): 225 | self._stressed_tokens[section_num][i] = token.replace('зч', 'щ').replace('тч', 'ч').replace('дч', 226 | 'ч') 227 | for key, value in hissing_rd.items(): 228 | if key in token: 229 | self._stressed_tokens[section_num][i] = token.replace(key, value) 230 | 231 | def _lpt_3(self, 
section_num): 232 | """ 233 | Letter-phoneme transformation by B.M. Lobanov. Part 3 - Transliteration 234 | """ 235 | self._transliterated_tokens[section_num] = [epi.transliterate(token).replace('6', '').replace('4', '') 236 | for token in self._stressed_tokens[section_num]] 237 | for i, token in enumerate(self._transliterated_tokens[section_num]): 238 | for key, value in non_ipa_symbols.items(): 239 | if key in token: 240 | token = token.replace(key, value) 241 | self._transliterated_tokens[section_num][i] = token 242 | 243 | def _lpt_4(self, section_num): 244 | """ 245 | Letter-phoneme transformation by B.M. Lobanov. Part 4 - Common Rules 246 | """ 247 | # fricative g 248 | for i, token in enumerate(self._transliterated_tokens[section_num]): 249 | try: 250 | next_token = self._transliterated_tokens[section_num][i + 1] 251 | except IndexError: 252 | next_token = ' ' 253 | 254 | token_let = self._tokens[section_num][i] 255 | nlp_token = nlp(token_let)[0] 256 | lemma = nlp_token.lemma_ 257 | 258 | if lemma in {'ага', 'ого', 'угу', 'господь', 'господи', 'бог'}: 259 | self._transliterated_tokens[section_num][i] = token.replace('ɡ', 'γ', 1) 260 | elif token_let in {'ах', 'эх', 'ох', 'ух'}: 261 | next_token_allophone = allophones.get(next_token[0], {}) 262 | if next_token_allophone.get('voice', '') == 'voiced': 263 | self._transliterated_tokens[section_num][i] = token.replace('x', 'γ', 1) 264 | 265 | # ---- Join phonemes ---- 266 | joined_phonemes = self._join_phonemes(self._transliterated_tokens[section_num], limit=10000) 267 | self._phonemes_list.append(joined_phonemes) 268 | 269 | # ---- Join letters ---- 270 | joined_letters = list('_'.join(self._stressed_tokens[section_num])) 271 | self._letters_list.append(joined_letters) 272 | 273 | # ---- Continue LPC-4. 
Common rules ---- 274 | self._phonemes_list[section_num] = fix_jotised(self._phonemes_list[section_num], 275 | self._letters_list[section_num]) 276 | self._phonemes_list[section_num] = shch(self._phonemes_list[section_num]) 277 | self._phonemes_list[section_num] = long_ge(self._phonemes_list[section_num]) 278 | self._phonemes_list[section_num] = assimilative_palatalization(self._tokens[section_num], 279 | self._phonemes_list[section_num]) 280 | self._phonemes_list[section_num] = long_consonants(self._phonemes_list[section_num]) 281 | self._phonemes_list[section_num] = stunning(self._phonemes_list[section_num]) 282 | 283 | def transcribe(self): 284 | for section_num in range(self._sections_len): 285 | self._tps(section_num) 286 | # ---- Accenting ---- 287 | self._stressed_tokens[section_num] = put_stresses( 288 | tokens_list=self._stressed_tokens[section_num], 289 | stress_place=self._stress_place, 290 | stress_accuracy_threshold=self._stress_accuracy_threshold) 291 | self._stressed_text[section_num] = self._stressed_tokens[section_num] 292 | # ---- Removing dashes ---- 293 | self._remove_dashes(section_num) 294 | # ---- Phrasal words extraction ---- 295 | dep = syntax_tree.make_dependency_tree(' '.join(self._tokens[section_num])) 296 | self._phrasal_words_indexes.append(find_clitics(dep, self._tokens[section_num])) 297 | # ---- Letter-phoneme transformation ---- 298 | self._lpt_1(section_num) 299 | self._lpt_2(section_num) 300 | self._lpt_3(section_num) 301 | self._lpt_4(section_num) 302 | # ---- Allophones - consonants ---- 303 | self._allophones_list[section_num] = first_jot(self._phonemes_list[section_num]) 304 | self._allophones_list[section_num] = nasal_m_n(self._allophones_list[section_num]) 305 | self._allophones_list[section_num] = silent_r(self._allophones_list[section_num]) 306 | self._allophones_list[section_num] = voiced_ts(self._allophones_list[section_num]) 307 | # ---- Extract phrasal words ---- 308 | self._phrasal_words[section_num] = 
extract_phrasal_words(self._allophones_list[section_num], 309 | self._phrasal_words_indexes[section_num]) 310 | # ---- Allophones - vowels ---- 311 | self._phrasal_words[section_num] = self.add_prestressed_syllable_sign(self._phrasal_words[section_num]) 312 | self._allophones_list[section_num] = vowels(self._phrasal_words[section_num]) 313 | self._allophones_list[section_num] = labia_velar(self._allophones_list[section_num]) 314 | 315 | def _insert_pauses(self, sounds_list: list): 316 | for i, key in enumerate(self._pause_dict): 317 | sounds_list.insert(i + key, self._pause_dict[key]) 318 | 319 | def _get_escape_symbols(self, save_stresses: bool = False, save_spaces: bool = False): 320 | escape_symbols = ['+', '-', '_'] 321 | if save_stresses: 322 | escape_symbols.remove('+') 323 | if save_spaces: 324 | escape_symbols.remove('_') 325 | 326 | return escape_symbols 327 | 328 | def _join_sounds(self, escape_symbols: list, sounds_list: list): 329 | return ' '.join( 330 | [' '.join([x for x in section if x not in escape_symbols]) 331 | if section != '||' 332 | else section 333 | for section in sounds_list] 334 | ) 335 | 336 | def get_allophones(self, stress_place: str = None, save_stresses: bool = False, save_spaces: bool = False, 337 | save_pauses: bool = False, stress_symbol: str = '+'): 338 | """ 339 | :param stress_place: 'after' - to place the stress symbol after the stressed vowel, 340 | 'before' - to place the stress symbol before the stressed vowel. 341 | :param stress_symbol: A symbol that you want to indicate the stress. 342 | Be careful not to use letters and signs from the following list 343 | ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']! 344 | :param save_spaces: Will replace spaces with '_'. 345 | :param save_stresses: Will replace stresses with the stress_symbol. 346 | :param save_pauses: Will replace punctuation with '||' for long pauses ('.', '?', '!', '…') 347 | and '|' for short pauses (other symbols). 348 | :return: List of allophones. 
349 | """ 350 | if stress_symbol in ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']: 351 | warnings.warn("The stress symbol intersects with the IPA transcription signs " 352 | "or the internal signs of the framework.\nIt may cause unpredictable behaviour.\n" 353 | "It is better not to use signs from the following list " 354 | "['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!") 355 | 356 | if save_pauses: 357 | self._insert_pauses(self._allophones_list) 358 | 359 | escape_symbols = self._get_escape_symbols(save_stresses=save_stresses, save_spaces=save_spaces) 360 | res = self._join_sounds(escape_symbols, self._allophones_list).split() 361 | 362 | if stress_place is None: 363 | stress_place = self._stress_place 364 | if stress_place == 'before': 365 | res = replace_stress_before(res) 366 | 367 | if (stress_symbol != '+') and ('+' not in escape_symbols): 368 | res = [x.replace('+', stress_symbol) for x in res] 369 | 370 | return res 371 | 372 | def get_phonemes(self, stress_place: str = None, save_stresses: bool = False, save_spaces: bool = False, 373 | save_pauses: bool = False, stress_symbol: str = '+'): 374 | """ 375 | :param stress_place: 'after' - to place the stress symbol after the stressed vowel, 376 | 'before' - to place the stress symbol before the stressed vowel. 377 | :param stress_symbol: A symbol that you want to indicate the stress. 378 | Be careful not to use letters and signs from the following list 379 | ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']! 380 | :param save_spaces: Will replace spaces with '_'. 381 | :param save_stresses: Will replace stresses with the stress_symbol. 382 | :param save_pauses: Will replace punctuation with '||' for long pauses ('.', '?', '!', '…') 383 | and '|' for short pauses (other symbols). 384 | :return: List of phonemes. 
385 | """ 386 | if stress_symbol in ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']: 387 | warnings.warn("The stress symbol intersects with the IPA transcription signs " 388 | "or the internal signs of the framework.\nIt may cause unpredictable behaviour.\n" 389 | "It is better not to use signs from the following list " 390 | "['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!") 391 | 392 | if save_pauses: 393 | self._insert_pauses(self._phonemes_list) 394 | 395 | escape_symbols = self._get_escape_symbols(save_stresses=save_stresses, save_spaces=save_spaces) 396 | res = self._join_sounds(escape_symbols, self._phonemes_list).split() 397 | 398 | if stress_place is None: 399 | stress_place = self._stress_place 400 | if stress_place == 'before': 401 | res = replace_stress_before(res) 402 | 403 | if (stress_symbol != '+') and ('+' not in escape_symbols): 404 | res = [x.replace('+', stress_symbol) for x in res] 405 | 406 | return res 407 | 408 | def get_stressed_text(self, stress_place: str = None, stress_symbol: str = '+'): 409 | """ 410 | :param stress_place: 'after' - to place the stress symbol after the stressed vowel, 411 | 'before' - to place the stress symbol before the stressed vowel. 412 | :param stress_symbol: A symbol that you want to indicate the stress. 413 | Be careful not to use signs from the following list ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']! 414 | :return: A text string with stresses. 
415 | """ 416 | if stress_symbol in ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']: 417 | warnings.warn("The stress symbol intersects with the IPA transcription signs " 418 | "or the internal signs of the framework.\nIt may cause unpredictable behaviour.\n" 419 | "It is better not to use signs from the following list " 420 | "['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!") 421 | 422 | if stress_place is None: 423 | stress_place = self._stress_place 424 | if stress_place == 'before': 425 | res = ' '.join([''.join(replace_stress_before(' '.join(section))) for section in self._stressed_text]) 426 | else: 427 | res = ' '.join([' '.join(section) for section in self._stressed_text]) 428 | 429 | if stress_symbol != '+': 430 | res = res.replace('+', stress_symbol) 431 | 432 | return res 433 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- 1 | from .RuTranscript import RuTranscript 2 | from .tools.allophones_tools import get_allophone_info 3 | from .tools.main_tools import text_norm_tok 4 | 5 | __all__ = [ 6 | 'RuTranscript', 7 | 'get_allophone_info', 8 | 'text_norm_tok' 9 | ] 10 | -------------------------------------------------------------------------------- /src/data/alphabet.txt: -------------------------------------------------------------------------------- 1 | a, ɑ, æ, æ., ɐ., ɐ, ə, ʌ, b, bʷ, bː, bːʷ, bˠ, bʲ, bᶣ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, d͡ʒ, d͡ʒᶣ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, i, ɪ, ɪ., j, ʝ, jʷ, jᶣ, ʝʷ, ʝᶣ, k, kʷ, kˠ, kʲ, kː, kːʷ, kʲː, kᶣ, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, m, mʷ, mˠ, mʲ, mː, mːʷ, mʲː, mːˠ, mᶣ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nː, nːʷ, nːˠ, nʲː, nᶣ, o, ɵ, p, pʷ, pː, pːʷ, pʲː, pˠ, pʲ, pᶣ, r, rʷ, rˠ, rʲ, rː, rːʷ, rʲː, rᶣ, r̥, r̥ʲ, s, sʷ, sˠ, sʲ, sː, sːʷ, sʲː, sᶣ, t, tʷ, tˠ, tʲ, tː, tʲː, tᶣ, u, ʉ, ʊ, ᵿ, f, fʷ, fˠ, fʲ, fʲː, fᶣ, x, xʷ, 
xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ɨ, ɨ̟, ɯ̟ɨ̟, ᵻ, ɛ, e, ɪ., ʔ -------------------------------------------------------------------------------- /src/data/error_words_stresses_default.txt: -------------------------------------------------------------------------------- 1 | выходно+го 2 | бого+в 3 | о+строго 4 | ско+рого 5 | вы+скочившего 6 | общечелове+ческого 7 | слепо+го 8 | манхэ+ттэнского 9 | заре+цкого 10 | ви+дного 11 | уси+ленного 12 | джа+зового 13 | люби+мовского 14 | мару+синого 15 | циа+нистого 16 | непро+шеного 17 | шоссэ+ 18 | сму+глого 19 | исаа+киевского 20 | голуби+ного 21 | како+гото 22 | ку+пишь 23 | головно+го 24 | тре+звого 25 | тройно+го 26 | канцеля+рского 27 | то+лстого 28 | до+лгого 29 | кладби+щенского 30 | грохо+чущего 31 | четырнадцатиле+тнего 32 | редакцио+нного 33 | до+рого 34 | то+нущего 35 | изра+ильского 36 | меньшеви+стского 37 | худо+го 38 | давни+шнего 39 | бе+гающего 40 | трудолюби+вого 41 | ку+пит 42 | воро+вского 43 | шестидеся+того 44 | неистреби+мого 45 | до+лжного 46 | произво+дственно 47 | техни+ческого 48 | обще+ственного 49 | запоро+жского 50 | избало+ванного 51 | семьна+дцатого 52 | водопрово+дного 53 | бродя+чего 54 | крикли+вого 55 | седовла+сого 56 | комари+ного 57 | нену+жного 58 | цини+чного 59 | отставно+го 60 | рога+того 61 | души+стого 62 | пусто+го 63 | военнослу+жащего 64 | та+ющего 65 | портно+го 66 | многомиллио+нного 67 | како+гонибудь 68 | со+льного 69 | кафэ+ 70 | единоутро+бного 71 | изоби+лующего 72 | ржано+го 73 | посторо+ннего 74 | туре+цкого 75 | индонези+йского 76 | уе+здного 77 | ми+лого 78 | но+белевского 79 | было+го 80 | не+рвного 81 | гости+ничного 82 | внешта+тного 83 | городско+го 84 | журнали+стского 85 | ежеме+сячного 86 | предыду+щего 87 | а+нгельского 88 | ро+вного 89 | ну+жного 90 | недостаю+щего 91 | купэ+ 92 | зре+лого 93 | ни+щего 94 | прие+зжего 95 | купи+те 96 | не+жного 97 | го+ночного 98 | за+сранного 99 
| семидеся+того 100 | дзержи+нского 101 | цвета+стого 102 | га+нсовского 103 | пифаго+ровского 104 | листово+го 105 | распа+хнутого 106 | жуко+вского 107 | та+ллиннского 108 | несомне+нного 109 | расти+тельного 110 | уби+того 111 | са+мого 112 | слы+шавшего 113 | диссиде+нтствующего 114 | транскри+пцию 115 | транскри+пция 116 | литературнохудо+жественный -------------------------------------------------------------------------------- /src/data/irregular_exceptions.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/suralmasha/RuTranscript/30cbc40c5ac368021bcc8a05002fa33cc50ee9b6/src/data/irregular_exceptions.xlsx -------------------------------------------------------------------------------- /src/data/paired_consonants.txt: -------------------------------------------------------------------------------- 1 | (b, p), (ɡ, k), (z, s), (v, f), (d, t), (ʐ, ʂ), (ʑː, ʂː), (bʲ, pʲ), (ɡʲ, kʲ), (zʲ, sʲ), (zʲː, sʲː), (vʲ, fʲ), (dʲ, tʲ), (ʐʲ, ʂʲ), (ʑʲː, ʂʲː), (bʷ, pʷ), (ɡʷ, kʷ), (zʷ, sʷ), (vʷ, fʷ), (dʷ, tʷ), (ʐʷ, ʂʷ), (ʑːʷ, ʂːʷ), (bᶣ, pᶣ), (ɡᶣ, kᶣ), (zᶣ, sᶣ), (vᶣ, fᶣ), (dᶣ, tᶣ), (ʐᶣ, ʂᶣ), (ʑːᶣ, ɕːᶣ), (bˠ, pˠ), (ɡˠ, kˠ), (zˠ, sˠ), (vˠ, fˠ), (dˠ, tˠ), (ʐˠ, ʂˠ), (ʑːˠ, ʂːˠ) -------------------------------------------------------------------------------- /src/data/sorted_allophones.txt: -------------------------------------------------------------------------------- 1 | total_c = b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡː, γ, γʷ, ɡʲ, ɡᶣ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, d͡ʒ, d͡ʒᶣ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, j, ʝ, jʷ, jᶣ, ʝʷ, ʝᶣ, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, m, mʷ, mˠ, mʲ, mᶣ, ɱ, ɱʲ, mː, mːʷ, mʲː, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ, s, sʷ, sˠ, sʲ, sᶣ, sː, sːʷ, sʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, f, fʷ, fˠ, fʲ, fᶣ, fʲː, x, xʷ, xˠ, xʲ, xᶣ, 
t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ʔ 2 | voiceless_c = ʔ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, f, fʷ, fˠ, fʲ, fᶣ, fʲː, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, s, sʷ, sˠ, sʲ, sᶣ, sː, sʲː, sːʷ, x, xʷ, xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ɕː, ɕːᶣ 3 | voiced_c = b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, ʐ, dʷ, dˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, j, jʷ, jᶣ, ʝ, ʝʷ, ʝᶣ, m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː, r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ, d͡ʒ, d͡ʒᶣ, d̻͡z̪, l, lʷ, lˠ, lʲ, lᶣ, lː, lʲː, lːʷ, ʐ, ʐʷ, ʐˠ 4 | soft_c = bʲ, bᶣ, dʲ, dᶣ, dʲː, dːᶣ, vʲ, vᶣ, ɡʲ, ɡᶣ, zʲ, zʲː, zᶣ, lʲ, lᶣ, lʲː, lːᶣ, mʲ, mᶣ, mʲː, ɱ, ɱʲ, nʲ, nᶣ, nʲː, nᶣ, rʲ, rᶣ, rʲː, r̥ʲ, pʲ, pᶣ, pʲː, tʲ, tᶣ, tʲː, fʲ, fᶣ, fʲː, kʲ, kᶣ, kʲː, sʲ, sᶣ, sʲː, xʲ, xᶣ, ʑʲː, ʑːᶣ 5 | always_soft_c = j, ʝ, jʷ, ʝʷ, jᶣ, ʝᶣ, ɕː, ɕːᶣ, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, d͡ʒ, d͡ʒᶣ 6 | hard_c = b, bʷ, bˠ, bː, bːʷ, d, dʷ, dˠ, dː, dːʷ, dːˠ, v, vʷ, vˠ, ɡ, ɡʷ, ɡˠ, ɡː, z, zʷ, zˠ, zː, l, lʷ, lˠ, lː, lːʷ, m, mʷ, mˠ, mː, mːʷ, mːˠ, n, nʷ, nˠ, nː, nːʷ, nːˠ, r, rʷ, rˠ, rː, rːʷ, r̥, p, pʷ, pˠ, pː, pːʷ, t, tʷ, tˠ, tː, f, fʷ, fˠ, k, kʷ, kˠ, kː, kːʷ, s, sʷ, sˠ, sː, sːʷ, x, xʷ, xˠ 7 | always_hard_c = ʔ, γ, γʷ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ 8 | hissing_c = ɕː, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, ʐ, ʂ, ʂʷ, ʐʷ, d͡ʒ, d͡ʒᶣ, ɕːᶣ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ʐˠ, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ 9 | bilabial_c = b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ 10 | labiodental_c = f, fʷ, fˠ, fʲ, fᶣ, fʲː, v, vʷ, vˠ, vʲ, vᶣ 11 | dental_c = t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, s, sʷ, sˠ, sʲ, sᶣ, sː, sːʷ, sʲː, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, 
nʲː, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪ 12 | palatinodental_c = t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, d͡ʒ, d͡ʒᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, r, rʷ, rˠ, rː, rːʷ, rʲː, r̥, rʲ, rᶣ, r̥ʲ 13 | palatal_c = j, jʷ, jᶣ, ʝᶣ, ʝ, ʝʷ 14 | velar_c = k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, x, xʷ, xˠ, xʲ, xᶣ 15 | glottal_c = ʔ 16 | explosive_c = ʔ, b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ 17 | affricate_c = t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, d͡ʒ, d͡ʒᶣ 18 | fricative_c = f, fʷ, fˠ, fʲ, fᶣ, fʲː, v, vʷ, vˠ, vʲ, vᶣ, s, sʷ, sˠ, sʲ, sᶣ, sː, sːʷ, sʲː, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, j, jʷ, jᶣ, ʝᶣ, ʝ, ʝʷ, x, xʷ, xˠ, xʲ, xᶣ 19 | nasal_c = m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː 20 | lateral_c = l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ 21 | vibrant_c = r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ 22 | paired_c = (b, p), (ɡ, k), (z, s), (v, f), (d, t), (ʐ, ʂ), (ʑː, ʂː), (bʲ, pʲ), (ɡʲ, kʲ), (zʲ, sʲ), (zʲː, sʲː), (vʲ, fʲ), (dʲ, tʲ), (ʐʲ, ʂʲ), (ʑʲː, ʂʲː), (bʷ, pʷ), (ɡʷ, kʷ), (zʷ, sʷ), (vʷ, fʷ), (dʷ, tʷ), (ʐʷ, ʂʷ), (ʑːʷ, ʂːʷ), (bᶣ, pᶣ), (ɡᶣ, kᶣ), (zᶣ, sᶣ), (vᶣ, fᶣ), (dᶣ, tᶣ), (ʐᶣ, ʂᶣ), (ʑːᶣ, ɕːᶣ), (bˠ, pˠ), (ɡˠ, kˠ), (zˠ, sˠ), (vˠ, fˠ), (dˠ, tˠ), (ʐˠ, ʂˠ), (ʑːˠ, ʂːˠ) 23 | total_v = a, ɑ, æ, æ., ɐ., ɐ, ə, ʌ, i, ɪ, ɪ., o, ɵ, u, ʉ, ʊ, ᵿ, ɨ, ᵻ, ɨ̟, ɯ̟ɨ̟, ɛ, e 24 | front_v = i, e, ɛ, æ, æ. 25 | near_front_v = ɪ, ɪ. 
26 | central_v = ɨ, ɨ̟, ɯ̟ɨ̟, ᵻ, ɵ, ə, a 27 | near_back_v = ʊ, ᵿ, ɐ., ɐ 28 | back_v = u, ʉ, o, ʌ, ɑ 29 | close_v = i, ɨ, ɨ̟, ɯ̟ɨ̟, u, ʉ 30 | near_close_v = ɪ, ɪ., ʊ, ᵿ, ᵻ 31 | close_mid_v = e, ɵ, o 32 | mid_v = ə 33 | open_mid_v = ɛ, ʌ 34 | near_open_v = æ, æ., ɐ., ɐ 35 | open_v = a, ɑ 36 | rounded_v = o, ɵ, u, ʉ, ʊ, ᵿ 37 | velarize_v = ɨ, ɨ̟, ɯ̟ɨ̟, ᵻ 38 | sonorous_class = r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː 39 | voiced_class = ʒ, ʒʷ, ʒˠ, ʒː, ʒːʷ, ʒːˠ, ʒʲː, ʒːᶣ, d͡ʒ, d͡ʒᶣ, d̻͡z̪, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, dʷ, dˠ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, j, jʷ, jᶣ, ʝ, ʝʷ, ʝᶣ 40 | voiceless_class = t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ɕː, ɕːᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ʔ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, f, fʷ, fˠ, fʲ, fᶣ, fʲː, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, s, sʷ, sˠ, sʲ, sᶣ, sː, sʲː, sːʷ, x, xʷ, xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ 41 | complex_experiment = l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, ɨ, ᵻ, ɨ̟, ɯ̟ɨ̟, j, jʷ, jᶣ, ʝ, ʝʷ, ʝᶣ, o, ɵ, u, ʉ, ʊ, ᵿ 42 | rare_experiment = zː, d̻͡z̪, tʲː, r̥ʲ, ʑː, d͡ʒ, ɱ, tː, lː, nʲː, mː, sʲː, sː, xʲ, fʲ, f, ɡʲ, ʝ, nː, r̥, zʲ, ɕː 43 | random_vowels_experiment = ɐ., u, ɛ, ɯ̟ɨ̟, ɵ 44 | long_consonants_experiment = bː, ɡː, dː, dːʷ, dːˠ, dʲː, dːᶣ, zː, zʲː, kː, kːʷ, kʲː, lː, lːʷ, lʲː, lːᶣ, mː, mːʷ, mʲː, nː, nːʷ, nːˠ, nʲː, pː, pːʷ, pʲː, rː, rːʷ, rʲː, sː, sːʷ, sʲː, tː, tʲː, fʲː 45 | -------------------------------------------------------------------------------- /src/tests/test_consonants.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from ..
import RuTranscript 4 | 5 | 6 | class TestConsonants(unittest.TestCase): 7 | 8 | def test_fricative_g_1(self): 9 | testing_text = 'господи' 10 | testing_a_text = 'го+споди' 11 | ru_transcript = RuTranscript(testing_text, testing_a_text) 12 | ru_transcript.transcribe() 13 | print(testing_text, ru_transcript.get_allophones()) 14 | self.assertEqual(['γʷ', 'o', 's', 'p', 'ə', 'dʲ', 'ɪ'], ru_transcript.get_allophones()) 15 | 16 | def test_fricative_g_2(self): 17 | testing_text = 'ах да' 18 | testing_a_text = 'а+х да+' 19 | ru_transcript = RuTranscript(testing_text, testing_a_text) 20 | ru_transcript.transcribe() 21 | print(testing_text, ru_transcript.get_allophones()) 22 | self.assertEqual(['a', 'γ', 'd', 'ʌ'], ru_transcript.get_allophones()) 23 | 24 | def test_nasal_m_n(self): # 'м' / 'н' before labiodental consonants 25 | testing_text = 'амфора' 26 | testing_a_text = 'а+мфора' 27 | ru_transcript = RuTranscript(testing_text, testing_a_text) 28 | ru_transcript.transcribe() 29 | print(testing_text, ru_transcript.get_allophones()) 30 | self.assertEqual(['a', 'ɱ', 'f', 'ə', 'r', 'ʌ'], ru_transcript.get_allophones()) 31 | 32 | def test_silent_r(self): # 'р' before voiceless consonants and at the end of a word 33 | testing_text = 'арфа' 34 | testing_a_text = 'а+рфа' 35 | ru_transcript = RuTranscript(testing_text, testing_a_text) 36 | ru_transcript.transcribe() 37 | print(testing_text, ru_transcript.get_allophones()) 38 | self.assertEqual(['a', 'r̥', 'f', 'ʌ'], ru_transcript.get_allophones()) 39 | 40 | def test_long_sh(self): # long 'ш' in the cluster 'сш' 41 | testing_text = 'сшить' 42 | testing_a_text = 'сши+ть' 43 | ru_transcript = RuTranscript(testing_text, testing_a_text) 44 | ru_transcript.transcribe() 45 | print(testing_text, ru_transcript.get_allophones()) 46 | self.assertEqual(['ʂːˠ', 'ɨ', 'tʲ'], ru_transcript.get_allophones()) 47 | 48 | def test_ts(self): # 'ц' 49 | testing_text = 'цапля' 50 | testing_a_text = 'ца+пля' 51 | ru_transcript = RuTranscript(testing_text,
testing_a_text) 52 | ru_transcript.transcribe() 53 | print(testing_text, ru_transcript.get_allophones()) 54 | self.assertEqual(['t͡s', 'ɐ.', 'p', 'lʲ', 'æ.'], ru_transcript.get_allophones()) 55 | 56 | def test_voiced_ts(self): # 'ц' before a voiced consonant 57 | testing_text = 'плацдарм' 58 | testing_a_text = 'плацда+рм' 59 | ru_transcript = RuTranscript(testing_text, testing_a_text) 60 | ru_transcript.transcribe() 61 | print(testing_text, ru_transcript.get_allophones()) 62 | self.assertEqual(['p', 'l', 'ɐ', 'd̻͡z̪', 'd', 'a', 'r', 'm'], ru_transcript.get_allophones()) 63 | 64 | def test_dj(self): # the cluster 'дж' 65 | testing_text = 'джунгли' 66 | testing_a_text = 'джу+нгли' 67 | ru_transcript = RuTranscript(testing_text, testing_a_text) 68 | ru_transcript.transcribe() 69 | print(testing_text, ru_transcript.get_allophones()) 70 | self.assertEqual(['d͡ʒᶣ', 'ʉ', 'n', 'ɡ', 'lʲ', 'ɪ'], ru_transcript.get_allophones()) 71 | 72 | def test_shch_1(self): # 'щ' 73 | testing_text = 'щегол' 74 | testing_a_text = 'щего+л' 75 | ru_transcript = RuTranscript(testing_text, testing_a_text) 76 | ru_transcript.transcribe() 77 | print(testing_text, ru_transcript.get_allophones()) 78 | self.assertEqual(['ɕː', 'ə', 'ɡʷ', 'o', 'l'], ru_transcript.get_allophones()) 79 | 80 | def test_shch_2(self): # 'ж' before a voiceless consonant
81 | testing_text = 'мужчина' 82 | testing_a_text = 'мужчи+на' 83 | ru_transcript = RuTranscript(testing_text, testing_a_text) 84 | ru_transcript.transcribe() 85 | print(testing_text, ru_transcript.get_allophones()) 86 | self.assertEqual(['mʷ', 'ʊ', 'ɕː', 'i', 'n', 'ʌ'], ru_transcript.get_allophones()) 87 | 88 | def test_shch_3(self): # the clusters 'сч', 'зч', 'жч' 89 | testing_text = 'считать' 90 | testing_a_text = 'счита+ть' 91 | ru_transcript = RuTranscript(testing_text, testing_a_text) 92 | ru_transcript.transcribe() 93 | print(testing_text, ru_transcript.get_allophones()) 94 | self.assertEqual(['ɕː', 'ɪ', 't', 'a', 'tʲ'], ru_transcript.get_allophones()) 95 | 96 | def test_ch(self): # 'ч' 97 | testing_text = 'течь' 98 | testing_a_text = 'те+чь' 99 | ru_transcript = RuTranscript(testing_text, testing_a_text) 100 | ru_transcript.transcribe() 101 | print(testing_text, ru_transcript.get_allophones()) 102 | self.assertEqual(['tʲ', 'e', 't͡ɕ'], ru_transcript.get_allophones()) 103 | 104 | def test_long_ge_1(self): # long 'ж' 105 | testing_text = 'жужжать' 106 | testing_a_text = 'жужжа+ть' 107 | ru_transcript = RuTranscript(testing_text, testing_a_text) 108 | ru_transcript.transcribe() 109 | print(testing_text, ru_transcript.get_allophones()) 110 | self.assertEqual(['ʐʷ', 'ʊ', 'ʑː', 'ɐ.', 'tʲ'], ru_transcript.get_allophones()) 111 | 112 | def test_long_ge_2(self): # 'щ' before a voiced consonant 113 | testing_text = 'вещдок' 114 | testing_a_text = 'вещдо+к' 115 | ru_transcript = RuTranscript(testing_text, testing_a_text) 116 | ru_transcript.transcribe() 117 | print(testing_text, ru_transcript.get_allophones()) 118 | self.assertEqual(['vʲ', 'ɪ', 'ʑː', 'dʷ', 'o', 'k'], ru_transcript.get_allophones()) 119 | 120 | def test_long_ge_3(self): # the cluster 'зж' 121 | testing_text = 'заезжий' 122 | testing_a_text = 'зае+зжий' 123 | ru_transcript = RuTranscript(testing_text, testing_a_text) 124 | ru_transcript.transcribe() 125 | print(testing_text, ru_transcript.get_allophones())
126 | self.assertEqual(['z', 'ɐ', 'j', 'e', 'ʑːˠ', 'ɨ', 'j'], ru_transcript.get_allophones()) 127 | 128 | def test_j_1(self): # 'й' 129 | testing_text = 'май' 130 | testing_a_text = 'ма+й' 131 | ru_transcript = RuTranscript(testing_text, testing_a_text) 132 | ru_transcript.transcribe() 133 | print(testing_text, ru_transcript.get_allophones()) 134 | self.assertEqual(['m', 'a', 'j'], ru_transcript.get_allophones()) 135 | 136 | def test_j_2(self): # iotated vowel after the separating signs ъ and ь 137 | testing_text = 'объявление' 138 | testing_a_text = 'объявле+ние' 139 | ru_transcript = RuTranscript(testing_text, testing_a_text) 140 | ru_transcript.transcribe() 141 | print(testing_text, ru_transcript.get_allophones()) 142 | self.assertEqual(['ə', 'bʲ', 'j', 'ɪ', 'v', 'lʲ', 'e', 'nʲ', 'j', 'æ.'], ru_transcript.get_allophones()) 143 | 144 | def test_j_3(self): # iotated vowel between two vowels 145 | testing_text = 'заяц' 146 | testing_a_text = 'за+яц' 147 | ru_transcript = RuTranscript(testing_text, testing_a_text) 148 | ru_transcript.transcribe() 149 | print(testing_text, ru_transcript.get_allophones()) 150 | self.assertEqual(['z', 'a', 'j', 'ɪ.', 't͡s'], ru_transcript.get_allophones()) 151 | 152 | def test_j_4(self): # iotated vowel before a stressed vowel 153 | testing_text = 'заезжий' 154 | testing_a_text = 'зае+зжий' 155 | ru_transcript = RuTranscript(testing_text, testing_a_text) 156 | ru_transcript.transcribe() 157 | print(testing_text, ru_transcript.get_allophones()) 158 | self.assertEqual(['z', 'ɐ', 'j', 'e', 'ʑːˠ', 'ɨ', 'j'], ru_transcript.get_allophones()) 159 | 160 | def test_j_5(self): # iotated vowel at the beginning of a word 161 | testing_text = 'яхта' 162 | testing_a_text = 'я+хта' 163 | ru_transcript = RuTranscript(testing_text, testing_a_text) 164 | ru_transcript.transcribe() 165 | print(testing_text, ru_transcript.get_allophones()) 166 | self.assertEqual(['ʝ', 'æ', 'x', 't', 'ʌ'], ru_transcript.get_allophones()) 167 | 168 | def
test_j_first(self): # yot at the beginning of a syllable 169 | testing_text = 'я' 170 | testing_a_text = 'я+' 171 | ru_transcript = RuTranscript(testing_text, testing_a_text) 172 | ru_transcript.transcribe() 173 | print(testing_text, ru_transcript.get_allophones()) 174 | self.assertEqual(['ʝ', 'æ'], ru_transcript.get_allophones()) 175 | 176 | def test_long_consonant_junction_of_words(self): # long consonant at a word boundary 177 | testing_text = 'вот так' 178 | testing_a_text = 'вот так' 179 | ru_transcript = RuTranscript(testing_text, testing_a_text) 180 | ru_transcript.transcribe() 181 | print(testing_text, ru_transcript.get_allophones()) 182 | self.assertEqual(['v', 'ɐ', 'tː', 'a', 'k'], ru_transcript.get_allophones()) 183 | 184 | def test_consonants_stunning_in_the_end_of_a_word(self): 185 | testing_text = 'всерьёз' 186 | ru_transcript = RuTranscript(testing_text) 187 | ru_transcript.transcribe() 188 | print(testing_text, ru_transcript.get_allophones()) 189 | self.assertEqual(['vʲ', 'sʲ', 'ɪ', 'rʲ', 'jᶣ', 'ɵ', 's'], ru_transcript.get_allophones()) 190 | 191 | 192 | if __name__ == '__main__': 193 | unittest.main() 194 | -------------------------------------------------------------------------------- /src/tests/test_modules.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from .. import RuTranscript 4 | from ..
import text_norm_tok 5 | 6 | 7 | class TestModules(unittest.TestCase): 8 | 9 | def test_stress_one_syllable(self): 10 | testing_text = 'нос' 11 | ru_transcript = RuTranscript(testing_text) 12 | ru_transcript.transcribe() 13 | print(testing_text, ru_transcript.get_stressed_text()) 14 | self.assertEqual('но+с', ru_transcript.get_stressed_text()) 15 | 16 | def test_stress_yo(self): 17 | testing_text = 'ёлка' 18 | ru_transcript = RuTranscript(testing_text) 19 | ru_transcript.transcribe() 20 | print(testing_text, ru_transcript.get_stressed_text()) 21 | self.assertEqual('ё+лка', ru_transcript.get_stressed_text()) 22 | 23 | def test_stress_readme_transcription(self): 24 | testing_text = 'Как получить транскрипцию?' 25 | ru_transcript = RuTranscript(testing_text) 26 | ru_transcript.transcribe() 27 | print(testing_text, ru_transcript.get_stressed_text()) 28 | self.assertEqual('ка+к получи+ть транскри+пцию', ru_transcript.get_stressed_text()) 29 | 30 | def test_replace_e(self): 31 | testing_text = 'синтез речи в библиотеке' 32 | ru_transcript = RuTranscript(testing_text) 33 | ru_transcript.transcribe() 34 | print(testing_text, ru_transcript._tokens) 35 | self.assertEqual([['синтэз', 'речи', 'в', 'библиотеке']], ru_transcript._tokens) 36 | 37 | def test_replace_yo(self): 38 | testing_text = 'елка для ее ежика перышка подвел конек мед' 39 | ru_transcript = RuTranscript(testing_text) 40 | ru_transcript.transcribe() 41 | print(testing_text, ru_transcript._tokens) 42 | self.assertEqual([['ёлка', 'для', 'её', 'ёжика', 'пёрышка', 'подвёл', 'конёк', 'мёд']], ru_transcript._tokens) 43 | 44 | def test_replace_user_dict(self): 45 | testing_text = 'TTS - это увлекательно' 46 | ru_transcript = RuTranscript(testing_text, replacement_dict={"tts": "синтез речи"}) 47 | ru_transcript.transcribe() 48 | print(testing_text, ru_transcript._tokens) 49 | self.assertEqual([['синтэз', 'речи'], ['это', 'увлекательно']], ru_transcript._tokens) 50 | 51 | def test_dirty_text(self): 52 | testing_text = 
'синтез речи - это#$ «увлекательно»' 53 | res = text_norm_tok(testing_text) 54 | print(testing_text, res) 55 | self.assertEqual([['синтез', 'речи', '-', 'это', 'увлекательно']], res) 56 | 57 | def test_error_stress(self): 58 | testing_text = 'литературнохудожественный' 59 | ru_transcript = RuTranscript(testing_text) 60 | ru_transcript.transcribe() 61 | print(testing_text, ru_transcript.get_stressed_text()) 62 | self.assertEqual('литературнохудо+жественный', ru_transcript.get_stressed_text()) 63 | 64 | 65 | if __name__ == '__main__': 66 | unittest.main() 67 | -------------------------------------------------------------------------------- /src/tests/test_phrases.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from .. import RuTranscript 4 | 5 | 6 | class TestPhrases(unittest.TestCase): 7 | 8 | def test_readme_transcription(self): 9 | testing_text = 'Как получить транскрипцию?' 10 | testing_a_text = 'Ка+к получи+ть транскри+пцию?' 11 | ru_transcript = RuTranscript(testing_text, testing_a_text) 12 | ru_transcript.transcribe() 13 | print(testing_text, ru_transcript.get_allophones()) 14 | self.assertEqual(['k', 'a', 'k', 'p', 'ə', 'lʷ', 'ʊ', 't͡ɕ', 'i', 'tʲ', 't', 'r', 'ɐ', 'n', 's', 'k', 'rʲ', 15 | 'i', 'p', 't͡sˠ', 'ɨ', 'jᶣ', 'ᵿ'], ru_transcript.get_allophones()) 16 | 17 | def test_readme_comma(self): 18 | testing_text = 'Мышка, кошка и собака' 19 | testing_a_text = 'Мы+шка, ко+шка и+ соба+ка' 20 | ru_transcript = RuTranscript(testing_text, testing_a_text) 21 | ru_transcript.transcribe() 22 | print(testing_text, ru_transcript.get_allophones()) 23 | self.assertEqual(['mˠ', 'ɨ', 'ʂ', 'k', 'ʌ', 'kʷ', 'o', 'ʂ', 'k', 'ʌ', 'i', 's', 'ɐ', 'b', 'a', 'k', 'ʌ'], 24 | ru_transcript.get_allophones()) 25 | 26 | def test_1(self): 27 | testing_text = 'И никогда, ни в единой самой убогой самой фантастической петербургской компании ' \ 28 | '— меня не объявляли гением.' 
29 | testing_a_text = 'И никогд+а, н+и в ед+иной с+амой уб+огой с+амой фантаст+ической петерб+ургской комп+ании ' \ 30 | '— мен+я н+е объявл+яли г+ением.' 31 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 32 | ru_transcript.transcribe() 33 | print(testing_text, ru_transcript.get_allophones()) 34 | self.assertEqual(['i', 'nʲ', 'ɪ', 'k', 'ɐ', 'ɡ', 'd', 'a', 'nʲ', 'ɪ', 'vʲ', 'j', 'ɪ', 'dʲ', 'i', 'n', 'ə', 'j', 35 | 's', 'a', 'm', 'ə', 'j', 'ᵿ', 'bʷ', 'o', 'ɡ', 'ə', 'j', 's', 'a', 'm', 'ə', 'j', 'f', 'ə', 36 | 'n', 't', 'ɐ', 'sʲ', 'tʲ', 'i', 't͡ɕ', 'ə', 's', 'k', 'ə', 'j', 'pʲ', 'ɪ.', 'tʲ', 'ɪ', 'r', 37 | 'bʷ', 'u', 'r', 'ʐ', 's', 'k', 'ə', 'j', 'k', 'ɐ', 'm', 'p', 'a', 'nʲ', 'ɪ', 'i', 'mʲ', 'ɪ', 38 | 'nʲ', 'æ', 'nʲ', 'ɪ.', 'ə', 'bʲ', 'j', 'ɪ', 'v', 'lʲ', 'æ', 'lʲ', 'ɪ', 'ɡʲ', 'e', 'nʲ', 39 | 'ɪ', 'j', 'ɪ.', 'm'], ru_transcript.get_allophones()) 40 | 41 | def test_2(self): 42 | testing_text = 'Но против Агнии Францевны, у меня было сильное оружие — вежливость.' 43 | testing_a_text = 'Н+о пр+отив +Агнии Фр+анцевны, у мен+я б+ыло с+ильное ор+ужие — в+ежливость.' 44 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 45 | ru_transcript.transcribe() 46 | print(testing_text, ru_transcript.get_allophones()) 47 | self.assertEqual(['n', 'ɐ', 'p', 'rʷ', 'o', 'tʲ', 'ɪ', 'v', 'a', 'ɡ', 'nʲ', 'ɪ', 'i', 'f', 'r', 'a', 'n', 48 | 't͡s', 'ə', 'v', 'nˠ', 'ᵻ', 'ᵿ', 'mʲ', 'ɪ', 'nʲ', 'æ', 'bˠ', 'ɨ', 'l', 'ʌ', 'sʲ', 'i', 'lʲ', 49 | 'n', 'ə', 'j', 'æ.', 'ɐ', 'rʷ', 'u', 'ʐ', 'j', 'æ.', 'vʲ', 'e', 'ʐ', 'lʲ', 'ɪ', 'v', 'ə', 50 | 'sʲ', 'tʲ'], ru_transcript.get_allophones()) 51 | 52 | def test_3(self): 53 | testing_text = 'Что апперцепция у Бальзака неорганична.' 54 | testing_a_text = 'Чт+о апперц+епция у Бальз+ака неорган+ична.' 
55 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 56 | ru_transcript.transcribe() 57 | print(testing_text, ru_transcript.get_allophones()) 58 | self.assertEqual(['ʂ', 'tʷ', 'o', 'ə', 'p', 'pʲ', 'ɪ', 'r̥', 't͡sˠ', 'ᵻ', 'p', 't͡sˠ', 'ɨ', 'j', 'æ.', 'ᵿ', 'b', 59 | 'ɐ', 'lʲ', 'z', 'a', 'k', 'ʌ', 'nʲ', 'ɪ.', 'ə', 'r', 'ɡ', 'ɐ', 'nʲ', 'i', 't͡ɕ', 'n', 'ʌ'], 60 | ru_transcript.get_allophones()) 61 | 62 | def test_4(self): 63 | testing_text = 'Башкирия, Уфа, эвакуация, мне три недели.' 64 | testing_a_text = 'Башк+ирия, Уф+а, эваку+ация, мн+е тр+и нед+ели.' 65 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 66 | ru_transcript.transcribe() 67 | print(testing_text, ru_transcript.get_allophones()) 68 | self.assertEqual(['b', 'ɐ', 'ʂ', 'kʲ', 'i', 'rʲ', 'j', 'æ.', 'ᵿ', 'f', 'a', 'ɪ.', 'v', 'ə', 'kʷ', 'ʊ', 'æ', 69 | 't͡sˠ', 'ɨ', 'j', 'æ.', 'mʲ', 'nʲ', 'e', 't', 'rʲ', 'i', 'nʲ', 'ɪ', 'dʲ', 'e', 'lʲ', 'ɪ'], 70 | ru_transcript.get_allophones()) 71 | 72 | def test_5(self): 73 | testing_text = 'Настоящие мужчины гибнут на передовой.' 74 | testing_a_text = 'Насто+ящие мужч+ины г+ибнут н+а передов+ой.' 75 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 76 | ru_transcript.transcribe() 77 | print(testing_text, ru_transcript.get_allophones()) 78 | self.assertEqual(['n', 'ə', 's', 't', 'ɐ', 'j', 'æ', 'ɕː', 'ɪ', 'j', 'æ.', 'mʷ', 'ʊ', 'ɕː', 'i', 'nˠ', 'ᵻ', 79 | 'ɡʲ', 'i', 'b', 'nʷ', 'ʊ', 't', 'n', 'ə', 'pʲ', 'ɪ.', 'rʲ', 'ɪ.', 'd', 'ɐ', 'vʷ', 'o', 'j'], 80 | ru_transcript.get_allophones()) 81 | 82 | def test_6(self): 83 | testing_text = 'Неуклюжие эпиграммы.' 84 | testing_a_text = 'Неукл+южие эпигр+аммы.' 
85 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 86 | ru_transcript.transcribe() 87 | print(testing_text, ru_transcript.get_allophones()) 88 | self.assertEqual(['nʲ', 'ɪ.', 'ᵿ', 'k', 'lᶣ', 'ʉ', 'ʐ', 'j', 'æ.', 'ɪ.', 'pʲ', 'ɪ', 'ɡ', 'r', 'a', 'mːˠ', 'ᵻ'], 89 | ru_transcript.get_allophones()) 90 | 91 | def test_7(self): 92 | testing_text = 'Да и с Вольфом у меня хорошие отношения.' 93 | testing_a_text = 'Д+а и с В+ольфом у мен+я хор+ошие отнош+ения.' 94 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 95 | ru_transcript.transcribe() 96 | self.assertEqual(['d', 'a', 'i', 's', 'vʷ', 'o', 'lʲ', 'f', 'ə', 'm', 'ᵿ', 'mʲ', 'ɪ', 'nʲ', 'æ', 'x', 'ɐ', 97 | 'rʷ', 'o', 'ʂ', 'j', 'æ.', 'ə', 't', 'n', 'ɐ', 'ʂˠ', 'ᵻ', 'nʲ', 'j', 'æ.'], 98 | ru_transcript.get_allophones()) 99 | 100 | def test_8(self): 101 | testing_text = 'Хотя наиболее чудовищные эпатирующие подробности лагерной жизни, я как говорится опустил.' 102 | testing_a_text = 'Хот+я наиб+олее чуд+овищные эпат+ирующие подр+обности л+агерной ж+изни, я к+ак говор+ится ' \ 103 | 'опуст+ил.' 104 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 105 | ru_transcript.transcribe() 106 | print(testing_text, ru_transcript.get_allophones()) 107 | self.assertEqual(['x', 'ɐ', 'tʲ', 'æ', 'n', 'ə', 'i', 'bʷ', 'o', 'lʲ', 'ɪ.', 'j', 'æ.', 't͡ɕᶣ', 'ᵿ', 'dʷ', 'o', 108 | 'vʲ', 'ɪ', 'ɕː', 'nˠ', 'ᵻ', 'j', 'æ.', 'ɪ.', 'p', 'ɐ', 'tʲ', 'i', 'rʷ', 'ʊ', 'jᶣ', 'ᵿ', 'ɕː', 109 | 'ɪ', 'j', 'æ.', 'p', 'ɐ', 'd', 'rʷ', 'o', 'b', 'n', 'ə', 'sʲ', 'tʲ', 'ɪ', 'l', 'a', 'ɡʲ', 110 | 'ɪ.', 'r', 'n', 'ɐ', 'j', 'ʐˠ', 'ɨ', 'zʲ', 'nʲ', 'ɪ', 'ʝ', 'æ', 'k', 'a', 'k', 'ɡ', 'ə', 'v', 111 | 'ɐ', 'rʲ', 'i', 't͡s', 'ə', 'ə', 'pʷ', 'ʊ', 112 | 'sʲ', 'tʲ', 'i', 'l'], ru_transcript.get_allophones()) 113 | 114 | def test_yo(self): 115 | testing_text = 'елка для ее ежика перышка подвел конек мед.' 
116 | ru_transcript = RuTranscript(testing_text) 117 | ru_transcript.transcribe() 118 | print(testing_text, ru_transcript.get_allophones()) 119 | self.assertEqual(['ʝᶣ', 'ɵ', 'l', 'k', 'ʌ', 'd', 'lʲ', 'æ', 'j', 'ɪ', 'jᶣ', 'ɵ', 'jᶣ', 'ɵ', 'ʐˠ', 'ɨ', 'k', 'ʌ', 120 | 'pᶣ', 'ɵ', 'rˠ', 'ᵻ', 'ʂ', 'k', 'ʌ', 'p', 'ɐ', 'd', 'vᶣ', 'ɵ', 'l', 'k', 'ɐ', 'nᶣ', 'ɵ', 'kʲ', 121 | 'mᶣ', 'ɵ', 't'], ru_transcript.get_allophones()) 122 | 123 | def test_dashes(self): 124 | testing_text = 'Синтез речи - это что-то увлекательное!' 125 | ru_transcript = RuTranscript(testing_text) 126 | ru_transcript.transcribe() 127 | print(testing_text, ru_transcript.get_allophones()) 128 | self.assertEqual(['sʲ', 'i', 'n', 'tˠ', 'ᵻ', 'z', 'rʲ', 'e', 't͡ɕ', 'ɪ', 'ɛ', 't', 'ʌ', 'ʂ', 'tʷ', 'o', 't', 129 | 'ʌ', 'ᵿ', 'v', 'lʲ', 'ɪ', 'k', 'a', 'tʲ', 'ɪ.', 'lʲ', 'n', 'ə', 'j', 'æ.'], 130 | ru_transcript.get_allophones()) 131 | 132 | def test_spaces(self): 133 | testing_text = 'Синтез речи - это что-то увлекательное!\n' 134 | ru_transcript = RuTranscript(testing_text) 135 | ru_transcript.transcribe() 136 | print(testing_text, ru_transcript.get_allophones()) 137 | self.assertEqual(['sʲ', 'i', 'n', 'tˠ', 'ᵻ', 'z', 'rʲ', 'e', 't͡ɕ', 'ɪ', 'ɛ', 't', 'ʌ', 'ʂ', 'tʷ', 'o', 't', 138 | 'ʌ', 'ᵿ', 'v', 'lʲ', 'ɪ', 'k', 'a', 'tʲ', 'ɪ.', 'lʲ', 'n', 'ə', 'j', 'æ.'], 139 | ru_transcript.get_allophones()) 140 | 141 | def test_skipping_proclitic(self): 142 | testing_text = 'Они расцветают и становятся заметными лишь на фоне какого-нибудь безобразия.' 143 | testing_a_text = 'Он+и расцвет+ают и стан+овятся зам+етными л+ишь н+а ф+оне какого-нибудь безобр+азия.' 
144 | ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before') 145 | ru_transcript.transcribe() 146 | print(testing_text, ru_transcript.get_allophones()) 147 | self.assertEqual(['ɐ', 'nʲ', 'i', 'r', 'ə', 's', 'd̻͡z̪', 'vʲ', 'ɪ', 't', 'a', 'jᶣ', 'ᵿ', 't', 'ᵻ', 's', 't', 148 | 'ɐ', 'nʷ', 'o', 'vʲ', 'ɪ.', 't͡s', 'ə', 'z', 'ɐ', 'mʲ', 'e', 't', 'nˠ', 'ᵻ', 'mʲ', 'ɪ', 149 | 'lʲ', 'ɪ', 'ʂ', 'n', 'ɐ', 'fʷ', 'o', 'nʲ', 'æ.', 'k', 'ɐ', 'kʷ', 'o', 'v', 'ə', 'nʲ', 'ɪ', 150 | 'bʷ', 'ʊ', 'dʲ', 'bʲ', 'ɪ.', 'z', 'ɐ', 'b', 'r', 'a', 'zʲ', 'j', 'æ.'], 151 | ru_transcript.get_allophones()) 152 | 153 | def test_skipping_enclitic(self): 154 | testing_text = 'Да это же писатель!' 155 | ru_transcript = RuTranscript(testing_text) 156 | ru_transcript.transcribe() 157 | print(testing_text, ru_transcript.get_allophones()) 158 | self.assertEqual(['d', 'ɐ', 'e', 't', 'ə', 'ʐ', 'ə', 'pʲ', 'ɪ', 's', 'a', 'tʲ', 'ɪ.', 'lʲ'], 159 | ru_transcript.get_allophones()) 160 | 161 | 162 | if __name__ == '__main__': 163 | unittest.main() 164 | -------------------------------------------------------------------------------- /src/tests/test_vowels.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | from .. import RuTranscript 4 | 5 | 6 | class TestVowels(unittest.TestCase): 7 | 8 | def test_vowel_a_1(self): # stressed, after a hard consonant 9 | testing_text = 'трава' 10 | testing_a_text = 'трава+' 11 | ru_transcript = RuTranscript(testing_text, testing_a_text) 12 | ru_transcript.transcribe() 13 | print(testing_text, ru_transcript.get_allophones()) 14 | self.assertEqual(['t', 'r', 'ɐ', 'v', 'a'], ru_transcript.get_allophones()) 15 | 16 | def test_vowel_a_2(self): # stressed, after a hard consonant, before 'л'
17 | testing_text = 'палка' 18 | testing_a_text = 'па+лка' 19 | ru_transcript = RuTranscript(testing_text, testing_a_text) 20 | ru_transcript.transcribe() 21 | print(testing_text, ru_transcript.get_allophones()) 22 | self.assertEqual(['p', 'ɑ', 'l', 'k', 'ʌ'], ru_transcript.get_allophones()) 23 | 24 | def test_vowel_a_3(self): # stressed, not after a hard consonant 25 | testing_text = 'пять' 26 | testing_a_text = 'пя+ть' 27 | ru_transcript = RuTranscript(testing_text, testing_a_text) 28 | ru_transcript.transcribe() 29 | print(testing_text, ru_transcript.get_allophones()) 30 | self.assertEqual(['pʲ', 'æ', 'tʲ'], ru_transcript.get_allophones()) 31 | 32 | def test_vowel_a_4(self): # stressed, after hissing consonants and 'ц' 33 | testing_text = 'цапнуть' 34 | testing_a_text = 'ца+пнуть' 35 | ru_transcript = RuTranscript(testing_text, testing_a_text) 36 | ru_transcript.transcribe() 37 | print(testing_text, ru_transcript.get_allophones()) 38 | self.assertEqual(['t͡s', 'ɐ.', 'p', 'nʷ', 'ʊ', 'tʲ'], ru_transcript.get_allophones()) 39 | 40 | def test_vowel_a_5(self): # pretonic, after a hard consonant or at the beginning of a word 41 | testing_text = 'паром' 42 | testing_a_text = 'паро+м' 43 | ru_transcript = RuTranscript(testing_text, testing_a_text) 44 | ru_transcript.transcribe() 45 | print(testing_text, ru_transcript.get_allophones()) 46 | self.assertEqual(['p', 'ɐ', 'rʷ', 'o', 'm'], ru_transcript.get_allophones()) 47 | 48 | def test_vowel_a_6(self): # pretonic, not after a hard consonant
49 | testing_text = 'тяжёлый' 50 | testing_a_text = 'тяжё+лый' 51 | ru_transcript = RuTranscript(testing_text, testing_a_text) 52 | ru_transcript.transcribe() 53 | print(testing_text, ru_transcript.get_allophones()) 54 | self.assertEqual(['tʲ', 'ɪ', 'ʐ', 'ɐ.', 'lˠ', 'ᵻ', 'j'], ru_transcript.get_allophones()) 55 | 56 | def test_vowel_a_7(self): # pretonic, after hissing consonants and 'ц' 57 | testing_text = 'жалеть' 58 | testing_a_text = 'жале+ть' 59 | ru_transcript = RuTranscript(testing_text, testing_a_text) 60 | ru_transcript.transcribe() 61 | print(testing_text, ru_transcript.get_allophones()) 62 | self.assertEqual(['ʐˠ', 'ᵻ', 'lʲ', 'e', 'tʲ'], ru_transcript.get_allophones()) 63 | 64 | def test_vowel_a_8(self): # second pretonic or post-tonic, after a hard consonant or at the beginning of a word 65 | testing_text = 'акварель' 66 | testing_a_text = 'акваре+ль' 67 | ru_transcript = RuTranscript(testing_text, testing_a_text) 68 | ru_transcript.transcribe() 69 | print(testing_text, ru_transcript.get_allophones()) 70 | self.assertEqual(['ə', 'k', 'v', 'ɐ', 'rʲ', 'e', 'lʲ'], ru_transcript.get_allophones()) 71 | 72 | def test_vowel_a_9(self): # second pretonic or post-tonic, after a hard consonant, in the final syllable 73 | testing_text = 'собака' 74 | testing_a_text = 'соба+ка' 75 | ru_transcript = RuTranscript(testing_text, testing_a_text) 76 | ru_transcript.transcribe() 77 | print(testing_text, ru_transcript.get_allophones()) 78 | self.assertEqual(['s', 'ɐ', 'b', 'a', 'k', 'ʌ'], ru_transcript.get_allophones()) 79 | 80 | def test_vowel_a_10(self): # second pretonic or post-tonic, not after a hard consonant, not in an ending
81 | testing_text = 'тяжеленный' 82 | testing_a_text = 'тяжеле+нный' 83 | ru_transcript = RuTranscript(testing_text, testing_a_text) 84 | ru_transcript.transcribe() 85 | print(testing_text, ru_transcript.get_allophones()) 86 | self.assertEqual(['tʲ', 'ɪ.', 'ʐ', 'ə', 'lʲ', 'e', 'nːˠ', 'ᵻ', 'j'], ru_transcript.get_allophones()) 87 | 88 | def test_vowel_a_11(self): # second pretonic or post-tonic, not after a hard consonant (only in an ending) 89 | testing_text = 'гуляя' 90 | testing_a_text = 'гуля+я' 91 | ru_transcript = RuTranscript(testing_text, testing_a_text) 92 | ru_transcript.transcribe() 93 | print(testing_text, ru_transcript.get_allophones()) 94 | self.assertEqual(['ɡʷ', 'ʊ', 'lʲ', 'æ', 'j', 'æ.'], ru_transcript.get_allophones()) 95 | 96 | def test_vowel_a_12(self): # second pretonic or post-tonic, after hissing consonants and 'ц' 97 | testing_text = 'дача' 98 | testing_a_text = 'да+ча' 99 | ru_transcript = RuTranscript(testing_text, testing_a_text) 100 | ru_transcript.transcribe() 101 | print(testing_text, ru_transcript.get_allophones()) 102 | self.assertEqual(['d', 'a', 't͡ɕ', 'ə'], ru_transcript.get_allophones()) 103 | 104 | def test_vowel_o_1(self): # stressed, after a hard consonant or at the beginning of a word 105 | testing_text = 'облако' 106 | testing_a_text = 'о+блако' 107 | ru_transcript = RuTranscript(testing_text, testing_a_text) 108 | ru_transcript.transcribe() 109 | print(testing_text, ru_transcript.get_allophones()) 110 | self.assertEqual(['o', 'b', 'l', 'ə', 'k', 'ʌ'], ru_transcript.get_allophones()) 111 | 112 | def test_vowel_o_2(self): # stressed, not after a hard consonant
113 | testing_text = 'тётя' 114 | testing_a_text = 'тё+тя' 115 | ru_transcript = RuTranscript(testing_text, testing_a_text) 116 | ru_transcript.transcribe() 117 | print(testing_text, ru_transcript.get_allophones()) 118 | self.assertEqual(['tᶣ', 'ɵ', 'tʲ', 'æ.'], ru_transcript.get_allophones()) 119 | 120 | def test_vowel_o_3(self): # stressed after sibilants and 'ц' 121 | testing_text = 'цокать' 122 | testing_a_text = 'цо+кать' 123 | ru_transcript = RuTranscript(testing_text, testing_a_text) 124 | ru_transcript.transcribe() 125 | print(testing_text, ru_transcript.get_allophones()) 126 | self.assertEqual(['t͡s', 'ɐ.', 'k', 'ə', 'tʲ'], ru_transcript.get_allophones()) 127 | 128 | def test_vowel_o_4(self): # pretonic after a hard consonant or word-initially 129 | testing_text = 'стопа' 130 | testing_a_text = 'стопа+' 131 | ru_transcript = RuTranscript(testing_text, testing_a_text) 132 | ru_transcript.transcribe() 133 | print(testing_text, ru_transcript.get_allophones()) 134 | self.assertEqual(['s', 't', 'ɐ', 'p', 'a'], ru_transcript.get_allophones()) 135 | 136 | def test_vowel_o_5(self): # pretonic, not after a hard consonant 137 | testing_text = 'йодированный' 138 | testing_a_text = 'йоди+рованный' 139 | ru_transcript = RuTranscript(testing_text, testing_a_text) 140 | ru_transcript.transcribe() 141 | print(testing_text, ru_transcript.get_allophones()) 142 | self.assertEqual(['ʝ', 'ɪ', 'dʲ', 'i', 'r', 'ə', 'v', 'ə', 'nːˠ', 'ᵻ', 'j'], ru_transcript.get_allophones()) 143 | 144 | def test_vowel_o_6(self): # pretonic after sibilants and 'ц' 145 | testing_text = 'шокировать' 146 | testing_a_text = 'шоки+ровать' 147 | ru_transcript = RuTranscript(testing_text, testing_a_text) 148 | ru_transcript.transcribe() 149 | print(testing_text, ru_transcript.get_allophones()) 150 | self.assertEqual(['ʂˠ', 'ᵻ', 'kʲ', 'i', 'r', 'ə', 'v', 'ə', 'tʲ'], ru_transcript.get_allophones()) 151 | 152 | def test_vowel_o_7(self): # second pretonic or post-tonic after a hard consonant
or syllable-initially 153 | testing_text = 'молоко' 154 | testing_a_text = 'молоко+' 155 | ru_transcript = RuTranscript(testing_text, testing_a_text) 156 | ru_transcript.transcribe() 157 | print(testing_text, ru_transcript.get_allophones()) 158 | self.assertEqual(['m', 'ə', 'l', 'ɐ', 'kʷ', 'o'], ru_transcript.get_allophones()) 159 | 160 | def test_vowel_o_8(self): # second pretonic or post-tonic after a hard consonant in the final syllable 161 | testing_text = 'озеро' 162 | testing_a_text = 'о+зеро' 163 | ru_transcript = RuTranscript(testing_text, testing_a_text) 164 | ru_transcript.transcribe() 165 | print(testing_text, ru_transcript.get_allophones()) 166 | self.assertEqual(['o', 'zʲ', 'ɪ.', 'r', 'ʌ'], ru_transcript.get_allophones()) 167 | 168 | def test_vowel_o_9(self): # second pretonic or post-tonic, not after a hard consonant 169 | testing_text = 'огайо' 170 | testing_a_text = 'ога+йо' 171 | ru_transcript = RuTranscript(testing_text, testing_a_text) 172 | ru_transcript.transcribe() 173 | print(testing_text, ru_transcript.get_allophones()) 174 | self.assertEqual(['ɐ', 'ɡ', 'a', 'j', 'æ.'], ru_transcript.get_allophones()) 175 | 176 | def test_vowel_o_10(self): # second pretonic or post-tonic after sibilants and 'ц' 177 | testing_text = 'шоколад' 178 | testing_a_text = 'шокола+д' 179 | ru_transcript = RuTranscript(testing_text, testing_a_text) 180 | ru_transcript.transcribe() 181 | print(testing_text, ru_transcript.get_allophones()) 182 | self.assertEqual(['ʂ', 'ə', 'k', 'ɐ', 'l', 'a', 't'], ru_transcript.get_allophones()) 183 | 184 | def test_vowel_e_1(self): # stressed after a hard consonant or word-initially 185 | testing_text = 'это' 186 | testing_a_text = 'э+то' 187 | ru_transcript = RuTranscript(testing_text, testing_a_text) 188 | ru_transcript.transcribe() 189 | print(testing_text, ru_transcript.get_allophones()) 190 | self.assertEqual(['ɛ', 't', 'ʌ'], ru_transcript.get_allophones()) 191 | 192 | def test_vowel_e_2(self): # stressed, not after a hard consonant
193 | testing_text = 'пень' 194 | testing_a_text = 'пе+нь' 195 | ru_transcript = RuTranscript(testing_text, testing_a_text) 196 | ru_transcript.transcribe() 197 | print(testing_text, ru_transcript.get_allophones()) 198 | self.assertEqual(['pʲ', 'e', 'nʲ'], ru_transcript.get_allophones()) 199 | 200 | def test_vowel_e_3(self): # stressed after sibilants and 'ц' 201 | testing_text = 'шест' 202 | testing_a_text = 'ше+ст' 203 | ru_transcript = RuTranscript(testing_text, testing_a_text) 204 | ru_transcript.transcribe() 205 | print(testing_text, ru_transcript.get_allophones()) 206 | self.assertEqual(['ʂˠ', 'ᵻ', 's', 't'], ru_transcript.get_allophones()) 207 | 208 | def test_vowel_e_4(self): # pretonic after a hard consonant or word-initially (ыэ) 209 | testing_text = 'этап' 210 | testing_a_text = 'эта+п' 211 | ru_transcript = RuTranscript(testing_text, testing_a_text) 212 | ru_transcript.transcribe() 213 | print(testing_text, ru_transcript.get_allophones()) 214 | self.assertEqual(['ᵻ', 't', 'a', 'p'], ru_transcript.get_allophones()) 215 | 216 | def test_vowel_e_5(self): # pretonic, not after a hard consonant and not word-initially 217 | testing_text = 'велюр' 218 | testing_a_text = 'велю+р' 219 | ru_transcript = RuTranscript(testing_text, testing_a_text) 220 | ru_transcript.transcribe() 221 | print(testing_text, ru_transcript.get_allophones()) 222 | self.assertEqual(['vʲ', 'ɪ', 'lᶣ', 'ʉ', 'r̥'], ru_transcript.get_allophones()) 223 | 224 | def test_vowel_e_6(self): # second pretonic or post-tonic, not after a hard consonant
or word-initially 225 | testing_text = 'пепел' 226 | testing_a_text = 'пе+пел' 227 | ru_transcript = RuTranscript(testing_text, testing_a_text) 228 | ru_transcript.transcribe() 229 | print(testing_text, ru_transcript.get_allophones()) 230 | self.assertEqual(['pʲ', 'e', 'pʲ', 'ɪ.', 'l'], ru_transcript.get_allophones()) 231 | 232 | def test_vowel_e_7(self): # pretonic, second pretonic or post-tonic after sibilants and 'ц' 233 | testing_text = 'шелестеть' 234 | testing_a_text = 'шелесте+ть' 235 | ru_transcript = RuTranscript(testing_text, testing_a_text) 236 | ru_transcript.transcribe() 237 | print(testing_text, ru_transcript.get_allophones()) 238 | self.assertEqual(['ʂ', 'ə', 'lʲ', 'ɪ', 'sʲ', 'tʲ', 'e', 'tʲ'], ru_transcript.get_allophones()) 239 | 240 | def test_vowel_u_1(self): # stressed after a hard consonant 241 | testing_text = 'пуля' 242 | testing_a_text = 'пу+ля' 243 | ru_transcript = RuTranscript(testing_text, testing_a_text) 244 | ru_transcript.transcribe() 245 | print(testing_text, ru_transcript.get_allophones()) 246 | self.assertEqual(['pʷ', 'u', 'lʲ', 'æ.'], ru_transcript.get_allophones()) 247 | 248 | def test_vowel_u_2(self): # stressed after a soft consonant 249 | testing_text = 'чуть' 250 | testing_a_text = 'чу+ть' 251 | ru_transcript = RuTranscript(testing_text, testing_a_text) 252 | ru_transcript.transcribe() 253 | print(testing_text, ru_transcript.get_allophones()) 254 | self.assertEqual(['t͡ɕᶣ', 'ʉ', 'tʲ'], ru_transcript.get_allophones()) 255 | 256 | def test_vowel_u_3(self): # pretonic after a hard consonant 257 | testing_text = 'мужчина' 258 | testing_a_text = 'мужчи+на' 259 | ru_transcript = RuTranscript(testing_text, testing_a_text) 260 | ru_transcript.transcribe() 261 | print(testing_text, ru_transcript.get_allophones()) 262 | self.assertEqual(['mʷ', 'ʊ', 'ɕː', 'i', 'n', 'ʌ'], ru_transcript.get_allophones()) 263 | 264 | def test_vowel_u_4(self): # pretonic, not after a hard consonant
265 | testing_text = 'ютиться' 266 | testing_a_text = 'юти+ться' 267 | ru_transcript = RuTranscript(testing_text, testing_a_text) 268 | ru_transcript.transcribe() 269 | print(testing_text, ru_transcript.get_allophones()) 270 | self.assertEqual(['ʝᶣ', 'ᵿ', 'tʲ', 'i', 't͡s', 'ə'], ru_transcript.get_allophones()) 271 | 272 | def test_vowel_u_5(self): # second pretonic or post-tonic after a hard consonant 273 | testing_text = 'музыкальный' 274 | testing_a_text = 'музыка+льный' 275 | ru_transcript = RuTranscript(testing_text, testing_a_text) 276 | ru_transcript.transcribe() 277 | print(testing_text, ru_transcript.get_allophones()) 278 | self.assertEqual(['mʷ', 'ʊ', 'zˠ', 'ᵻ', 'k', 'a', 'lʲ', 'nˠ', 'ᵻ', 'j'], ru_transcript.get_allophones()) 279 | 280 | def test_vowel_u_6(self): # second pretonic or post-tonic, not after a hard consonant 281 | testing_text = 'чумовой' 282 | testing_a_text = 'чумово+й' 283 | ru_transcript = RuTranscript(testing_text, testing_a_text) 284 | ru_transcript.transcribe() 285 | print(testing_text, ru_transcript.get_allophones()) 286 | self.assertEqual(['t͡ɕᶣ', 'ᵿ', 'm', 'ɐ', 'vʷ', 'o', 'j'], ru_transcript.get_allophones()) 287 | 288 | def test_vowel_i_1(self): # stressed before a soft consonant
289 | testing_text = 'синего' 290 | testing_a_text = 'си+него' 291 | ru_transcript = RuTranscript(testing_text, testing_a_text) 292 | ru_transcript.transcribe() 293 | print(testing_text, ru_transcript.get_allophones()) 294 | self.assertEqual(['sʲ', 'i', 'nʲ', 'ɪ.', 'v', 'ʌ'], ru_transcript.get_allophones()) 295 | 296 | def test_vowel_i_2(self): # stressed, pretonic, second pretonic or post-tonic after sibilants and 'ц' 297 | testing_text = 'жизнь' 298 | testing_a_text = 'жи+знь' 299 | ru_transcript = RuTranscript(testing_text, testing_a_text) 300 | ru_transcript.transcribe() 301 | print(testing_text, ru_transcript.get_allophones()) 302 | self.assertEqual(['ʐˠ', 'ɨ', 'zʲ', 'nʲ'], ru_transcript.get_allophones()) 303 | 304 | def test_vowel_i_3(self): # pretonic, second pretonic or post-tonic, not after a vowel or word-initially 305 | testing_text = 'синица' 306 | testing_a_text = 'сини+ца' 307 | ru_transcript = RuTranscript(testing_text, testing_a_text) 308 | ru_transcript.transcribe() 309 | print(testing_text, ru_transcript.get_allophones()) 310 | self.assertEqual(['sʲ', 'ɪ', 'nʲ', 'i', 't͡s', 'ə'], ru_transcript.get_allophones()) 311 | 312 | def test_vowel_ii_1(self): # stressed after a hard consonant
313 | testing_text = 'ты' 314 | testing_a_text = 'ты+' 315 | ru_transcript = RuTranscript(testing_text, testing_a_text) 316 | ru_transcript.transcribe() 317 | print(testing_text, ru_transcript.get_allophones()) 318 | self.assertEqual(['tˠ', 'ɨ'], ru_transcript.get_allophones()) 319 | 320 | def test_vowel_ii_2(self): # stressed between coronal and velar consonants 321 | testing_text = 'тыкать' 322 | testing_a_text = 'ты+кать' 323 | ru_transcript = RuTranscript(testing_text, testing_a_text) 324 | ru_transcript.transcribe() 325 | print(testing_text, ru_transcript.get_allophones()) 326 | self.assertEqual(['tˠ', 'ɨ̟', 'k', 'ə', 'tʲ'], ru_transcript.get_allophones()) 327 | 328 | def test_vowel_ii_3(self): # stressed after a labial consonant + 'л' cluster 329 | testing_text = 'плыть' 330 | testing_a_text = 'плы+ть' 331 | ru_transcript = RuTranscript(testing_text, testing_a_text) 332 | ru_transcript.transcribe() 333 | print(testing_text, ru_transcript.get_allophones()) 334 | self.assertEqual(['p', 'lˠ', 'ɯ̟ɨ̟', 'tʲ'], ru_transcript.get_allophones()) 335 | 336 | def test_vowel_ii_4(self): # pretonic, second pretonic or post-tonic, not after 'ц' 337 | testing_text = 'чтобы' 338 | testing_a_text = 'что+бы' 339 | ru_transcript = RuTranscript(testing_text, testing_a_text) 340 | ru_transcript.transcribe() 341 | print(testing_text, ru_transcript.get_allophones()) 342 | self.assertEqual(['ʂ', 'tʷ', 'o', 'bˠ', 'ᵻ'], ru_transcript.get_allophones()) 343 | 344 | def test_vowel_ii_5(self): # pretonic, second pretonic or post-tonic after 'ц' 345 | testing_text = 'танцы' 346 | testing_a_text = 'та+нцы' 347 | ru_transcript = RuTranscript(testing_text, testing_a_text) 348 | ru_transcript.transcribe() 349 | print(testing_text, ru_transcript.get_allophones()) 350 | self.assertEqual(['t', 'a', 'n', 't͡s', 'ə'], ru_transcript.get_allophones()) 351 | 352 | def test_jotised_1(self): 353 | testing_text = 'бульон' 354 | testing_a_text = 'бульо+н' 355 | ru_transcript =
RuTranscript(testing_text, testing_a_text) 356 | ru_transcript.transcribe() 357 | print(testing_text, ru_transcript.get_allophones()) 358 | self.assertEqual(['bʷ', 'ʊ', 'lʲ', 'jᶣ', 'ɵ', 'n'], ru_transcript.get_allophones()) 359 | 360 | 361 | if __name__ == '__main__': 362 | unittest.main() 363 | -------------------------------------------------------------------------------- /src/tools/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/suralmasha/RuTranscript/30cbc40c5ac368021bcc8a05002fa33cc50ee9b6/src/tools/__init__.py -------------------------------------------------------------------------------- /src/tools/allophones_tools.py: -------------------------------------------------------------------------------- 1 | import spacy 2 | 3 | from .sounds import allophones, rus_v 4 | 5 | nlp = spacy.load('ru_core_news_sm', disable=["tagger", "morphologizer", "attribute_ruler"]) 6 | 7 | 8 | def get_allophone_info(allophone): 9 | return allophones[allophone] 10 | 11 | 12 | def shch(section: list): 13 | section_copy = section.copy() 14 | for i, current_phon in enumerate(section_copy[:-1]): 15 | try: 16 | next_phon = section_copy[i + 1] 17 | except IndexError: 18 | next_phon = '' 19 | try: 20 | two_current = (section_copy[i], section_copy[i + 1]) 21 | except IndexError: 22 | two_current = '' 23 | 24 | next_allophone = allophones[next_phon] 25 | if ((current_phon == 'ʐ') and (next_allophone.get('voice', '') == 'voiceless') and (next_phon != 's')) \ 26 | or (two_current in {('s', 't͡ɕ'), ('z', 't͡ɕ'), ('ʐ', 't͡ɕ')}): 27 | section_copy[i] = 'ɕː' 28 | del section_copy[i + 1] 29 | 30 | return section_copy 31 | 32 | 33 | def long_ge(section: list): 34 | section_copy = section.copy() 35 | for i, current_phon in enumerate(section_copy[:-1]): 36 | try: 37 | next_phon = section_copy[i + 1] 38 | except IndexError: 39 | next_phon = '' 40 | try: 41 | two_current = (section_copy[i], section_copy[i + 1]) 42 | 
except IndexError: 43 | two_current = '' 44 | 45 | next_allophone = allophones[next_phon] 46 | if two_current in [('ʐ', 'ʐ'), ('z', 'ʐ')]: 47 | section_copy[i] = 'ʑː' 48 | del section_copy[i + 1] 49 | elif (current_phon == 'ɕː') and (next_allophone.get('voice', '') == 'voiced') \ 50 | and ('nasal' not in next_allophone.get('manner', '')): 51 | section_copy[i] = 'ʑː' 52 | 53 | return section_copy 54 | 55 | 56 | def nasal_m_n(section: list): 57 | section_copy = section.copy() 58 | for i, current_phon in enumerate(section_copy[:-1]): 59 | try: 60 | if allophones[section_copy[i + 1]].get('place', '') != 'labial, labiodental': 61 | continue 62 | except IndexError: 63 | break 64 | 65 | if current_phon in ['m', 'n']: 66 | section_copy[i] = 'ɱ' 67 | elif current_phon in ['mʲ', 'nʲ']: 68 | section_copy[i] = 'ɱʲ' 69 | 70 | return section_copy 71 | 72 | 73 | def silent_r(section: list): 74 | section_copy = section.copy() 75 | for i, current_phon in enumerate(section_copy): 76 | try: 77 | if (i < len(section_copy) - 1) and (allophones[section_copy[i + 1]].get('voice', '') != 'voiceless'): 78 | continue 79 | except IndexError: 80 | break 81 | 82 | if current_phon == 'r': 83 | section_copy[i] = 'r̥' 84 | elif current_phon == 'rʲ': 85 | section_copy[i] = 'r̥ʲ' 86 | 87 | return section_copy 88 | 89 | 90 | def voiced_ts(section: list): 91 | section_copy = section.copy() 92 | for i, current_phon in enumerate(section_copy): 93 | try: 94 | if allophones[section_copy[i + 1]].get('voice', '') != 'voiced': 95 | continue 96 | except IndexError: 97 | break 98 | 99 | if current_phon == 't͡s': 100 | section_copy[i] = 'd̻͡z̪' 101 | 102 | return section_copy 103 | 104 | 105 | def first_jot(phonemes_list_section): 106 | phonemes_list_section_copy = phonemes_list_section.copy() 107 | if phonemes_list_section_copy[0] == 'j': 108 | phonemes_list_section_copy[0] = 'ʝ' 109 | 110 | return phonemes_list_section_copy 111 | 112 | 113 | def fix_jotised(phonemes_list_section, letters_list_section): 114 | 
phonemes_list_section_copy = phonemes_list_section.copy() 115 | # ---- jotised vowels and i ---- 116 | phonemes_list_to_iterate = phonemes_list_section_copy[:] 117 | letters_list_to_iterate = letters_list_section[:] 118 | 119 | for i, let in enumerate(letters_list_to_iterate): 120 | try: 121 | next_let = letters_list_to_iterate[i + 1] 122 | except IndexError: 123 | next_let = '' 124 | if (let == 'д') and (next_let == 'ж'): 125 | del letters_list_to_iterate[i + 1] 126 | letters_list_to_iterate[i] = 'дж' 127 | elif next_let in ['ь', 'ъ']: 128 | del letters_list_to_iterate[i + 1] 129 | letters_list_to_iterate[i] = letters_list_to_iterate[i] + next_let 130 | 131 | n = 0 132 | for i, current_phon in enumerate(phonemes_list_to_iterate): 133 | sub_symb = False 134 | current_allophone = allophones[current_phon] 135 | if current_allophone['phon'] == 'symb': 136 | continue 137 | if current_phon == 'j' and letters_list_to_iterate[i] != 'й': 138 | letters_list_to_iterate.insert(i, 'й') 139 | current_let = letters_list_to_iterate[i] 140 | try: 141 | if allophones[phonemes_list_to_iterate[i - 1]]['phon'] != 'symb': 142 | previous_let = letters_list_to_iterate[i - 1] 143 | previous_phon = phonemes_list_to_iterate[i - 1] 144 | else: 145 | previous_let = letters_list_to_iterate[i - 2] 146 | previous_phon = phonemes_list_to_iterate[i - 2] 147 | sub_symb = True 148 | except IndexError: 149 | previous_let = '' 150 | previous_phon = '' 151 | try: 152 | next_let = letters_list_to_iterate[i + 1] 153 | except IndexError: 154 | next_let = '' 155 | try: 156 | after_next_let = letters_list_to_iterate[i + 2] 157 | except IndexError: 158 | after_next_let = '' 159 | 160 | previous_allophone = allophones[previous_phon] 161 | if (current_let == 'о') and (previous_let[-1] == 'ь') and (next_let == '+'): 162 | phonemes_list_section_copy.insert(i + n, 'j') 163 | n += 1 164 | 165 | elif current_let in 'ё е я ю'.split(): 166 | if previous_let[-1] in ['ь', 'ъ']: 167 | if (previous_allophone['phon'] == 
'C')\ 168 | and ('ʲ' not in previous_phon)\ 169 | and (previous_allophone['palatalization'][0] != 'a'): 170 | phonemes_list_section_copy[i + n - 1 - sub_symb] = previous_phon + 'ʲ' 171 | phonemes_list_section_copy.insert(i + n, 'j') 172 | n += 1 173 | 174 | elif previous_let in rus_v: 175 | phonemes_list_section_copy.insert(i + n, 'j') 176 | n += 1 177 | 178 | elif (current_let != 'э') \ 179 | and (previous_allophone['phon'] == 'C') \ 180 | and ('ʲ' not in previous_phon) \ 181 | and ('a' not in previous_allophone['palatalization'][0]): 182 | phonemes_list_section_copy[i + n - 1 - sub_symb] = previous_phon + 'ʲ' 183 | 184 | elif (after_next_let == '+') and (previous_phon != 'j'): 185 | phonemes_list_section_copy.insert(i + n, 'j') 186 | n += 1 187 | 188 | elif current_let == 'и': 189 | if sub_symb and (phonemes_list_to_iterate[i - 1] == '_') and (previous_allophone['phon'] == 'C'): 190 | phonemes_list_section_copy[i + n] = 'ɨ' 191 | 192 | elif previous_let[-1] in {'ь', 'ъ'}: 193 | phonemes_list_section_copy.insert(i + n, 'j') 194 | n += 1 195 | 196 | elif (previous_allophone['phon'] == 'C') \ 197 | and ('ʲ' not in previous_phon) \ 198 | and (previous_allophone['palatalization'][0] != 'a'): 199 | phonemes_list_section_copy[i + n - 1 - sub_symb] = previous_phon + 'ʲ' 200 | 201 | return phonemes_list_section_copy 202 | 203 | 204 | def assimilative_palatalization(tokens_section, phonemes_list_section): 205 | phonemes_list_section_copy = phonemes_list_section.copy() 206 | exceptions = 'сосиска злить после ёлка день транскрипция джаз неуклюжий шахтёр'.split() 207 | 208 | token_index = 0 209 | token = tokens_section[token_index] 210 | nlp_token = nlp(token)[0] 211 | lemma = nlp_token.lemma_ 212 | 213 | for i, current_phon in enumerate(phonemes_list_section_copy): 214 | if current_phon == '_': 215 | token_index += 1 216 | token = tokens_section[token_index] 217 | nlp_token = nlp(token)[0] 218 | lemma = nlp_token.lemma_ 219 | 220 | current_allophone = 
allophones[current_phon] 221 | if (lemma not in exceptions) and ('i+zm' not in token): 222 | try: 223 | n = 1 224 | next_phon = phonemes_list_section_copy[i + n] 225 | next_allophone = allophones[next_phon] 226 | while next_allophone['phon'] == 'symb': 227 | n += 1 228 | next_phon = phonemes_list_section_copy[i + n] 229 | next_allophone = allophones[next_phon] 230 | except IndexError: 231 | next_phon = '' 232 | next_allophone = allophones[next_phon] 233 | 234 | # no softening before [л] (для, глина, длинный, блин, злиться, влить, тлеть) 235 | if 'l' in next_phon: 236 | continue 237 | 238 | # non-softening of dentals before soft labiodentals predominates ([д’в’]е́рь - [дв’]е́рь) 239 | elif (current_allophone.get('place', '') == 'lingual, dental') \ 240 | and (next_allophone.get('place', '') == 'labial, labiodental'): 241 | continue 242 | 243 | # non-softening of labials before soft labials predominates (лю[б’в’]и́ - лю[бв’]и́) 244 | elif (current_allophone.get('place', '') == 'labial, bilabial')\ 245 | and (next_allophone.get('place', '') == 'labial, bilabial') and lemma != 'лобби': 246 | continue 247 | 248 | # no softening of labials and dentals before soft velars (гри[пк’]и́; ко́[фт’]е) 249 | elif (current_allophone.get('place', '') in ['lingual, dental', 'labial, bilabial'])\ 250 | and (next_allophone.get('place', '') == 'lingual, velar'): 251 | continue 252 | 253 | # no softening of [р], [г] before soft consonants (а[рт’]и́ст, а[гн’]ия) 254 | elif ('r' in current_phon) or ('ɡ' in current_phon): 255 | continue 256 | 257 | # no softening of [т], [з], [к] before [р] ([тр’]и́, тряска, зрелый, транскрипция) 258 | elif ('t' in current_phon or 'z' in current_phon or 'k' in current_phon)\ 259 | and (next_phon in {'rʲ', 'rʲː', 'r̥ʲ'}): 260 | continue 261 | 262 | elif (current_allophone['phon'] == 'C') and (current_allophone.get('palatalization', ' ')[0] != 'a')\ 263 | and ('ʲ' not in current_phon) and ('soft' in next_allophone.get('palatalization', '')): 264 |
phonemes_list_section_copy[i] = current_phon + 'ʲ' 265 | 266 | return phonemes_list_section_copy 267 | 268 | 269 | def long_consonants(phonemes_list_section): 270 | n = 0 271 | phonemes_list_to_iterate = phonemes_list_section[:] 272 | for i, current_phon in enumerate(phonemes_list_to_iterate): 273 | add_symb = False 274 | try: 275 | if allophones[phonemes_list_to_iterate[i + 1]]['phon'] != 'symb': 276 | next_phon = phonemes_list_to_iterate[i + 1] 277 | else: 278 | next_phon = phonemes_list_to_iterate[i + 2] 279 | add_symb = True 280 | except IndexError: 281 | next_phon = '' 282 | 283 | if (current_phon[0] in 'ʂbpfkstrlmngdz') and (current_phon == next_phon): 284 | del phonemes_list_section[i + n] 285 | del phonemes_list_section[i + n + add_symb] 286 | phonemes_list_section.insert(i + n, current_phon + 'ː') 287 | n -= 1 288 | 289 | return phonemes_list_section 290 | 291 | 292 | ts = {'t͡s', 't͡sʷ', 't͡sˠ', 'd͡ʒᶣ', 'd͡ʒˠ', 'd̻͡z̪', 'd͡ʒ'} 293 | zh_sh_ts = {'ʐ', 'ʐʷ', 'ʐˠ', 'ʑː', 'ʑːʷ', 'ʑːˠ', 'ʑʲː', 'ʑːᶣ', 294 | 'ʂ', 'ʂʷ', 'ʂˠ', 'ʂː', 'ʂːʷ', 'ʂːˠ', 295 | 't͡s', 't͡sʷ', 't͡sˠ', 'd͡ʒᶣ', 'd͡ʒˠ', 'd̻͡z̪', 'd͡ʒ'} 296 | 297 | 298 | def stunning(segment: list): 299 | segment_copy = segment.copy() 300 | for i, current_phon in enumerate(segment_copy): 301 | try: 302 | if (i < len(segment_copy) - 1) and (segment_copy[i + 1] != '_'): 303 | continue 304 | except IndexError: 305 | break 306 | try: 307 | if (i < len(segment_copy) - 1) and ((allophones[segment_copy[i + 2]].get('voice', '') == 'voiced') 308 | or (allophones[segment_copy[i + 2]]['phon'] == 'V')): 309 | continue 310 | except IndexError: 311 | break 312 | 313 | allophone_info = allophones[current_phon] 314 | pair = allophone_info.get('pair', None) 315 | if (allophone_info.get('voice', '') == 'voiced') and (pair is not None): 316 | segment_copy[i] = pair 317 | 318 | return segment_copy 319 | 320 | 321 | def vowels(section: list): 322 | section_copy = section.copy() 323 | for i, current_phon in enumerate(section_copy): 
324 | try: 325 | next_phon = section_copy[i + 1] 326 | except IndexError: 327 | next_phon = '' 328 | try: 329 | after_next_phon = section_copy[i + 2] 330 | except IndexError: 331 | after_next_phon = '' 332 | try: 333 | previous_phon = section_copy[i - 1] 334 | except IndexError: 335 | previous_phon = '' 336 | try: 337 | after_previous_phon = section_copy[i - 2] 338 | except IndexError: 339 | after_previous_phon = '' 340 | 341 | previous_allophone = allophones[previous_phon] 342 | after_next_allophone = allophones[after_next_phon] 343 | after_previous_allophone = allophones[after_previous_phon] 344 | if current_phon == 'a': 345 | if (i != len(section_copy) - 1) and (next_phon != '_') \ 346 | and (i != 0) and (previous_phon != '_'): # not last, not first 347 | 348 | if next_phon == '+': # stressed (not last, not first) 349 | if previous_phon in zh_sh_ts: 350 | section_copy[i] = 'ɐ.' 351 | elif ('hard' in previous_allophone.get('palatalization', '')) and (after_next_phon == 'l'): 352 | section_copy[i] = 'ɑ' 353 | elif 'hard' in previous_allophone.get('palatalization', ''): 354 | section_copy[i] = 'a' 355 | else: 356 | section_copy[i] = 'æ' 357 | 358 | elif next_phon == '-': # first pretonic (not last, not first) 359 | if previous_phon in zh_sh_ts: 360 | section_copy[i] = 'ᵻ' 361 | elif (previous_allophone['phon'] == 'C') and ('hard' in previous_allophone['palatalization']): 362 | section_copy[i] = 'ɐ' 363 | else: 364 | section_copy[i] = 'ɪ' 365 | 366 | else: # post-tonic / second pretonic (not last, not first) 367 | if ((previous_allophone.get('hissing', '')) == 'hissing' or (previous_phon in ts) 368 | or ('hard' in previous_allophone.get('palatalization', ''))) \ 369 | or (previous_allophone['phon'] == 'V'): 370 | section_copy[i] = 'ə' 371 | else: 372 | section_copy[i] = 'ɪ.'
373 | 374 | elif (i == len(section_copy) - 1) or (next_phon == '_'): # post-tonic (last) 375 | if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): 376 | section_copy[i] = 'ə' 377 | elif 'hard' in previous_allophone.get('palatalization', ''): 378 | section_copy[i] = 'ʌ' 379 | else: 380 | section_copy[i] = 'æ.' 381 | 382 | else: 383 | if next_phon == '-': 384 | section_copy[i] = 'ɐ' # first pretonic (first) 385 | elif next_phon != '+': 386 | section_copy[i] = 'ə' # post-tonic / second pretonic (first) 387 | 388 | elif current_phon == 'o': 389 | if (i != len(section_copy) - 1) and (next_phon != '_') \ 390 | and (i != 0) and (previous_phon != '_'): # not last, not first 391 | 392 | if next_phon == '+': # stressed (not last, not first) 393 | if previous_phon in zh_sh_ts: 394 | section_copy[i] = 'ɐ.' 395 | elif ('soft' in previous_allophone.get('palatalization', '')) \ 396 | or (previous_allophone['phon'] == 'V'): 397 | section_copy[i] = 'ɵ' 398 | 399 | elif next_phon == '-': # first pretonic (not last, not first) 400 | if previous_phon in zh_sh_ts: 401 | section_copy[i] = 'ᵻ' 402 | elif 'hard' in previous_allophone.get('palatalization', ''): 403 | section_copy[i] = 'ɐ' 404 | else: 405 | section_copy[i] = 'ɪ' 406 | 407 | else: # post-tonic / second pretonic (not last, not first) 408 | if ((previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts) 409 | or ('hard' in previous_allophone.get('palatalization', ''))) \ 410 | or (previous_allophone['phon'] == 'V'): 411 | section_copy[i] = 'ə' 412 | else: 413 | section_copy[i] = 'ɪ.' 414 | 415 | elif (i == len(section_copy) - 1) or (next_phon == '_'): # post-tonic (last) 416 | if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): 417 | section_copy[i] = 'ə' 418 | elif 'hard' in previous_allophone.get('palatalization', ''): 419 | section_copy[i] = 'ʌ' 420 | else: 421 | section_copy[i] = 'æ.'
422 | 423 | else: 424 | if next_phon == '-': 425 | section_copy[i] = 'ɐ' # first pretonic (first) 426 | elif next_phon != '+': 427 | section_copy[i] = 'ə' # post-tonic / second pretonic (first) 428 | 429 | elif current_phon == 'e': 430 | if (i != len(section_copy) - 1) and (next_phon != '_') \ 431 | and (i != 0) and (previous_phon != '_'): # not last, not first 432 | 433 | if next_phon == '+': # stressed (not last, not first) 434 | if previous_phon in zh_sh_ts: 435 | section_copy[i] = 'ᵻ' 436 | elif 'hard' in previous_allophone.get('palatalization', ''): 437 | section_copy[i] = 'ɛ' 438 | 439 | elif next_phon == '-': # first pretonic (not last, not first) 440 | if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): 441 | section_copy[i] = 'ə' 442 | elif 'hard' in previous_allophone.get('palatalization', ''): 443 | section_copy[i] = 'ᵻ' 444 | else: 445 | section_copy[i] = 'ɪ' 446 | 447 | else: # post-tonic / second pretonic (not last, not first) 448 | if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): 449 | section_copy[i] = 'ə' 450 | elif 'hard' in previous_allophone.get('palatalization', ''): 451 | section_copy[i] = 'ᵻ' 452 | else: 453 | section_copy[i] = 'ɪ.' 454 | 455 | elif (i == len(section_copy) - 1) or (next_phon == '_'): # post-tonic (last) 456 | if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): 457 | section_copy[i] = 'ə' 458 | elif 'hard' in previous_allophone.get('palatalization', ''): 459 | section_copy[i] = 'ᵻ' 460 | else: 461 | section_copy[i] = 'æ.' 462 | 463 | else: 464 | if next_phon == '+': 465 | section_copy[i] = 'ɛ' # stressed (first) 466 | elif next_phon == '-': 467 | section_copy[i] = 'ᵻ' # first pretonic (first) 468 | else: 469 | section_copy[i] = 'ɪ.'
# post-tonic / second pretonic (first) 470 | 471 | elif current_phon == 'u': 472 | if (i != len(section_copy) - 1) and (next_phon != '_'): # not last 473 | 474 | if next_phon == '+': # stressed (not last) 475 | if 'soft' in previous_allophone.get('palatalization', ''): 476 | section_copy[i] = 'ʉ' 477 | 478 | else: # first / second pretonic / post-tonic (not last) 479 | if 'hard' in previous_allophone.get('palatalization', ''): 480 | section_copy[i] = 'ʊ' 481 | else: 482 | section_copy[i] = 'ᵿ' 483 | 484 | else: # first / second pretonic / post-tonic (last) 485 | if 'hard' in previous_allophone.get('palatalization', ''): 486 | section_copy[i] = 'ʊ' 487 | else: 488 | section_copy[i] = 'ᵿ' 489 | 490 | elif (current_phon == 'i') and (previous_allophone['phon'] == 'C'): 491 | # after ж, ш, ц 492 | if previous_phon in zh_sh_ts: 493 | section_copy[i] = 'ɨ' 494 | elif next_phon != '+': # unstressed 495 | section_copy[i] = 'ɪ' 496 | 497 | elif current_phon == 'ɨ': 498 | if (i != len(section_copy) - 1) and (next_phon != '_'): # not last 499 | 500 | if next_phon == '+': # stressed (not last) 501 | if (previous_phon == 'l') and (len(section_copy) > 4) \ 502 | and ('lab' in after_previous_allophone.get('place', '')): 503 | section_copy[i] = 'ɯ̟ɨ̟' 504 | elif (previous_allophone.get('place', '') == 'lingual, dental' 505 | and after_next_allophone.get('place', '') == 'lingual, velar')\ 506 | or (previous_allophone.get('place', '') == 'lingual, palatinоdental' 507 | and after_next_allophone.get('place', '') == 'lingual, velar'): 508 | section_copy[i] = 'ɨ̟' 509 | 510 | # pretonic / post-tonic (not last) 511 | elif (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): 512 | section_copy[i] = 'ə' 513 | else: 514 | section_copy[i] = 'ᵻ' 515 | 516 | elif (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts): # post-tonic (last) 517 | section_copy[i] = 'ə' 518 | else: 519 | section_copy[i] = 'ᵻ' 520 | 521 | return section_copy 522
| 523 | 524 | def labia_velar(segment: list): 525 | result_segment = [] 526 | for i, current_phon in enumerate(segment): 527 | if i != 0: 528 | previous_phon = segment[i - 1] 529 | else: 530 | previous_phon = '' 531 | 532 | current_allophone = allophones[current_phon] 533 | previous_allophone = allophones[previous_phon] 534 | if (i != 0) and (current_allophone.get('round', '') == 'round') and (previous_phon != '_')\ 535 | and (previous_allophone['phon'] == 'C') and ('ʷ' not in previous_phon) and ('ᶣ' not in previous_phon): 536 | if 'ʲ' in previous_phon: 537 | new = previous_phon.replace('ʲ', '') + 'ᶣ' 538 | if new in allophones.keys(): 539 | del result_segment[-1] 540 | result_segment.append(new) 541 | result_segment.append(current_phon) 542 | elif previous_allophone.get('palatalization', '') == 'asoft': 543 | new = previous_phon + 'ᶣ' 544 | if new in allophones.keys(): 545 | del result_segment[-1] 546 | result_segment.append(new) 547 | result_segment.append(current_phon) 548 | else: 549 | new = previous_phon + 'ʷ' 550 | if new in allophones.keys(): 551 | del result_segment[-1] 552 | result_segment.append(new) 553 | result_segment.append(current_phon) 554 | 555 | elif (i != 0) and (current_allophone.get('round', '') == 'velarize') and (previous_phon != '_')\ 556 | and (previous_allophone['phon'] == 'C') and ('ˠ' not in previous_phon)\ 557 | and ('soft' not in previous_allophone.get('palatalization', '')): 558 | # Russian has no words that begin with ы 559 | new = previous_phon + 'ˠ' 560 | if new in allophones.keys(): 561 | del result_segment[-1] 562 | result_segment.append(new) 563 | result_segment.append(current_phon) 564 | 565 | else: 566 | result_segment.append(current_phon) 567 | 568 | return result_segment 569 | -------------------------------------------------------------------------------- /src/tools/main_tools.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | import nltk 4 | from num2t4ru import
num2text 5 | 6 | # nltk.download('punkt') 7 | # nltk.download('averaged_perceptron_tagger_ru') 8 | 9 | 10 | def apply_differences(words): 11 | differences = {} 12 | for i, (char1, char2) in enumerate(zip(words[0], words[1].replace('+', ''))): 13 | if char1 != char2: 14 | differences[i + 1] = char2 15 | 16 | original_word, changed_word = words 17 | new_word = [] 18 | n = 0 19 | for i, char in enumerate(changed_word): 20 | if char == '+': 21 | n += 1 22 | continue 23 | elif i + n + 1 in differences: 24 | new_word.append(differences[i + n + 1]) 25 | else: 26 | new_word.append(char) 27 | 28 | return ''.join(new_word) 29 | 30 | 31 | def get_punctuation_dict(text): 32 | """ 33 | Returns a dictionary with the indices of punctuation marks as keys and the corresponding 34 | pause symbol (either '|' or '||') as values. 35 | """ 36 | punctuation = r'.,:;()\—\|\?\!…' 37 | pause_dict = {} 38 | 39 | i = 1 40 | for char in text: 41 | if char in punctuation: 42 | pause_type = '||' if char in '.?!…' else '|' 43 | pause_dict[i] = pause_type 44 | i += 1 45 | 46 | return pause_dict 47 | 48 | 49 | def custom_num2text(tokens: list): 50 | """ 51 | Converts numeric tokens to words. 52 | """ 53 | tokens_normal = [] 54 | cache = {} 55 | 56 | for section_tokens in tokens: 57 | section_normal = [] 58 | for word in section_tokens: 59 | if word.isnumeric(): 60 | if word not in cache: 61 | cache[word] = num2text(int(word)) 62 | word_normal = cache[word] 63 | section_normal.extend(word_normal.split(' ')) 64 | else: 65 | section_normal.append(word) 66 | tokens_normal.append(section_normal) 67 | 68 | return tokens_normal 69 | 70 | 71 | def text_norm_tok(text: str): 72 | """ 73 | Splits text by punctuation (not including ' and ") and then tokenizes it.
74 | """ 75 | sections = re.split(r'[.?!,:;()—…]', text) 76 | sections = [re.sub(r'\s+', ' ', w) for w in sections if w != ''] 77 | sections = [re.sub(r'\s$', '', w) for w in sections if w != ''] 78 | sections = [re.sub(r'^\s', '', w) for w in sections if w != ''] 79 | 80 | tokens = [[re.sub(r"[,.\\|/;:()*&^%$#@?!\[\]{}\"—…«»]", '', word) for word in section.split()] 81 | for section in sections] 82 | 83 | return custom_num2text(tokens) 84 | 85 | 86 | adverb_adp = {'после', 'кругом', 'мимо', 'около', 'вокруг', 'напротив', 'поперёк'} 87 | 88 | 89 | def find_clitics(dep, text, indexes=None): 90 | """ 91 | Finds proclitics and enclitics in text by using dependency trees. 92 | Args: 93 | dep (class 'nltk.tree.tree.Tree'): dependency tree 94 | text (list): list of tokens in the text 95 | indexes (list[tuple]): list of tuples with indexes of a main and a dependent words. 96 | """ 97 | if indexes is None: 98 | indexes = set() 99 | functors_pos = {'CCONJ', 'PART', 'ADP'} 100 | str_dep = str(dep) 101 | 102 | if len(str_dep.split(' ')) > 1: 103 | for token in dep: 104 | if isinstance(token, nltk.tree.Tree): 105 | indexes = find_clitics(token, text, indexes) 106 | 107 | elif (token.pos_ in functors_pos) and (token.text not in adverb_adp): 108 | clitic_index = token.i 109 | main_word_index = None 110 | 111 | if (token.i < len(text) - 1) and (text[token.i + 1] in str_dep) \ 112 | and (text[token.i + 1][0] not in 'еёюяи'): # proclitic 113 | main_word_index = token.i + 1 114 | 115 | elif (token.i > 0) and (text[token.i - 1] in str_dep): # enclitic 116 | main_word_index = token.i - 1 117 | 118 | if main_word_index is not None: 119 | indexes.add((main_word_index, clitic_index)) 120 | 121 | return indexes 122 | 123 | 124 | def extract_phrasal_words(phonemes, indexes): 125 | """ 126 | Joins clitics with main words. 127 | Args: 128 | phonemes (list): list of phonemes with '_' for spaces; 129 | indexes (set[tuple]): set of tuples with indexes of a main and a dependent words. 
130 | """ 131 | tokens_list = [] 132 | start_token_index = 0 133 | 134 | for i, current_phon in enumerate(phonemes): 135 | if current_phon == '_': 136 | tokens_list.append(phonemes[start_token_index:i]) 137 | start_token_index = i + 1 138 | 139 | tokens_list.append(phonemes[start_token_index:]) 140 | 141 | phrasal_words = tokens_list[:] 142 | n = 0 143 | main_word_cache = [] 144 | enclitic_cache = [] 145 | 146 | for tuple_indexes in indexes: 147 | try: 148 | main_word_index = tuple_indexes[0] 149 | 150 | if tuple_indexes[1] > main_word_index: # proclitic 151 | main_word = phrasal_words[main_word_index + n] if main_word_index in main_word_cache\ 152 | else tokens_list[main_word_index] 153 | main_word_cache.append(main_word_index) 154 | proclitic_index = tuple_indexes[1] 155 | 156 | proclitic = [x for x in tokens_list[proclitic_index] if x != '+'] 157 | phrasal_words.remove(tokens_list[proclitic_index]) 158 | phrasal_words.remove(main_word) 159 | if proclitic_index == 1: 160 | phrasal_words.insert(0, main_word + proclitic) 161 | else: 162 | phrasal_words.insert(proclitic_index - main_word_cache.count(main_word_index), 163 | main_word + proclitic) 164 | n -= 1 165 | 166 | else: # enclitic 167 | main_word = phrasal_words[main_word_index - enclitic_cache.count(main_word_index)] \ 168 | if main_word_index in enclitic_cache \ 169 | else tokens_list[main_word_index] 170 | main_word_cache.append(main_word_index) 171 | enclitic_index = tuple_indexes[1] 172 | enclitic_cache.append(enclitic_index) 173 | 174 | enclitic = [x for x in tokens_list[enclitic_index] if x != '+'] 175 | phrasal_words.remove(tokens_list[enclitic_index]) 176 | phrasal_words.remove(main_word) 177 | phrasal_words.insert(enclitic_index + n + enclitic_cache.count(main_word_index), enclitic + main_word) 178 | n -= 1 179 | 180 | except (IndexError, ValueError): # skip pairs that cannot be merged 181 | continue 182 | 183 | phrasal_words_result = [] 184 | for token in phrasal_words: 185 | phrasal_words_result.extend(token + ['_']) 186 | del phrasal_words_result[-1] 187
| 188 | return phrasal_words_result 189 | -------------------------------------------------------------------------------- /src/tools/sounds.py: -------------------------------------------------------------------------------- 1 | from os.path import join, dirname, abspath 2 | from collections import defaultdict 3 | 4 | epi_starterpack = 'a b bʲ v vʲ ɡ ɡʲ d dʲ e ʒ z zʲ i j k kʲ l lʲ m mʲ n nʲ o '\ 5 | 'p pʲ r rʲ s sʲ t tʲ u f fʲ x xʲ t͡s t͡ɕ ʂ ɕː ɨ d͡ʒ'.split() 6 | ru_starterpack = 'ё й ц у к е н г ш щ з х ъ ф ы в а п р о л д ж э я ч с м и т ь б ю'.split() 7 | rus_v = 'а е ё и о у э ю я ы'.split() # russian vowels 8 | 9 | ROOT_DIR = dirname(abspath(__file__)) 10 | 11 | with open(join(ROOT_DIR, '../data/alphabet.txt'), encoding='utf-8') as f: 12 | alphabet = f.read().split(', ') 13 | 14 | with open(join(ROOT_DIR, '../data/sorted_allophones.txt'), encoding='utf-8') as f: 15 | sorted_phonemes_txt = (line.replace('\n', '') for line in f) 16 | sorted_phonemes_1 = {} 17 | for group in sorted_phonemes_txt: 18 | group_name, phonemes = group.split(' = ') 19 | sorted_phonemes_1[group_name] = phonemes.split(', ') 20 | 21 | sorted_phonemes = defaultdict(list) 22 | for key, value in sorted_phonemes_1.items(): 23 | for element in value: 24 | sorted_phonemes[element].append(key) 25 | 26 | with open(join(ROOT_DIR, '../data/paired_consonants.txt'), encoding='utf-8') as f: 27 | paired_c_txt = f.read().replace(')', ')_').split('_, ') 28 | paired_c = {voiced.replace('(', ''): silent.replace(')', '') 29 | for voiced, silent in (pair.split(', ') for pair in paired_c_txt)} 30 | 31 | # creating a dictionary with all allophones 32 | allophones = {key: {'phon': 'V', 'row': None, 'rise': None, 'round': None, 'class': 'vowel'} if 'total_v' in sorted_phonemes[key] 33 | else {'phon': 'C', 'place': None, 'manner': None, 'palatalization': None, 'voice': None, 'pair': None, 34 | 'hissing': None, 'class': None} 35 | for key in alphabet} 36 | # vowels 37 | # row 38 | row_map = {'front_v': 'front', 
'near_front_v': 'near front', 'central_v': 'central', 'near_back_v': 'near back', 39 | 'back_v': 'back'} 40 | # rise 41 | rise_map = {'close_v': 'close', 'near_close_v': 'near close', 'close_mid_v': 'close mid', 'mid_v': 'mid', 42 | 'open_mid_v': 'open mid', 'near_open_v': 'near open', 'open_v': 'open'} 43 | # round / velarize 44 | round_map = {'rounded_v': 'round', 'velarize_v': 'velarize'} 45 | # consonants 46 | # place 47 | place_map = {'bilabial_c': 'labial, bilabial', 'labiodental_c': 'labial, labiodental', 'dental_c': 'lingual, dental', 48 | 'palatinodental_c': 'lingual, palatinоdental', 'palatal_c': 'lingual, palatal', 49 | 'velar_c': 'lingual, velar', 'glottal_c': 'glottal'} 50 | # manner 51 | manner_map = {'explosive_c': 'obstruent, explosive', 'affricate_c': 'obstruent, affricate', 52 | 'fricative_c': 'obstruent, fricative', 'nasal_c': 'sonorant, nasal', 53 | 'lateral_c': 'sonorant, lateral', 'vibrant_c': 'sonorant, vibrant'} 54 | # hard / soft 55 | palatalization_map = {'hard_c': 'hard', 'always_hard_c': 'ahard', 'soft_c': 'soft', 'always_soft_c': 'asoft'} 56 | # voice / silent 57 | voice_map = {'voiced_c': 'voiced', 'voiceless_c': 'voiceless'} 58 | paired_c_inv = {v: k for k, v in paired_c.items()} 59 | # hissing sounds 60 | hissing_map = {'hissing_c': 'hissing'} 61 | # class 62 | class_map = {'sonorous_class': 'sonorous', 'voiced_class': 'voiced', 63 | 'voiceless_class': 'voiceless', 'hissing_class': 'hissing'} 64 | # experiments 65 | # allophones = {key: {'phon': 'V', 'row': None, 'rise': None, 'round': None, 'class': 'vowel', 'experiment': None} if 'total_v' in sorted_phonemes[key] else {'phon': 'C', 'place': None, 'manner': None, 'palatalization': None, 'voice': None, 'pair': None, 'hissing': None, 'class': None, 'experiment': None} for key in alphabet} 66 | # experiment_map = {'complex_experiment': 'complex', 'rare_experiment': 'rare', 'random_vowels_experiment': 'random_vowel', 'long_consonants_experiment': 'long_consonant'} 67 | 68 | for key in 
allophones.keys(): 69 | for group in sorted_phonemes[key]: 70 | # experiments 71 | # experiment = experiment_map.get(group, None) 72 | # allophones[key]['experiment'] = experiment if experiment is not None else allophones[key]['experiment'] 73 | 74 | # vowels 75 | if allophones[key]['phon'] == 'V': 76 | row = row_map.get(group, None) 77 | allophones[key]['row'] = row if row is not None else allophones[key]['row'] 78 | rise = rise_map.get(group, None) 79 | allophones[key]['rise'] = rise if rise is not None else allophones[key]['rise'] 80 | round_ph = round_map.get(group, None) 81 | allophones[key]['round'] = round_ph if round_ph is not None else allophones[key]['round'] 82 | 83 | # consonants 84 | if allophones[key]['phon'] == 'C': 85 | place = place_map.get(group, None) 86 | allophones[key]['place'] = place if place is not None else allophones[key]['place'] 87 | manner = manner_map.get(group, None) 88 | allophones[key]['manner'] = manner if manner is not None else allophones[key]['manner'] 89 | palatalization = palatalization_map.get(group, None) 90 | allophones[key]['palatalization'] = palatalization if palatalization is not None \ 91 | else allophones[key]['palatalization'] 92 | hissing = hissing_map.get(group, None) 93 | allophones[key]['hissing'] = hissing if hissing is not None else allophones[key]['hissing'] 94 | class_consonants = class_map.get(group, None) 95 | allophones[key]['class'] = class_consonants 96 | voice = voice_map.get(group, None) 97 | allophones[key]['voice'] = voice if voice is not None else allophones[key]['voice'] 98 | if (allophones[key]['voice'] == 'voiced') and (key in paired_c.keys()): 99 | allophones[key]['pair'] = paired_c[key] 100 | elif (allophones[key]['voice'] == 'voiceless') and (key in paired_c.values()): 101 | allophones[key]['pair'] = paired_c_inv[key] 102 | 103 | # symbols 104 | allophones.update({symbol: {'phon': 'symb'} for symbol in ['+', '-', '|', '||', '_', '']}) 105 | 
-------------------------------------------------------------------------------- /src/tools/stress_tools.py: -------------------------------------------------------------------------------- 1 | from os.path import join, dirname, abspath 2 | 3 | from stressrnn import StressRNN 4 | 5 | from .sounds import rus_v 6 | 7 | ROOT_DIR = dirname(abspath(__file__)) 8 | 9 | with open(join(ROOT_DIR, '../data/error_words_stresses_default.txt'), encoding='utf-8') as file: 10 | error_words_stresses = file.readlines() 11 | stress_default_dict = {} 12 | for word in error_words_stresses: 13 | stress_default_dict[word.replace('+', '').replace('\n', '')] = word.replace('\n', '') 14 | 15 | stress_rnn = StressRNN() 16 | 17 | 18 | def place_stress(token: str, stress_accuracy_threshold: float): 19 | """ 20 | Places a stress mark ('+') in a token. 21 | Args: 22 | :param token: token without a stress mark. 23 | :param stress_accuracy_threshold: threshold for the accuracy of stress placement for StressRNN. 24 | """ 25 | if token in stress_default_dict.keys(): 26 | return stress_default_dict[token] 27 | 28 | token_list = list(token) 29 | 30 | if 'ё' in token: 31 | token_list.insert(token.index('ё') + 1, '+') 32 | return ''.join(token_list) 33 | 34 | vowels_count = sum(1 for let in token if let in rus_v) 35 | 36 | if vowels_count == 1: 37 | for i, let in enumerate(token): 38 | if let in rus_v: 39 | token_list.insert(i + 1, '+') 40 | return ''.join(token_list) 41 | 42 | if vowels_count == 0: 43 | return ''.join(token_list) 44 | 45 | # raise ValueError("Unfortunately, the automatic stress placement function is not yet available. " 46 | # f"Add stresses yourselves.\nThere is no stress for the word {token}") 47 | return stress_rnn.put_stress(token, accuracy_threshold=stress_accuracy_threshold) 48 | 49 | 50 | def replace_stress(token): 51 | """ 52 | Moves the stress mark from before the stressed vowel to after it. 53 | Args: 54 | token (str): token whose stress mark needs to be moved.
55 | """ 56 | plus_index = token.find('+') 57 | new_token_split = list(token) 58 | new_token_split.remove('+') 59 | new_token_split.insert(plus_index + 1, '+') 60 | return ''.join(new_token_split) 61 | 62 | 63 | def remove_extra_stresses(string: str): 64 | first_plus_index = string.find('+') 65 | return string[:first_plus_index + 1] + string[first_plus_index + 1:].replace('+', '') 66 | 67 | 68 | def replace_stress_before(text): 69 | if isinstance(text, str): 70 | text = list(text) 71 | 72 | text_copy = text.copy() 73 | for i, char in enumerate(text): 74 | if char == '+': 75 | text_copy.pop(i) 76 | text_copy.insert(i - 1, '+') 77 | return text_copy 78 | 79 | 80 | def put_stresses(tokens_list: list, stress_place: str = 'after', stress_accuracy_threshold: float = 0.86): 81 | """ 82 | Puts or replaces stresses. 83 | 84 | :param tokens_list: List of tokens. 85 | :param stress_place: 'after' - to place the stress symbol after the stressed vowel, 86 | 'before' - to place the stress symbol before the stressed vowel. 87 | :param stress_accuracy_threshold: A threshold for the accuracy of stress placement for StressRNN. 88 | :return: List of tokens. 
89 | """ 90 | res = [] 91 | for token in tokens_list: 92 | if ('+' in token) and (stress_place == 'before'): # need to replace 93 | res.append(replace_stress(token)) 94 | elif '+' not in token: # use StressRNN 95 | res.append(place_stress(token, stress_accuracy_threshold)) 96 | else: 97 | res.append(token) 98 | 99 | return res 100 | 101 | 102 | """ 103 | [ 104 | replace_stress(token) if ('+' in token) and (stress_place == 'before') # need to replace 105 | else place_stress(token, stress_accuracy_threshold) if ('+' not in token) # use StressRNN 106 | else token 107 | for token in tokens_list 108 | ] 109 | """ -------------------------------------------------------------------------------- /src/tools/syntax_tree.py: -------------------------------------------------------------------------------- 1 | import spacy 2 | from nltk import Tree 3 | 4 | 5 | class SyntaxTree: 6 | def __init__(self): 7 | self.dependency_tree = None 8 | self.nlp = spacy.load('ru_core_news_sm') 9 | 10 | def to_nltk_tree(self, node): 11 | if node.n_lefts + node.n_rights > 0: 12 | return Tree(node, [self.to_nltk_tree(child) for child in node.children]) 13 | 14 | return node 15 | 16 | def make_dependency_tree(self, text): 17 | """ 18 | Makes a dependency tree. 19 | Args: 20 | text (str): original text 21 | """ 22 | doc = self.nlp(text) 23 | for sent in doc.sents: 24 | self.dependency_tree = self.to_nltk_tree(sent.root) 25 | 26 | return self.dependency_tree 27 | --------------------------------------------------------------------------------
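The stress-mark convention handled by `replace_stress` and `put_stresses` in `src/tools/stress_tools.py` can be tried out without installing StressRNN. The block below is a minimal sketch under stated assumptions, not the package's own code: `move_stress_after` and `normalize_stress` are hypothetical names, and the `fallback` callable is only a stand-in for the StressRNN-based `place_stress`.

```python
from typing import Callable, List


def move_stress_after(token: str) -> str:
    """Move the first '+' from before the stressed vowel to after it (hypothetical helper)."""
    plus_index = token.find('+')
    if plus_index == -1 or plus_index == len(token) - 1:
        # No mark, or nothing follows the mark: leave the token unchanged.
        return token
    chars = list(token)
    chars.pop(plus_index)
    # After removal the stressed vowel sits at plus_index; put '+' right after it.
    chars.insert(plus_index + 1, '+')
    return ''.join(chars)


def normalize_stress(tokens: List[str], stress_place: str = 'after',
                     fallback: Callable[[str], str] = lambda t: t) -> List[str]:
    """Return tokens with '+' after the stressed vowel wherever a mark is present."""
    result = []
    for token in tokens:
        if '+' not in token:
            # Assumption: `fallback` replaces automatic stress placement.
            result.append(fallback(token))
        elif stress_place == 'before':
            result.append(move_stress_after(token))
        else:
            result.append(token)
    return result


print(normalize_stress(['как', 'получ+ить', 'транскрипцию'], stress_place='before'))
# ['как', 'получи+ть', 'транскрипцию']
```

Passing the automatic placer in as a callable keeps the sketch free of model downloads; in the package itself that role is played by `place_stress`.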