├── .gitignore
├── README.md
├── example.py
├── requirements.txt
├── setup.py
└── src
    ├── RuTranscript.py
    ├── __init__.py
    ├── data
        ├── alphabet.txt
        ├── error_words_stresses_default.txt
        ├── irregular_exceptions.xlsx
        ├── paired_consonants.txt
        └── sorted_allophones.txt
    ├── tests
        ├── test_consonants.py
        ├── test_modules.py
        ├── test_phrases.py
        └── test_vowels.py
    └── tools
        ├── __init__.py
        ├── allophones_tools.py
        ├── main_tools.py
        ├── sounds.py
        ├── stress_tools.py
        └── syntax_tree.py


/.gitignore:
--------------------------------------------------------------------------------
1 | *.egg-info
2 | __pycache__/
3 | .vscode/
4 | .idea
5 | 
6 | dist/
7 | build/
8 | temp/
9 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # RuTranscript
 2 | 
 3 | This package produces phonetic transcriptions of Russian text. 
 4 | The library follows the literary norm of phonetic transcription for the Russian language and uses symbols 
 5 | of the International Phonetic Alphabet. The transcription distinguishes allophones. 
 6 | The library can be used in automatic speech recognition and speech synthesis tasks.
 7 | 
 8 | At the moment, the framework has no functionality for syllable division, because syllabification is variable. 
 9 | Therefore, allophones that depend on position in the syllable 
10 | (for example, *j* at the beginning of a syllable, realized as *ʝ*) are identified only where the beginning of 
11 | the syllable coincides with the beginning of the word or the end of the syllable coincides with the end of the word.
12 | 
13 | For a more detailed description of how the framework works, see the article: https://www.dialog-21.ru/media/5722/badasyana137.pdf
14 | 
15 | # Installation
16 | 
17 | ```
18 | pip install git+https://github.com/suralmasha/RuTranscript
19 | pip install -r requirements.txt
20 | ```
21 | 
22 | # Usage
23 | 
24 | Store your text in a variable (`text` in the example), 
25 | pass it to `RuTranscript()`, and call the `transcribe()` method.
26 | 
27 | ```
28 | from ru_transcript import RuTranscript
29 | 
30 | text = 'Как получить транскрипцию?'
31 | ru_transcript = RuTranscript(text)
32 | ru_transcript.transcribe()
33 | ```
34 | 
35 | You may define stresses for a single word or for all words in the text. 
36 | To do this, put a stress symbol (preferably '+') before or after the stressed vowel 
37 | and pass the stressed text as the second argument (`stressed_text_if_have` in the example). 
38 | Use the `stress_place` parameter to indicate where you put the stress mark (possible values: `'after'`, the default, or `'before'`).
39 | 
40 | **Important!** The number of words in these two texts must match.
41 | 
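The word-count requirement can be checked before the texts are passed to the library. The sketch below is only an illustration: `check_word_counts` is a hypothetical helper, not part of RuTranscript, and it assumes that splitting on whitespace is a reasonable approximation of how words are counted. The library's own tokenization may differ.

```python
def check_word_counts(text: str, stressed_text: str) -> None:
    """Raise ValueError if the two texts differ in whitespace-separated word count.

    Hypothetical pre-check; the library's own tokenization may count words differently.
    """
    n_plain = len(text.split())
    n_stressed = len(stressed_text.split())
    if n_plain != n_stressed:
        raise ValueError(f"word count mismatch: {n_plain} vs {n_stressed}")


# Passes silently: both texts contain three words.
check_word_counts('Как получить транскрипцию?', 'Как получи+ть транскрипцию?')
```

Running such a check first gives a clear error message instead of a failure somewhere inside the transcription pipeline.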
42 | ```
43 | text = 'Как получить транскрипцию?'
44 | stressed_text_if_have = 'Как получи+ть транскрипцию?'
45 | ru_transcript = RuTranscript(text, stressed_text_if_have)
46 | ru_transcript.transcribe()
47 | ```
48 | 
49 | or
50 | 
51 | ```
52 | text = 'Как получить транскрипцию?'
53 | stressed_text_if_have = 'Как получ+ить транскрипцию?'
54 | ru_transcript = RuTranscript(text, stressed_text_if_have, stress_place='before')
55 | ru_transcript.transcribe()
56 | ```
57 | 
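The two stress conventions above differ only in which side of the vowel carries the mark. As a rough illustration, the sketch below converts a word from the `'before'` convention to the `'after'` convention. `stress_before_to_after` and `VOWELS` are hypothetical names introduced here, not part of the library's API, and the function assumes the mark directly precedes a Russian vowel letter.

```python
# Russian vowel letters (assumption: stress marks only attach to these).
VOWELS = set('аеёиоуыэюя')


def stress_before_to_after(word: str, mark: str = '+') -> str:
    """Move a stress mark that precedes a vowel so that it follows that vowel."""
    out = []
    i = 0
    while i < len(word):
        # A mark directly followed by a vowel is swapped with that vowel.
        if word[i] == mark and i + 1 < len(word) and word[i + 1].lower() in VOWELS:
            out.append(word[i + 1])
            out.append(mark)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return ''.join(out)


print(stress_before_to_after('получ+ить'))  # получи+ть
```

In normal use this conversion is unnecessary, since `stress_place='before'` tells the library how to read the mark.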
58 | Pauses are arranged according to punctuation: the end of a sentence is indicated by a long pause (`'||'`), 
59 | punctuation marks inside a sentence are indicated by short pauses (`'|'`). 
60 | 
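The pause rule can be sketched as a simple mapping from punctuation to pause symbols. This is a simplified illustration, not the library's implementation: `pause_symbols` is a hypothetical helper, the long-pause set follows the method docstrings (`.`, `?`, `!`, `…`), and the short-pause set is an assumption chosen for the example.

```python
# Sentence-final marks produce a long pause (per the library's docstrings).
LONG_PAUSE_MARKS = {'.', '?', '!', '…'}
# Assumption: a few common sentence-internal marks; the real set may be larger.
SHORT_PAUSE_MARKS = {',', ';', ':', '—'}


def pause_symbols(text: str) -> list:
    """Return the pause symbols implied by the punctuation in `text`, in order."""
    pauses = []
    for ch in text:
        if ch in LONG_PAUSE_MARKS:
            pauses.append('||')
        elif ch in SHORT_PAUSE_MARKS:
            pauses.append('|')
    return pauses


print(pause_symbols('Да, конечно. Как получить транскрипцию?'))  # ['|', '||', '||']
```

In the library itself, pauses appear in the output only when `save_pauses=True` is passed to `get_allophones()` or `get_phonemes()`.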
61 | You can get a list of **allophones** with the `get_allophones()` method.
62 | 
63 | ```
64 | print(ru_transcript.get_allophones())
65 | ```
66 | 
67 | Output:
68 | ```
69 | ['k', 'a', 'k', 'p', 'ə', 'lʷ', 'ʊ', 't͡ɕ', 'i', 'tʲ', 't', 'r', 'ɐ', 'n', 's', 'k', 'rʲ', 'i', 'p', 't͡sˠ', 'ɨ', 'jᶣ', 'ᵿ']
70 | ```
71 | 
72 | You can get a list of **phonemes (main allophones)** with the `get_phonemes()` method; 
73 | this is a less detailed kind of transcription.
74 | 
75 | ```
76 | print(ru_transcript.get_phonemes())
77 | ```
78 | 
79 | Output:
80 | ```
81 | ['k', 'a', 'k', 'p', 'o', 'l', 'u', 't͡ɕ', 'i', 'tʲ', 't', 'r', 'a', 'n', 's', 'k', 'rʲ', 'i', 'p', 't͡s', 'i', 'j', 'u']
82 | ```
83 | 
84 | You can see **how stresses were placed** with the `get_stressed_text()` method.
85 | 
86 | ```
87 | print(ru_transcript.get_stressed_text())
88 | ```
89 | 
90 | Output:
91 | ```
92 | 'ка+к получи+ть транскри+пцию'
93 | ```
94 | 
95 | You can also find an example of using the framework in `example.py`.
96 | 


--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
 1 | from src import RuTranscript
 2 | 
 3 | if __name__ == "__main__":
 4 |     text = 'Как получить транскрипцию?'
 5 |     ru_transcript = RuTranscript(text)
 6 |     ru_transcript.transcribe()
 7 | 
 8 |     print("{:<15} {:}".format('Original text:', text))
 9 |     print("{:<15} {:}".format('Stressed text:', ru_transcript.get_stressed_text(
10 |         stress_place='before',
11 |         stress_symbol='+'
12 |     )))
13 | 
14 |     print('------------------------------------------------------')
15 |     print('Transcription (allophones):')
16 |     print(' '.join(ru_transcript.get_allophones()))
17 |     print('Transcription (allophones) with spaces, pauses and stresses:')
18 |     print(' '.join(ru_transcript.get_allophones(
19 |         stress_place='before',
20 |         save_stresses=True,
21 |         save_spaces=True,
22 |         save_pauses=True,
23 |         stress_symbol='+'
24 |     )))
25 | 
26 |     print('------------------------------------------------------')
27 |     print('Transcription (phonemes):')
28 |     print(' '.join(ru_transcript.get_phonemes()))
29 |     print('Transcription (phonemes) with spaces, pauses and stresses:')
30 |     print(' '.join(ru_transcript.get_phonemes(
31 |         stress_place='before',
32 |         save_stresses=True,
33 |         save_spaces=True,
34 |         save_pauses=True,
35 |         stress_symbol='+'
36 |     )))
37 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | spacy==3.4.4
 2 | ru_core_news_sm @ https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.4.0/ru_core_news_sm-3.4.0-py3-none-any.whl
 3 | numpy==1.23.3
 4 | nltk==3.7
 5 | epitran==1.24
 6 | openpyxl==3.1.1
 7 | -e git+https://github.com/sovaai/sova-tts-tps@v1.2.0#egg=tps
 8 | -e git+https://github.com/Desklop/StressRNN#egg=stressrnn
 9 | -e git+https://github.com/seriyps/ru_number_to_text#egg=num2t4ru
10 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/python3
 2 | # -*- coding: utf-8 -*-
 3 | # Author: suralmasha - Badasyan Alexandra
 4 | 
 5 | import os
 6 | from setuptools import setup, find_packages
 7 | from shutil import copytree, copy, rmtree, ignore_patterns
 8 | from os.path import join
 9 | 
10 | 
11 | if __name__ == "__main__":
12 |     # Package constants
13 |     PACKAGE_NAME = 'ru_transcript'
14 |     PACKAGE_VERSION = '1.0'
15 |     PACKAGE_DESCRIPTION = 'Phonetic transcription in russian'
16 |     PACKAGE_SOURCES_URL = 'https://github.com/suralmasha/RuTranscript'
17 | 
18 |     # Variables
19 |     sources_dir = './src'
20 |     temp_dir = 'temp'
21 |     excluded_files = ignore_patterns('setup.py', '.git', 'dist', 'tests', 'example.py', 'jpt_example.ipynb')
22 | 
23 |     # Prepare temp folders
24 |     rmtree(temp_dir, ignore_errors=True)
25 |     copytree(sources_dir, join(temp_dir, 'ru_transcript'), copy_function=copy, ignore=excluded_files)
26 | 
27 |     # Read long_description
28 |     with open('README.md', encoding='utf8') as f:
29 |         long_description = f.read().splitlines()
30 |         long_description = '\n'.join(long_description)  # join with real newlines
31 | 
32 |     # Read requirements from file, excluding blank lines and comments
33 |     with open('requirements.txt', encoding='utf8') as f:
34 |         install_requires = [line.strip() for line in f if line.strip() and not line.lstrip().startswith('#')]
35 | 
36 |     # Prepare data files
37 |     data_files = [join('data', '*.txt'), join('data', '*.xlsx')]
38 | 
39 |     # Classifiers
40 |     classifiers = [
41 |         'Natural Language :: Russian',
42 |         'Programming Language :: Python :: 3.8',
43 |         'Topic :: Text Processing :: Linguistic :: NLP'
44 |     ]
45 | 
46 |     # Build package
47 |     setup(
48 |         name=PACKAGE_NAME,  # package name
49 |         version=PACKAGE_VERSION,  # version
50 |         description=PACKAGE_DESCRIPTION,  # short description
51 |         long_description=long_description,
52 |         url=PACKAGE_SOURCES_URL,  # package URL
53 |         author='Badasyan Alexandra',
54 |         author_email='sashabadasyan@icloud.com',
55 |         classifiers=classifiers,
56 |         keywords='nlp russian transcription phonetics linguistic',
57 |         install_requires=install_requires,  # list of packages this package depends on
58 |         packages=find_packages(temp_dir),  # list of package names found under the temp dir
59 |         package_dir={'': temp_dir},  # set up sources dir
60 |         package_data={'': data_files},  # append all external files to package
61 |         include_package_data=True,
62 |         zip_safe=False
63 |     )
64 | 


--------------------------------------------------------------------------------
/src/RuTranscript.py:
--------------------------------------------------------------------------------
  1 | import warnings
  2 | from os.path import join, dirname, abspath
  3 | 
  4 | import spacy
  5 | import epitran
  6 | from openpyxl import load_workbook
  7 | from nltk.stem.snowball import SnowballStemmer
  8 | from tps import find, download
  9 | from tps import modules as md
 10 | 
 11 | from .tools.main_tools import get_punctuation_dict, text_norm_tok, find_clitics, extract_phrasal_words, \
 12 |     apply_differences
 13 | from .tools.stress_tools import put_stresses, remove_extra_stresses, replace_stress_before
 14 | from .tools.allophones_tools import nasal_m_n, silent_r, voiced_ts, shch, long_ge, fix_jotised, long_consonants, \
 15 |     vowels, labia_velar, stunning, assimilative_palatalization, first_jot
 16 | from .tools.sounds import epi_starterpack, allophones
 17 | from .tools.syntax_tree import SyntaxTree
 18 | 
 19 | snowball = SnowballStemmer('russian')
 20 | nlp = spacy.load('ru_core_news_sm', disable=["tagger", "morphologizer", "attribute_ruler"])
 21 | 
 22 | ROOT_DIR = dirname(abspath(__file__))
 23 | wb = load_workbook(join(ROOT_DIR, 'data/irregular_exceptions.xlsx'))
 24 | sheet = wb.active
 25 | irregular_exceptions = {sheet[f'A{i}'].value: sheet[f'B{i}'].value for i in range(2, sheet.max_row + 1)}
 26 | irregular_exceptions_stems = {snowball.stem(ex): pron for ex, pron in irregular_exceptions.items()}
 27 | 
 28 | epi = epitran.Epitran('rus-Cyrl')
 29 | second_silent = 'стн стл здн рдн нтск ндск лвств'.split()
 30 | first_silent = 'лнц дц вств'.split()
 31 | hissing_rd = {'сш': 'шш', 'зш': 'шш', 'сж': 'жж', 'сч': 'щ'}
 32 | non_ipa_symbols = {'t͡ɕʲ': 't͡ɕ', 'ʂʲː': 'ʂ', 'ɕːʲ': 'ɕː', 'ʒ': 'ʐ', 'd͡ʐ': 'd͡ʒ'}
 33 | 
 34 | try:
 35 |     yo_dict = find("yo.dict", raise_exception=True)
 36 | except FileNotFoundError:
 37 |     yo_dict = download("yo.dict")
 38 | 
 39 | try:
 40 |     e_dict = find("e.dict", raise_exception=True)
 41 | except FileNotFoundError:
 42 |     e_dict = download("e.dict")
 43 | 
 44 | e_replacer = md.Replacer([e_dict, "plane"])
 45 | yo_replacer = md.Replacer([yo_dict, "plane"])
 46 | syntax_tree = SyntaxTree()
 47 | 
 48 | 
 49 | class RuTranscript:
 50 |     def __init__(self, text: str, stressed_text: str = None, stress_place: str = 'after', replacement_dict: dict = None,
 51 |                  stress_accuracy_threshold: float = 0.86):
 52 |         """
 53 |         Produces a phonetic transcription of Russian text using IPA.
 54 | 
 55 |         :param text: A text to transcribe.
 56 |         :param stressed_text: The same (!) text with stresses.
 57 |             You may define stresses both for one word and for all words in the text.
 58 |             To do this, put a stress symbol (preferably '+') before or after the stressed vowel.
 59 |         :param stress_place: 'after' - if the stress symbol is after the stressed vowel,
 60 |             'before' - if the stress symbol is before the stressed vowel.
 61 |         :param replacement_dict: Custom dictionary for replacing words (for example, {'tts': 'синтез речи'}).
 62 |         :param stress_accuracy_threshold: A threshold for the accuracy of stress placement for StressRNN.
 63 |         """
 64 |         text, stressed_text = self._get_text_and_stressed_text(text, stressed_text, replacement_dict)
 65 |         self._pause_dict = get_punctuation_dict(text)
 66 |         self._tokens = text_norm_tok(text)
 67 |         self._sections_len = len(self._tokens)
 68 |         self._stressed_tokens = text_norm_tok(stressed_text)
 69 | 
 70 |         self._stress_accuracy_threshold = stress_accuracy_threshold
 71 |         self._stress_place = stress_place
 72 | 
 73 |         self._phrasal_words_indexes = []
 74 |         self._letters_list = []
 75 |         self._phonemes_list = []
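 76 |         # Build independent inner lists; `[[]] * n` would repeat one shared list object
 77 |         self._allophones_list = [[] for _ in range(self._sections_len)]
 78 |         self._transliterated_tokens = [[] for _ in range(self._sections_len)]
 79 |         self._phrasal_words = [[] for _ in range(self._sections_len)]
 80 |         self._stressed_text = [[] for _ in range(self._sections_len)]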
 80 | 
 81 |     def _get_text_and_stressed_text(self, text, stressed_text, replacement_dict):
 82 |         text = ' '.join(['—' if word == '-' else word for word in text.replace('\n', ' ').lower().split()])
 83 |         stressed_text = ' '.join(['—' if word == '-' else word
 84 |                                   for word in stressed_text.replace('\n', ' ').lower().split()]) \
 85 |             if stressed_text is not None else text
 86 | 
 87 |         if replacement_dict is not None:
 88 |             user_replacer = md.Replacer([replacement_dict, "plane"])
 89 |             text = user_replacer(text)
 90 |             stressed_text = user_replacer(stressed_text)
 91 | 
 92 |         return text, stressed_text
 93 | 
 94 |     def _remove_dashes(self, section_num):
 95 |         section = self._tokens[section_num]
 96 |         a_section = self._stressed_tokens[section_num]
 97 |         self._tokens[section_num] = [token.replace('-', '') for token in section]
 98 |         self._stressed_tokens[section_num] = [token.replace('-', '') if token.count('+') == 1
 99 |                                               else remove_extra_stresses(token).replace('-', '')
100 |                                               for token in a_section]
101 | 
102 |     def _tps(self, section_num):
103 |         """
104 |         Applies the dictionary replacements 'е' -> 'э' and 'е' -> 'ё'.
105 |         """
106 |         default_section = self._tokens[section_num]
107 |         self._tokens[section_num] = [e_replacer(token.replace('+', '')) for token in self._tokens[section_num]]
108 |         self._tokens[section_num] = [yo_replacer(token.replace('+', '')) for token in self._tokens[section_num]]
109 | 
110 |         if self._tokens[section_num] != [token.replace('+', '') for token in default_section]:
111 |             self._stressed_tokens[section_num] = [
112 |                 apply_differences([default_section[i], self._tokens[section_num][i]])
113 |                 for i in range(len(default_section))
114 |             ]
115 | 
116 |     def _join_phonemes(self, transliterated_tokens, limit=10000):
117 |         section_phonemes_list = []
118 |         joined_tokens = '_'.join(transliterated_tokens)
119 |         joined_tokens = joined_tokens.replace('‑', '-')
120 |         i = 0
121 |         counter = 0
122 |         default_len = len(joined_tokens)
123 |         while i < default_len:
124 |             if joined_tokens[i] not in ['+', '-']:
125 |                 n = 4
126 |                 if i != default_len - 1:
127 |                     while (joined_tokens[i: i + n] not in epi_starterpack + ['_', '|', '||', 'γ', 'ʐ']) and (n > 0):
128 |                         counter += 1
129 |                         if counter > limit:
130 |                             raise IndexError('Endless loop')
131 |                         n -= 1
132 |                     section_phonemes_list.append(joined_tokens[i: i + n])
133 |                 elif joined_tokens[i] in epi_starterpack + ['||', 'γ']:
134 |                     section_phonemes_list.append(joined_tokens[i])
135 |                 i += n
136 |             else:
137 |                 section_phonemes_list.append(joined_tokens[i])
138 |                 i += 1
139 | 
140 |         section_phonemes_list = [x for x in section_phonemes_list if x not in ['', 'ʲ']]
141 | 
142 |         n = 0
143 |         for allophone_index in range(len(section_phonemes_list) - 1):
144 |             allophone = section_phonemes_list[allophone_index + n]
145 |             next_allophone = section_phonemes_list[allophone_index + n + 1]
146 |             if (allophone == 't͡s' and next_allophone == 's') or (allophone == 'd͡ʒ' and next_allophone == 'ʐ'):
147 |                 del section_phonemes_list[allophone_index + n + 1]
148 |                 n -= 1
149 | 
150 |         # print(section_phonemes_list)
151 |         return section_phonemes_list
152 | 
153 |     @staticmethod
154 |     def add_prestressed_syllable_sign(section: list):
155 |         section_result = section[:]
156 |         n = 0
157 |         for symb_i, symb in enumerate(section):
158 |             if symb == '+':
159 |                 preavi = [phon_i for phon_i, phon in enumerate(section[:symb_i - 1]) if
160 |                           allophones[phon]['phon'] == 'V' and '_' not in section[phon_i + n:symb_i]]
161 |                 if preavi:
162 |                     section_result.insert(preavi[-1] + n + 1, '-')
163 |                     n += 1
164 | 
165 |         return section_result
166 | 
167 |     def _lpt_1(self, section_num):
168 |         """
169 |         Letter-phoneme transformation by B.M. Lobanov. Part 1 - Irregular exceptions.
170 |         """
171 |         for i, token in enumerate(self._tokens[section_num]):
172 |             stem = snowball.stem(token)
173 |             if stem in irregular_exceptions_stems:
174 |                 try:
175 |                     new_token = irregular_exceptions[token]
176 |                 except KeyError:
177 |                     ending = token[len(stem):]
178 |                     dif = - (len(token) - len(stem))
179 |                     new_token = irregular_exceptions_stems[stem][:dif] + ending
180 | 
181 |                 self._tokens[section_num][i] = new_token
182 |                 accent_index = self._stressed_tokens[section_num][i].index('+')
183 |                 self._stressed_tokens[section_num][i] = new_token[:accent_index] + '+' + new_token[accent_index:]
184 | 
185 |     def _lpt_2(self, section_num):
186 |         """
187 |         Letter-phoneme transformation by B.M. Lobanov. Part 2 - Regular exceptions.
188 |         """
189 |         for i, token in enumerate(self._stressed_tokens[section_num]):
190 |             # adjective endings 'ого его'
191 |             if token != 'ого+' and (token.replace('+', '').startswith('какого')
192 |                                     or token.replace('+', '').endswith('ого')
193 |                                     or token.replace('+', '').endswith('его')):
194 |                 accent_index = token.index('+')
195 |                 token = token.replace('+', '').replace('ого', 'ово').replace('его', 'ево')
196 |                 self._stressed_tokens[section_num][i] = token[:accent_index] + '+' + token[accent_index:]
197 | 
198 |             # 'что' --> 'што'
199 |             if 'что' in self._stressed_tokens[section_num][i]:
200 |                 self._stressed_tokens[section_num][i] = token.replace('что', 'што')
201 | 
202 |             # verb endings 'тся ться'
203 |             if token not in {'заботься', 'отметься'}:
204 |                 if token[-3:] == 'тся':
205 |                     self._stressed_tokens[section_num][i] = token[:-3] + 'ца'
206 |                 elif token[-4:] == 'ться':
207 |                     self._stressed_tokens[section_num][i] = token[:-4] + 'ца'
208 | 
209 |             # noun endings 'ия ие ию'
210 |             if (token[-2:] in {'ия', 'ие', 'ию'}) and (token[-3] not in {'ц', 'щ'}):
211 |                 if token[-3] not in {'ж', 'ш'}:
212 |                     self._stressed_tokens[section_num][i] = token[:-2] + 'ь' + token[-1]
213 |                 else:
214 |                     self._stressed_tokens[section_num][i] = token[:-2] + 'й' + token[-1]
215 | 
216 |             # unpronounceable consonants
217 |             for sub in first_silent + second_silent:
218 |                 if sub in token:
219 |                     new_sub = sub.translate(str.maketrans('', '', 'ьъ'))
220 |                     self._stressed_tokens[section_num][i] = token.translate(str.maketrans(sub, new_sub))
221 | 
222 |             # combinations with hissing consonants
223 |             stem = snowball.stem(token)
224 |             if ('зч' in token or 'тч' in token or 'дч' in token) and (stem[-3:] == 'чик' or stem[-3:] == 'чиц'):
225 |                 self._stressed_tokens[section_num][i] = token.replace('зч', 'щ').replace('тч', 'ч').replace('дч',
226 |                                                                                                             'ч')
227 |             for key, value in hissing_rd.items():
228 |                 if key in token:
229 |                     self._stressed_tokens[section_num][i] = token.replace(key, value)
230 | 
231 |     def _lpt_3(self, section_num):
232 |         """
233 |         Letter-phoneme transformation by B.M. Lobanov. Part 3 - Transliteration
234 |         """
235 |         self._transliterated_tokens[section_num] = [epi.transliterate(token).replace('6', '').replace('4', '')
236 |                                                     for token in self._stressed_tokens[section_num]]
237 |         for i, token in enumerate(self._transliterated_tokens[section_num]):
238 |             for key, value in non_ipa_symbols.items():
239 |                 if key in token:
240 |                     token = token.replace(key, value)
241 |                     self._transliterated_tokens[section_num][i] = token
242 | 
243 |     def _lpt_4(self, section_num):
244 |         """
245 |         Letter-phoneme transformation by B.M. Lobanov. Part 4 - Common Rules
246 |         """
247 |         # fricative g
248 |         for i, token in enumerate(self._transliterated_tokens[section_num]):
249 |             try:
250 |                 next_token = self._transliterated_tokens[section_num][i + 1]
251 |             except IndexError:
252 |                 next_token = ' '
253 | 
254 |             token_let = self._tokens[section_num][i]
255 |             nlp_token = nlp(token_let)[0]
256 |             lemma = nlp_token.lemma_
257 | 
258 |             if lemma in {'ага', 'ого', 'угу', 'господь', 'господи', 'бог'}:
259 |                 self._transliterated_tokens[section_num][i] = token.replace('ɡ', 'γ', 1)
260 |             elif token_let in {'ах', 'эх', 'ох', 'ух'}:
261 |                 next_token_allophone = allophones.get(next_token[0], {})
262 |                 if next_token_allophone.get('voice', '') == 'voiced':
263 |                     self._transliterated_tokens[section_num][i] = token.replace('x', 'γ', 1)
264 | 
265 |         # ---- Join phonemes ----
266 |         joined_phonemes = self._join_phonemes(self._transliterated_tokens[section_num], limit=10000)
267 |         self._phonemes_list.append(joined_phonemes)
268 | 
269 |         # ---- Join letters ----
270 |         joined_letters = list('_'.join(self._stressed_tokens[section_num]))
271 |         self._letters_list.append(joined_letters)
272 | 
273 |         # ---- Continue LPT-4. Common rules ----
274 |         self._phonemes_list[section_num] = fix_jotised(self._phonemes_list[section_num],
275 |                                                        self._letters_list[section_num])
276 |         self._phonemes_list[section_num] = shch(self._phonemes_list[section_num])
277 |         self._phonemes_list[section_num] = long_ge(self._phonemes_list[section_num])
278 |         self._phonemes_list[section_num] = assimilative_palatalization(self._tokens[section_num],
279 |                                                                        self._phonemes_list[section_num])
280 |         self._phonemes_list[section_num] = long_consonants(self._phonemes_list[section_num])
281 |         self._phonemes_list[section_num] = stunning(self._phonemes_list[section_num])
282 | 
283 |     def transcribe(self):
284 |         for section_num in range(self._sections_len):
285 |             self._tps(section_num)
286 |             # ---- Accenting ----
287 |             self._stressed_tokens[section_num] = put_stresses(
288 |                 tokens_list=self._stressed_tokens[section_num],
289 |                 stress_place=self._stress_place,
290 |                 stress_accuracy_threshold=self._stress_accuracy_threshold)
291 |             self._stressed_text[section_num] = self._stressed_tokens[section_num]
292 |             # ---- Removing dashes ----
293 |             self._remove_dashes(section_num)
294 |             # ---- Phrasal words extraction ----
295 |             dep = syntax_tree.make_dependency_tree(' '.join(self._tokens[section_num]))
296 |             self._phrasal_words_indexes.append(find_clitics(dep, self._tokens[section_num]))
297 |             # ---- Letter-phoneme transformation ----
298 |             self._lpt_1(section_num)
299 |             self._lpt_2(section_num)
300 |             self._lpt_3(section_num)
301 |             self._lpt_4(section_num)
302 |             # ---- Allophones - consonants ----
303 |             self._allophones_list[section_num] = first_jot(self._phonemes_list[section_num])
304 |             self._allophones_list[section_num] = nasal_m_n(self._allophones_list[section_num])
305 |             self._allophones_list[section_num] = silent_r(self._allophones_list[section_num])
306 |             self._allophones_list[section_num] = voiced_ts(self._allophones_list[section_num])
307 |             # ---- Extract phrasal words ----
308 |             self._phrasal_words[section_num] = extract_phrasal_words(self._allophones_list[section_num],
309 |                                                                      self._phrasal_words_indexes[section_num])
310 |             #  ---- Allophones - vowels ----
311 |             self._phrasal_words[section_num] = self.add_prestressed_syllable_sign(self._phrasal_words[section_num])
312 |             self._allophones_list[section_num] = vowels(self._phrasal_words[section_num])
313 |             self._allophones_list[section_num] = labia_velar(self._allophones_list[section_num])
314 | 
315 |     def _insert_pauses(self, sounds_list: list):
316 |         for i, key in enumerate(self._pause_dict):
317 |             sounds_list.insert(i + key, self._pause_dict[key])
318 | 
319 |     def _get_escape_symbols(self, save_stresses: bool = False, save_spaces: bool = False):
320 |         escape_symbols = ['+', '-', '_']
321 |         if save_stresses:
322 |             escape_symbols.remove('+')
323 |         if save_spaces:
324 |             escape_symbols.remove('_')
325 | 
326 |         return escape_symbols
327 | 
328 |     def _join_sounds(self, escape_symbols: list, sounds_list: list):
329 |         return ' '.join(
330 |             [' '.join([x for x in section if x not in escape_symbols])
331 |              if section != '||'
332 |              else section
333 |              for section in sounds_list]
334 |         )
335 | 
336 |     def get_allophones(self, stress_place: str = None, save_stresses: bool = False, save_spaces: bool = False,
337 |                        save_pauses: bool = False, stress_symbol: str = '+'):
338 |         """
339 |         :param stress_place: 'after' - to place the stress symbol after the stressed vowel,
340 |             'before' - to place the stress symbol before the stressed vowel.
341 |         :param stress_symbol: A symbol that you want to indicate the stress.
342 |             Be careful not to use letters and signs from the following list
343 |             ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!
344 |         :param save_spaces: Will replace spaces with '_'.
345 |         :param save_stresses: Will replace stresses with the stress_symbol.
346 |         :param save_pauses: Will replace punctuation with '||' for long pauses ('.', '?', '!', '…')
347 |             and '|' for short pauses (other symbols).
348 |         :return: List of allophones.
349 |         """
350 |         if stress_symbol in ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']:
351 |             warnings.warn("The stress symbol intersects with the IPA transcription signs "
352 |                           "or the internal signs of the framework.\nIt may cause unpredictable behaviour.\n"
353 |                           "Avoid using signs from the following list "
354 |                           "['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!")
355 | 
356 |         if save_pauses:
357 |             self._insert_pauses(self._allophones_list)
358 | 
359 |         escape_symbols = self._get_escape_symbols(save_stresses=save_stresses, save_spaces=save_spaces)
360 |         res = self._join_sounds(escape_symbols, self._allophones_list).split()
361 | 
362 |         if stress_place is None:
363 |             stress_place = self._stress_place
364 |         if stress_place == 'before':
365 |             res = replace_stress_before(res)
366 | 
367 |         if (stress_symbol != '+') and ('+' not in escape_symbols):
368 |             res = [x.replace('+', stress_symbol) for x in res]
369 | 
370 |         return res
371 | 
372 |     def get_phonemes(self, stress_place: str = None, save_stresses: bool = False, save_spaces: bool = False,
373 |                      save_pauses: bool = False, stress_symbol: str = '+'):
374 |         """
375 |         :param stress_place: 'after' - to place the stress symbol after the stressed vowel,
376 |             'before' - to place the stress symbol before the stressed vowel.
377 |         :param stress_symbol: A symbol that you want to indicate the stress.
378 |             Be careful not to use letters and signs from the following list
379 |             ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!
380 |         :param save_spaces: Will replace spaces with '_'.
381 |         :param save_stresses: Will replace stresses with the stress_symbol.
382 |         :param save_pauses: Will replace punctuation with '||' for long pauses ('.', '?', '!', '…')
383 |             and '|' for short pauses (other symbols).
384 |         :return: List of phonemes.
385 |         """
386 |         if stress_symbol in ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']:
387 |             warnings.warn("The stress symbol intersects with the IPA transcription signs "
388 |                           "or the internal signs of the framework.\nIt may cause unpredictable behaviour.\n"
389 |                           "Avoid using signs from the following list "
390 |                           "['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!")
391 | 
392 |         if save_pauses:
393 |             self._insert_pauses(self._phonemes_list)
394 | 
395 |         escape_symbols = self._get_escape_symbols(save_stresses=save_stresses, save_spaces=save_spaces)
396 |         res = self._join_sounds(escape_symbols, self._phonemes_list).split()
397 | 
398 |         if stress_place is None:
399 |             stress_place = self._stress_place
400 |         if stress_place == 'before':
401 |             res = replace_stress_before(res)
402 | 
403 |         if (stress_symbol != '+') and ('+' not in escape_symbols):
404 |             res = [x.replace('+', stress_symbol) for x in res]
405 | 
406 |         return res
407 | 
408 |     def get_stressed_text(self, stress_place: str = None, stress_symbol: str = '+'):
409 |         """
410 |         :param stress_place: 'after' - to place the stress symbol after the stressed vowel,
411 |             'before' - to place the stress symbol before the stressed vowel.
412 |         :param stress_symbol: The symbol used to mark the stress.
413 |             Be careful not to use signs from the following list ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!
414 |         :return: A text string with stresses.
415 |         """
416 |         if stress_symbol in ['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']:
417 |             warnings.warn("The stress symbol intersects with the IPA transcription signs "
418 |                           "or the internal signs of the framework.\nIt may cause unpredictable behaviour.\n"
419 |                           "Avoid the signs in the following list: "
420 |                           "['.', '_', '-', 'ʲ', 'ᶣ', 'ʷ', 'ˠ', 'ː', '͡']!")
421 | 
422 |         if stress_place is None:
423 |             stress_place = self._stress_place
424 |         if stress_place == 'before':
425 |             res = ' '.join([''.join(replace_stress_before(' '.join(section))) for section in self._stressed_text])
426 |         else:
427 |             res = ' '.join([' '.join(section) for section in self._stressed_text])
428 | 
429 |         if stress_symbol != '+':
430 |             res = res.replace('+', stress_symbol)
431 | 
432 |         return res
433 | 
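
The `stress_place` and `stress_symbol` options documented above can be illustrated with a small standalone sketch. It does not reproduce the package's own `replace_stress_before` (defined in `tools/` and not shown here); `move_stress_before` and `restyle_stress` are hypothetical names, and the sketch assumes the input already carries `+` after each stressed vowel.

```python
import re


def move_stress_before(text: str, stress_symbol: str = '+') -> str:
    """Move each stress mark from after its vowel to before it (hypothetical helper)."""
    # A mark follows exactly one character (the stressed vowel), so swap the two.
    pattern = r'(.)' + re.escape(stress_symbol)
    return re.sub(pattern, lambda m: stress_symbol + m.group(1), text)


def restyle_stress(text: str, stress_place: str = 'after', stress_symbol: str = '+') -> str:
    """Apply the placement first, then swap '+' for the requested symbol."""
    if stress_place == 'before':
        text = move_stress_before(text)
    if stress_symbol != '+':
        text = text.replace('+', stress_symbol)
    return text


print(restyle_stress('но+с'))                       # но+с
print(restyle_stress('но+с', stress_place='before'))  # н+ос
```

Doing the placement before the symbol replacement keeps the swap logic tied to a single known marker, which is the same order the methods above follow.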


--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
 1 | from .RuTranscript import RuTranscript
 2 | from .tools.allophones_tools import get_allophone_info
 3 | from .tools.main_tools import text_norm_tok
 4 | 
 5 | __all__ = [
 6 |     'RuTranscript',
 7 |     'get_allophone_info',
 8 |     'text_norm_tok'
 9 | ]
10 | 


--------------------------------------------------------------------------------
/src/data/alphabet.txt:
--------------------------------------------------------------------------------
1 | a, ɑ, æ, æ., ɐ., ɐ, ə, ʌ, b, bʷ, bː, bːʷ, bˠ, bʲ, bᶣ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, d͡ʒ, d͡ʒᶣ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, i, ɪ, ɪ., j, ʝ, jʷ, jᶣ, ʝʷ, ʝᶣ, k, kʷ, kˠ, kʲ, kː, kːʷ, kʲː, kᶣ, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, m, mʷ, mˠ, mʲ, mː, mːʷ, mʲː, mːˠ, mᶣ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nː, nːʷ, nːˠ, nʲː, nᶣ, o, ɵ, p, pʷ, pː, pːʷ, pʲː, pˠ, pʲ, pᶣ, r, rʷ, rˠ, rʲ, rː, rːʷ, rʲː, rᶣ, r̥, r̥ʲ, s, sʷ, sˠ, sʲ, sː, sːʷ, sʲː, sᶣ, t, tʷ, tˠ, tʲ, tː, tʲː, tᶣ, u, ʉ, ʊ, ᵿ, f, fʷ, fˠ, fʲ, fʲː, fᶣ, x, xʷ, xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ɨ, ɨ̟, ɯ̟ɨ̟, ᵻ, ɛ, e, ɪ., ʔ


--------------------------------------------------------------------------------
/src/data/error_words_stresses_default.txt:
--------------------------------------------------------------------------------
  1 | выходно+го
  2 | бого+в
  3 | о+строго
  4 | ско+рого
  5 | вы+скочившего
  6 | общечелове+ческого
  7 | слепо+го
  8 | манхэ+ттэнского
  9 | заре+цкого
 10 | ви+дного
 11 | уси+ленного
 12 | джа+зового
 13 | люби+мовского
 14 | мару+синого
 15 | циа+нистого
 16 | непро+шеного
 17 | шоссэ+
 18 | сму+глого
 19 | исаа+киевского
 20 | голуби+ного
 21 | како+гото
 22 | ку+пишь
 23 | головно+го
 24 | тре+звого
 25 | тройно+го
 26 | канцеля+рского
 27 | то+лстого
 28 | до+лгого
 29 | кладби+щенского
 30 | грохо+чущего
 31 | четырнадцатиле+тнего
 32 | редакцио+нного
 33 | до+рого
 34 | то+нущего
 35 | изра+ильского
 36 | меньшеви+стского
 37 | худо+го
 38 | давни+шнего
 39 | бе+гающего
 40 | трудолюби+вого
 41 | ку+пит
 42 | воро+вского
 43 | шестидеся+того
 44 | неистреби+мого
 45 | до+лжного
 46 | произво+дственно
 47 | техни+ческого
 48 | обще+ственного
 49 | запоро+жского
 50 | избало+ванного
 51 | семьна+дцатого
 52 | водопрово+дного
 53 | бродя+чего
 54 | крикли+вого
 55 | седовла+сого
 56 | комари+ного
 57 | нену+жного
 58 | цини+чного
 59 | отставно+го
 60 | рога+того
 61 | души+стого
 62 | пусто+го
 63 | военнослу+жащего
 64 | та+ющего
 65 | портно+го
 66 | многомиллио+нного
 67 | како+гонибудь
 68 | со+льного
 69 | кафэ+
 70 | единоутро+бного
 71 | изоби+лующего
 72 | ржано+го
 73 | посторо+ннего
 74 | туре+цкого
 75 | индонези+йского
 76 | уе+здного
 77 | ми+лого
 78 | но+белевского
 79 | было+го
 80 | не+рвного
 81 | гости+ничного
 82 | внешта+тного
 83 | городско+го
 84 | журнали+стского
 85 | ежеме+сячного
 86 | предыду+щего
 87 | а+нгельского
 88 | ро+вного
 89 | ну+жного
 90 | недостаю+щего
 91 | купэ+
 92 | зре+лого
 93 | ни+щего
 94 | прие+зжего
 95 | купи+те
 96 | не+жного
 97 | го+ночного
 98 | за+сранного
 99 | семидеся+того
100 | дзержи+нского
101 | цвета+стого
102 | га+нсовского
103 | пифаго+ровского
104 | листово+го
105 | распа+хнутого
106 | жуко+вского
107 | та+ллиннского
108 | несомне+нного
109 | расти+тельного
110 | уби+того
111 | са+мого
112 | слы+шавшего
113 | диссиде+нтствующего
114 | транскри+пцию
115 | транскри+пция
116 | литературнохудо+жественный


--------------------------------------------------------------------------------
/src/data/irregular_exceptions.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/suralmasha/RuTranscript/30cbc40c5ac368021bcc8a05002fa33cc50ee9b6/src/data/irregular_exceptions.xlsx


--------------------------------------------------------------------------------
/src/data/paired_consonants.txt:
--------------------------------------------------------------------------------
1 | (b, p), (ɡ, k), (z, s), (v, f), (d, t), (ʐ, ʂ), (ʑː, ʂː), (bʲ, pʲ), (ɡʲ, kʲ), (zʲ, sʲ), (zʲː, sʲː), (vʲ, fʲ), (dʲ, tʲ), (ʐʲ, ʂʲ), (ʑʲː, ʂʲː), (bʷ, pʷ), (ɡʷ, kʷ), (zʷ, sʷ), (vʷ, fʷ), (dʷ, tʷ), (ʐʷ, ʂʷ), (ʑːʷ, ʂːʷ), (bᶣ, pᶣ), (ɡᶣ, kᶣ), (zᶣ, sᶣ), (vᶣ, fᶣ), (dᶣ, tᶣ), (ʐᶣ, ʂᶣ), (ʑːᶣ, ɕːᶣ), (bˠ, pˠ), (ɡˠ, kˠ), (zˠ, sˠ), (vˠ, fˠ), (dˠ, tˠ), (ʐˠ, ʂˠ), (ʑːˠ, ʂːˠ)


--------------------------------------------------------------------------------
/src/data/sorted_allophones.txt:
--------------------------------------------------------------------------------
 1 | total_c = b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡː, γ, γʷ, ɡʲ, ɡᶣ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, d͡ʒ, d͡ʒᶣ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, j, ʝ, jʷ, jᶣ, ʝʷ, ʝᶣ, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, m, mʷ, mˠ, mʲ, mᶣ, ɱ, ɱʲ, mː, mːʷ, mʲː, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ, s, sʷ, sˠ, sʲ, sᶣ, sː, sːʷ, sʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, f, fʷ, fˠ, fʲ, fᶣ, fʲː, x, xʷ, xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ʔ
 2 | voiceless_c = ʔ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, f, fʷ, fˠ, fʲ, fᶣ, fʲː, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, s, sʷ, sˠ, sʲ, sᶣ, sː, sʲː, sːʷ, x, xʷ, xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ɕː, ɕːᶣ
 3 | voiced_c = b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, ʐ, dʷ, dˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, j, jʷ, jᶣ, ʝ, ʝʷ, ʝᶣ, m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː, r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ, d͡ʒ, d͡ʒᶣ, d̻͡z̪, l, lʷ, lˠ, lʲ, lᶣ, lː, lʲː, lːʷ, ʐ, ʐʷ, ʐˠ
 4 | soft_c = bʲ, bᶣ, dʲ, dᶣ, dʲː, dːᶣ, vʲ, vᶣ, ɡʲ, ɡᶣ, zʲ, zʲː, zᶣ, lʲ, lᶣ, lʲː, lːᶣ, mʲ, mᶣ, mʲː, ɱ, ɱʲ, nʲ, nᶣ, nʲː, nᶣ, rʲ, rᶣ, rʲː, r̥ʲ, pʲ, pᶣ, pʲː, tʲ, tᶣ, tʲː, fʲ, fᶣ, fʲː, kʲ, kᶣ, kʲː, sʲ, sᶣ, sʲː, xʲ, xᶣ, ʑʲː, ʑːᶣ
 5 | always_soft_c = j, ʝ, jʷ, ʝʷ, jᶣ, ʝᶣ, ɕː, ɕːᶣ, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, d͡ʒ, d͡ʒᶣ
 6 | hard_c = b, bʷ, bˠ, bː, bːʷ, d, dʷ, dˠ, dː, dːʷ, dːˠ, v, vʷ, vˠ, ɡ, ɡʷ, ɡˠ, ɡː, z, zʷ, zˠ, zː, l, lʷ, lˠ, lː, lːʷ, m, mʷ, mˠ, mː, mːʷ, mːˠ, n, nʷ, nˠ, nː, nːʷ, nːˠ, r, rʷ, rˠ, rː, rːʷ, r̥, p, pʷ, pˠ, pː, pːʷ, t, tʷ, tˠ, tː, f, fʷ, fˠ, k, kʷ, kˠ, kː, kːʷ, s, sʷ, sˠ, sː, sːʷ, x, xʷ, xˠ
 7 | always_hard_c = ʔ, γ, γʷ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ
 8 | hissing_c = ɕː, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, ʐ, ʂ, ʂʷ, ʐʷ, d͡ʒ, d͡ʒᶣ, ɕːᶣ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ʐˠ, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ
 9 | bilabial_c = b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ
10 | labiodental_c = f, fʷ, fˠ, fʲ, fᶣ, fʲː, v, vʷ, vˠ, vʲ, vᶣ
11 | dental_c = t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, s, sʷ, sˠ, sʲ, sᶣ, sː, sːʷ, sʲː, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪
12 | palatinodental_c = t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, d͡ʒ, d͡ʒᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, r, rʷ, rˠ, rː, rːʷ, rʲː, r̥, rʲ, rᶣ, r̥ʲ
13 | palatal_c = j, jʷ, jᶣ, ʝᶣ, ʝ, ʝʷ
14 | velar_c = k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, x, xʷ, xˠ, xʲ, xᶣ
15 | glottal_c = ʔ
16 | explosive_c = ʔ, b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ
17 | affricate_c = t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ, d̻͡z̪, t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, d͡ʒ, d͡ʒᶣ
18 | fricative_c = f, fʷ, fˠ, fʲ, fᶣ, fʲː, v, vʷ, vˠ, vʲ, vᶣ, s, sʷ, sˠ, sʲ, sᶣ, sː, sːʷ, sʲː, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ɕː, ɕːᶣ, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, j, jʷ, jᶣ, ʝᶣ, ʝ, ʝʷ, x, xʷ, xˠ, xʲ, xᶣ
19 | nasal_c = m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː
20 | lateral_c = l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ
21 | vibrant_c = r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ
22 | paired_c = (b, p), (ɡ, k), (z, s), (v, f), (d, t), (ʐ, ʂ), (ʑː, ʂː), (bʲ, pʲ), (ɡʲ, kʲ), (zʲ, sʲ), (zʲː, sʲː), (vʲ, fʲ), (dʲ, tʲ), (ʐʲ, ʂʲ), (ʑʲː, ʂʲː), (bʷ, pʷ), (ɡʷ, kʷ), (zʷ, sʷ), (vʷ, fʷ), (dʷ, tʷ), (ʐʷ, ʂʷ), (ʑːʷ, ʂːʷ), (bᶣ, pᶣ), (ɡᶣ, kᶣ), (zᶣ, sᶣ), (vᶣ, fᶣ), (dᶣ, tᶣ), (ʐᶣ, ʂᶣ), (ʑːᶣ, ɕːᶣ), (bˠ, pˠ), (ɡˠ, kˠ), (zˠ, sˠ), (vˠ, fˠ), (dˠ, tˠ), (ʐˠ, ʂˠ), (ʑːˠ, ʂːˠ)
23 | total_v = a, ɑ, æ, æ., ɐ., ɐ, ə, ʌ, i, ɪ, ɪ., o, ɵ, u, ʉ, ʊ, ᵿ, ɨ, ᵻ, ɨ̟, ɯ̟ɨ̟, ɛ, e
24 | front_v = i, e, ɛ, æ, æ.
25 | near_front_v = ɪ, ɪ.
26 | central_v = ɨ, ɨ̟, ɯ̟ɨ̟, ᵻ, ɵ, ə, a
27 | near_back_v = ʊ, ᵿ, ɐ., ɐ
28 | back_v = u, ʉ, o, ʌ, ɑ
29 | close_v = i, ɨ, ɨ̟, ɯ̟ɨ̟, u, ʉ
30 | near_close_v = ɪ, ɪ., ʊ, ᵿ, ᵻ
31 | close_mid_v = e, ɵ, o
32 | mid_v = ə
33 | open_mid_v = ɛ, ʌ
34 | near_open_v = æ, æ., ɐ., ɐ
35 | open_v = a, ɑ
36 | rounded_v = o, ɵ, u, ʉ, ʊ, ᵿ
37 | velarize_v = ɨ, ɨ̟, ɯ̟ɨ̟, ᵻ
38 | sonorous_class = r, rʷ, rˠ, rʲ, rᶣ, rː, rːʷ, rʲː, r̥, r̥ʲ, l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, m, mʷ, mˠ, mʲ, mᶣ, mː, mːʷ, mʲː, mːˠ, ɱ, ɱʲ, n, nʷ, nˠ, nʲ, nᶣ, nː, nːʷ, nːˠ, nʲː
39 | voiced_class = ʒ, ʒʷ, ʒˠ, ʒː, ʒːʷ, ʒːˠ, ʒʲː, ʒːᶣ, d͡ʒ, d͡ʒᶣ, d̻͡z̪, ʐ, ʐʷ, ʐˠ, ʑː, ʑːʷ, ʑːˠ, ʑʲː, ʑːᶣ, b, bʷ, bˠ, bʲ, bᶣ, bː, bːʷ, v, vʷ, vˠ, vʲ, vᶣ, ɡ, ɡʷ, ɡˠ, ɡʲ, ɡᶣ, ɡː, γ, γʷ, d, dʷ, dˠ, dʲ, dᶣ, dː, dːʷ, dːˠ, dʲː, dːᶣ, dʷ, dˠ, z, zʷ, zˠ, zʲ, zᶣ, zː, zʲː, j, jʷ, jᶣ, ʝ, ʝʷ, ʝᶣ
40 | voiceless_class = t͡ɕ, t͡ɕᶣ, t͡ɕː, t͡ɕːᶣ, ɕː, ɕːᶣ, ʂ, ʂʷ, ʂˠ, ʂː, ʂːʷ, ʂːˠ, ʔ, p, pʷ, pˠ, pʲ, pᶣ, pː, pːʷ, pʲː, f, fʷ, fˠ, fʲ, fᶣ, fʲː, k, kʷ, kˠ, kʲ, kᶣ, kː, kːʷ, kʲː, t, tʷ, tˠ, tʲ, tᶣ, tː, tʲː, s, sʷ, sˠ, sʲ, sᶣ, sː, sʲː, sːʷ, x, xʷ, xˠ, xʲ, xᶣ, t͡s, t͡sʷ, t͡sˠ, t͡sː, t͡sːʷ, t͡sːˠ
41 | complex_experiment = l, lʷ, lˠ, lʲ, lᶣ, lː, lːʷ, lʲː, lːᶣ, ɨ, ᵻ, ɨ̟, ɯ̟ɨ̟, j, jʷ, jᶣ, ʝ, ʝʷ, ʝᶣ, o, ɵ, u, ʉ, ʊ, ᵿ
42 | rare_experiment = zː, d̻͡z̪, tʲː, r̥ʲ, ʑː, d͡ʒ, ɱ, tː, lː, nʲː, mː, sʲː, sː, xʲ, fʲ, f, ɡʲ, ʝ, nː, r̥, zʲ, ɕː
43 | random_vowels_experiment = ɐ., u, ɛ, ɯ̟ɨ̟, ɵ
44 | long_consonants_experiment = bː, ɡː, dː, dːʷ, dːˠ, dʲː, dːᶣ, zː, zʲː, kː, kːʷ, kʲː, lː, lːʷ, lʲː, lːᶣ, mː, mːʷ, mʲː, nː, nːʷ, nːˠ, nʲː, pː, pːʷ, pʲː, rː, rːʷ, rʲː, sː, sːʷ, sʲː, tː, tʲː, fʲː
45 | 
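
The `name = item, item, ...` layout of this file lends itself to a simple parser. The sketch below is only an illustration of how such a file could be read; the package's real loader lives in `tools/` and is not shown here, and `parse_allophone_classes` is a hypothetical name. It assumes each class sits on one line and that paired classes are written as `(a, b)` tuples, as in `paired_c`.

```python
import re

# Matches one "(voiced, voiceless)" pair such as "(b, p)".
PAIR_RE = re.compile(r'\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)')


def parse_allophone_classes(text: str) -> dict:
    """Parse 'name = a, b, c' lines into a dict of lists (hypothetical helper)."""
    classes = {}
    for line in text.splitlines():
        if '=' not in line:
            continue  # skip blank or malformed lines
        name, _, values = line.partition('=')
        name, values = name.strip(), values.strip()
        pairs = PAIR_RE.findall(values)
        if pairs:
            classes[name] = [(a, b) for a, b in pairs]
        else:
            # Strip each item so stray whitespace around commas is tolerated.
            classes[name] = [v.strip() for v in values.split(',') if v.strip()]
    return classes


sample = "glottal_c = ʔ\nopen_v = a, ɑ\npaired_c = (b, p), (z, s)"
classes = parse_allophone_classes(sample)
print(classes['open_v'])    # ['a', 'ɑ']
print(classes['paired_c'])  # [('b', 'p'), ('z', 's')]
```

Splitting on commas means that two items separated only by a space would be read as one entry, so every item in the data file needs its own comma.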


--------------------------------------------------------------------------------
/src/tests/test_consonants.py:
--------------------------------------------------------------------------------
  1 | import unittest
  2 | 
  3 | from .. import RuTranscript
  4 | 
  5 | 
  6 | class TestConsonants(unittest.TestCase):
  7 | 
  8 |     def test_fricative_g_1(self):
  9 |         testing_text = 'господи'
 10 |         testing_a_text = 'го+споди'
 11 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 12 |         ru_transcript.transcribe()
 13 |         print(testing_text, ru_transcript.get_allophones())
 14 |         self.assertEqual(['γʷ', 'o', 's', 'p', 'ə', 'dʲ', 'ɪ'], ru_transcript.get_allophones())
 15 | 
 16 |     def test_fricative_g_2(self):
 17 |         testing_text = 'ах да'
 18 |         testing_a_text = 'а+х да+'
 19 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 20 |         ru_transcript.transcribe()
 21 |         print(testing_text, ru_transcript.get_allophones())
 22 |         self.assertEqual(['a', 'γ', 'd', 'ʌ'], ru_transcript.get_allophones())
 23 | 
 24 |     def test_nasal_m_n(self):  # 'м' / 'н' before labiodental consonants
 25 |         testing_text = 'амфора'
 26 |         testing_a_text = 'а+мфора'
 27 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 28 |         ru_transcript.transcribe()
 29 |         print(testing_text, ru_transcript.get_allophones())
 30 |         self.assertEqual(['a', 'ɱ', 'f', 'ə', 'r', 'ʌ'], ru_transcript.get_allophones())
 31 | 
 32 |     def test_silent_r(self):  # 'р' before voiceless consonants and at the end of a word
 33 |         testing_text = 'арфа'
 34 |         testing_a_text = 'а+рфа'
 35 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 36 |         ru_transcript.transcribe()
 37 |         print(testing_text, ru_transcript.get_allophones())
 38 |         self.assertEqual(['a', 'r̥', 'f', 'ʌ'], ru_transcript.get_allophones())
 39 | 
 40 |     def test_long_sh(self):  # long 'ш' in the cluster 'сш'
 41 |         testing_text = 'сшить'
 42 |         testing_a_text = 'сши+ть'
 43 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 44 |         ru_transcript.transcribe()
 45 |         print(testing_text, ru_transcript.get_allophones())
 46 |         self.assertEqual(['ʂːˠ', 'ɨ', 'tʲ'], ru_transcript.get_allophones())
 47 | 
 48 |     def test_ts(self):  # 'ц'
 49 |         testing_text = 'цапля'
 50 |         testing_a_text = 'ца+пля'
 51 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 52 |         ru_transcript.transcribe()
 53 |         print(testing_text, ru_transcript.get_allophones())
 54 |         self.assertEqual(['t͡s', 'ɐ.', 'p', 'lʲ', 'æ.'], ru_transcript.get_allophones())
 55 | 
 56 |     def test_voiced_ts(self):  # 'ц' before a voiced consonant
 57 |         testing_text = 'плацдарм'
 58 |         testing_a_text = 'плацда+рм'
 59 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 60 |         ru_transcript.transcribe()
 61 |         print(testing_text, ru_transcript.get_allophones())
 62 |         self.assertEqual(['p', 'l', 'ɐ', 'd̻͡z̪', 'd', 'a', 'r', 'm'], ru_transcript.get_allophones())
 63 | 
 64 |     def test_dj(self):  # the cluster 'дж'
 65 |         testing_text = 'джунгли'
 66 |         testing_a_text = 'джу+нгли'
 67 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 68 |         ru_transcript.transcribe()
 69 |         print(testing_text, ru_transcript.get_allophones())
 70 |         self.assertEqual(['d͡ʒᶣ', 'ʉ', 'n', 'ɡ', 'lʲ', 'ɪ'], ru_transcript.get_allophones())
 71 | 
 72 |     def test_shch_1(self):  # 'щ'
 73 |         testing_text = 'щегол'
 74 |         testing_a_text = 'щего+л'
 75 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 76 |         ru_transcript.transcribe()
 77 |         print(testing_text, ru_transcript.get_allophones())
 78 |         self.assertEqual(['ɕː', 'ə', 'ɡʷ', 'o', 'l'], ru_transcript.get_allophones())
 79 | 
 80 |     def test_shch_2(self):  # 'ж' before a voiceless consonant
 81 |         testing_text = 'мужчина'
 82 |         testing_a_text = 'мужчи+на'
 83 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 84 |         ru_transcript.transcribe()
 85 |         print(testing_text, ru_transcript.get_allophones())
 86 |         self.assertEqual(['mʷ', 'ʊ', 'ɕː', 'i', 'n', 'ʌ'], ru_transcript.get_allophones())
 87 | 
 88 |     def test_shch_3(self):  # the clusters 'сч', 'зч', 'жч'
 89 |         testing_text = 'считать'
 90 |         testing_a_text = 'счита+ть'
 91 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 92 |         ru_transcript.transcribe()
 93 |         print(testing_text, ru_transcript.get_allophones())
 94 |         self.assertEqual(['ɕː', 'ɪ', 't', 'a', 'tʲ'], ru_transcript.get_allophones())
 95 | 
 96 |     def test_ch(self):  # 'ч'
 97 |         testing_text = 'течь'
 98 |         testing_a_text = 'те+чь'
 99 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
100 |         ru_transcript.transcribe()
101 |         print(testing_text, ru_transcript.get_allophones())
102 |         self.assertEqual(['tʲ', 'e', 't͡ɕ'], ru_transcript.get_allophones())
103 | 
104 |     def test_long_ge_1(self):  # long 'ж'
105 |         testing_text = 'жужжать'
106 |         testing_a_text = 'жужжа+ть'
107 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
108 |         ru_transcript.transcribe()
109 |         print(testing_text, ru_transcript.get_allophones())
110 |         self.assertEqual(['ʐʷ', 'ʊ', 'ʑː', 'ɐ.', 'tʲ'], ru_transcript.get_allophones())
111 | 
112 |     def test_long_ge_2(self):  # 'щ' before a voiced consonant
113 |         testing_text = 'вещдок'
114 |         testing_a_text = 'вещдо+к'
115 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
116 |         ru_transcript.transcribe()
117 |         print(testing_text, ru_transcript.get_allophones())
118 |         self.assertEqual(['vʲ', 'ɪ', 'ʑː', 'dʷ', 'o', 'k'], ru_transcript.get_allophones())
119 | 
120 |     def test_long_ge_3(self):  # the cluster 'зж'
121 |         testing_text = 'заезжий'
122 |         testing_a_text = 'зае+зжий'
123 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
124 |         ru_transcript.transcribe()
125 |         print(testing_text, ru_transcript.get_allophones())
126 |         self.assertEqual(['z', 'ɐ', 'j', 'e', 'ʑːˠ', 'ɨ', 'j'], ru_transcript.get_allophones())
127 | 
128 |     def test_j_1(self):  # 'й'
129 |         testing_text = 'май'
130 |         testing_a_text = 'ма+й'
131 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
132 |         ru_transcript.transcribe()
133 |         print(testing_text, ru_transcript.get_allophones())
134 |         self.assertEqual(['m', 'a', 'j'], ru_transcript.get_allophones())
135 | 
136 |     def test_j_2(self):  # iotated vowel after the separating ъ and ь
137 |         testing_text = 'объявление'
138 |         testing_a_text = 'объявле+ние'
139 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
140 |         ru_transcript.transcribe()
141 |         print(testing_text, ru_transcript.get_allophones())
142 |         self.assertEqual(['ə', 'bʲ', 'j', 'ɪ', 'v', 'lʲ', 'e', 'nʲ', 'j', 'æ.'], ru_transcript.get_allophones())
143 | 
144 |     def test_j_3(self):  # iotated vowel between two vowels
145 |         testing_text = 'заяц'
146 |         testing_a_text = 'за+яц'
147 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
148 |         ru_transcript.transcribe()
149 |         print(testing_text, ru_transcript.get_allophones())
150 |         self.assertEqual(['z', 'a', 'j', 'ɪ.', 't͡s'], ru_transcript.get_allophones())
151 | 
152 |     def test_j_4(self):  # iotated vowel before a stressed vowel
153 |         testing_text = 'заезжий'
154 |         testing_a_text = 'зае+зжий'
155 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
156 |         ru_transcript.transcribe()
157 |         print(testing_text, ru_transcript.get_allophones())
158 |         self.assertEqual(['z', 'ɐ', 'j', 'e', 'ʑːˠ', 'ɨ', 'j'], ru_transcript.get_allophones())
159 | 
160 |     def test_j_5(self):  # iotated vowel at the beginning of a word
161 |         testing_text = 'яхта'
162 |         testing_a_text = 'я+хта'
163 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
164 |         ru_transcript.transcribe()
165 |         print(testing_text, ru_transcript.get_allophones())
166 |         self.assertEqual(['ʝ', 'æ', 'x', 't', 'ʌ'], ru_transcript.get_allophones())
167 | 
168 |     def test_j_first(self):  # yot at the beginning of a syllable
169 |         testing_text = 'я'
170 |         testing_a_text = 'я+'
171 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
172 |         ru_transcript.transcribe()
173 |         print(testing_text, ru_transcript.get_allophones())
174 |         self.assertEqual(['ʝ', 'æ'], ru_transcript.get_allophones())
175 | 
176 |     def test_long_consonant_junction_of_words(self):  # long consonant at a word boundary
177 |         testing_text = 'вот так'
178 |         testing_a_text = 'вот так'
179 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
180 |         ru_transcript.transcribe()
181 |         print(testing_text, ru_transcript.get_allophones())
182 |         self.assertEqual(['v', 'ɐ', 'tː', 'a', 'k'], ru_transcript.get_allophones())
183 | 
184 |     def test_consonants_devoicing_at_the_end_of_a_word(self):
185 |         testing_text = 'всерьёз'
186 |         ru_transcript = RuTranscript(testing_text)
187 |         ru_transcript.transcribe()
188 |         print(testing_text, ru_transcript.get_allophones())
189 |         self.assertEqual(['vʲ', 'sʲ', 'ɪ', 'rʲ', 'jᶣ', 'ɵ', 's'], ru_transcript.get_allophones())
190 | 
191 | 
192 | if __name__ == '__main__':
193 |     unittest.main()
194 | 


--------------------------------------------------------------------------------
/src/tests/test_modules.py:
--------------------------------------------------------------------------------
 1 | import unittest
 2 | 
 3 | from .. import RuTranscript
 4 | from .. import text_norm_tok
 5 | 
 6 | 
 7 | class TestModules(unittest.TestCase):
 8 | 
 9 |     def test_stress_one_syllable(self):
10 |         testing_text = 'нос'
11 |         ru_transcript = RuTranscript(testing_text)
12 |         ru_transcript.transcribe()
13 |         print(testing_text, ru_transcript.get_stressed_text())
14 |         self.assertEqual('но+с', ru_transcript.get_stressed_text())
15 | 
16 |     def test_stress_yo(self):
17 |         testing_text = 'ёлка'
18 |         ru_transcript = RuTranscript(testing_text)
19 |         ru_transcript.transcribe()
20 |         print(testing_text, ru_transcript.get_stressed_text())
21 |         self.assertEqual('ё+лка', ru_transcript.get_stressed_text())
22 | 
23 |     def test_stress_readme_transcription(self):
24 |         testing_text = 'Как получить транскрипцию?'
25 |         ru_transcript = RuTranscript(testing_text)
26 |         ru_transcript.transcribe()
27 |         print(testing_text, ru_transcript.get_stressed_text())
28 |         self.assertEqual('ка+к получи+ть транскри+пцию', ru_transcript.get_stressed_text())
29 | 
30 |     def test_replace_e(self):
31 |         testing_text = 'синтез речи в библиотеке'
32 |         ru_transcript = RuTranscript(testing_text)
33 |         ru_transcript.transcribe()
34 |         print(testing_text, ru_transcript._tokens)
35 |         self.assertEqual([['синтэз', 'речи', 'в', 'библиотеке']], ru_transcript._tokens)
36 | 
37 |     def test_replace_yo(self):
38 |         testing_text = 'елка для ее ежика перышка подвел конек мед'
39 |         ru_transcript = RuTranscript(testing_text)
40 |         ru_transcript.transcribe()
41 |         print(testing_text, ru_transcript._tokens)
42 |         self.assertEqual([['ёлка', 'для', 'её', 'ёжика', 'пёрышка', 'подвёл', 'конёк', 'мёд']], ru_transcript._tokens)
43 | 
44 |     def test_replace_user_dict(self):
45 |         testing_text = 'TTS - это увлекательно'
46 |         ru_transcript = RuTranscript(testing_text, replacement_dict={"tts": "синтез речи"})
47 |         ru_transcript.transcribe()
48 |         print(testing_text, ru_transcript._tokens)
49 |         self.assertEqual([['синтэз', 'речи'], ['это', 'увлекательно']], ru_transcript._tokens)
50 | 
51 |     def test_dirty_text(self):
52 |         testing_text = 'синтез речи - это#$ «увлекательно»'
53 |         res = text_norm_tok(testing_text)
54 |         print(testing_text, res)
55 |         self.assertEqual([['синтез', 'речи', '-', 'это', 'увлекательно']], res)
56 | 
57 |     def test_error_stress(self):
58 |         testing_text = 'литературнохудожественный'
59 |         ru_transcript = RuTranscript(testing_text)
60 |         ru_transcript.transcribe()
61 |         print(testing_text, ru_transcript.get_stressed_text())
62 |         self.assertEqual('литературнохудо+жественный', ru_transcript.get_stressed_text())
63 | 
64 | 
65 | if __name__ == '__main__':
66 |     unittest.main()
67 | 


--------------------------------------------------------------------------------
/src/tests/test_phrases.py:
--------------------------------------------------------------------------------
  1 | import unittest
  2 | 
  3 | from .. import RuTranscript
  4 | 
  5 | 
  6 | class TestPhrases(unittest.TestCase):
  7 | 
  8 |     def test_readme_transcription(self):
  9 |         testing_text = 'Как получить транскрипцию?'
 10 |         testing_a_text = 'Ка+к получи+ть транскри+пцию?'
 11 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 12 |         ru_transcript.transcribe()
 13 |         print(testing_text, ru_transcript.get_allophones())
 14 |         self.assertEqual(['k', 'a', 'k', 'p', 'ə', 'lʷ', 'ʊ', 't͡ɕ', 'i', 'tʲ', 't', 'r', 'ɐ', 'n', 's', 'k', 'rʲ',
 15 |                           'i', 'p', 't͡sˠ', 'ɨ', 'jᶣ', 'ᵿ'], ru_transcript.get_allophones())
 16 | 
 17 |     def test_readme_comma(self):
 18 |         testing_text = 'Мышка, кошка и собака'
 19 |         testing_a_text = 'Мы+шка, ко+шка и+ соба+ка'
 20 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 21 |         ru_transcript.transcribe()
 22 |         print(testing_text, ru_transcript.get_allophones())
 23 |         self.assertEqual(['mˠ', 'ɨ', 'ʂ', 'k', 'ʌ', 'kʷ', 'o', 'ʂ', 'k', 'ʌ', 'i', 's', 'ɐ', 'b', 'a', 'k', 'ʌ'],
 24 |                          ru_transcript.get_allophones())
 25 | 
 26 |     def test_1(self):
 27 |         testing_text = 'И никогда, ни в единой самой убогой самой фантастической петербургской компании ' \
 28 |                        '— меня не объявляли гением.'
 29 |         testing_a_text = 'И никогд+а, н+и в ед+иной с+амой уб+огой с+амой фантаст+ической петерб+ургской комп+ании ' \
 30 |                          '— мен+я н+е объявл+яли г+ением.'
 31 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 32 |         ru_transcript.transcribe()
 33 |         print(testing_text, ru_transcript.get_allophones())
 34 |         self.assertEqual(['i', 'nʲ', 'ɪ', 'k', 'ɐ', 'ɡ', 'd', 'a', 'nʲ', 'ɪ', 'vʲ', 'j', 'ɪ', 'dʲ', 'i', 'n', 'ə', 'j',
 35 |                           's', 'a', 'm', 'ə', 'j', 'ᵿ', 'bʷ', 'o', 'ɡ', 'ə', 'j', 's', 'a', 'm', 'ə', 'j', 'f', 'ə',
 36 |                           'n', 't', 'ɐ', 'sʲ', 'tʲ', 'i', 't͡ɕ', 'ə', 's', 'k', 'ə', 'j', 'pʲ', 'ɪ.', 'tʲ', 'ɪ', 'r',
 37 |                           'bʷ', 'u', 'r', 'ʐ', 's', 'k', 'ə', 'j', 'k', 'ɐ', 'm', 'p', 'a', 'nʲ', 'ɪ', 'i', 'mʲ', 'ɪ',
 38 |                           'nʲ', 'æ', 'nʲ', 'ɪ.', 'ə', 'bʲ', 'j', 'ɪ', 'v', 'lʲ', 'æ', 'lʲ', 'ɪ', 'ɡʲ', 'e', 'nʲ',
 39 |                           'ɪ', 'j', 'ɪ.', 'm'], ru_transcript.get_allophones())
 40 | 
 41 |     def test_2(self):
 42 |         testing_text = 'Но против Агнии Францевны, у меня было сильное оружие — вежливость.'
 43 |         testing_a_text = 'Н+о пр+отив +Агнии Фр+анцевны, у мен+я б+ыло с+ильное ор+ужие — в+ежливость.'
 44 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 45 |         ru_transcript.transcribe()
 46 |         print(testing_text, ru_transcript.get_allophones())
 47 |         self.assertEqual(['n', 'ɐ', 'p', 'rʷ', 'o', 'tʲ', 'ɪ', 'v', 'a', 'ɡ', 'nʲ', 'ɪ', 'i', 'f', 'r', 'a', 'n',
 48 |                           't͡s', 'ə', 'v', 'nˠ', 'ᵻ', 'ᵿ', 'mʲ', 'ɪ', 'nʲ', 'æ', 'bˠ', 'ɨ', 'l', 'ʌ', 'sʲ', 'i', 'lʲ',
 49 |                           'n', 'ə', 'j', 'æ.', 'ɐ', 'rʷ', 'u', 'ʐ', 'j', 'æ.', 'vʲ', 'e', 'ʐ', 'lʲ', 'ɪ', 'v', 'ə',
 50 |                           'sʲ', 'tʲ'], ru_transcript.get_allophones())
 51 | 
 52 |     def test_3(self):
 53 |         testing_text = 'Что апперцепция у Бальзака неорганична.'
 54 |         testing_a_text = 'Чт+о апперц+епция у Бальз+ака неорган+ична.'
 55 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 56 |         ru_transcript.transcribe()
 57 |         print(testing_text, ru_transcript.get_allophones())
 58 |         self.assertEqual(['ʂ', 'tʷ', 'o', 'ə', 'p', 'pʲ', 'ɪ', 'r̥', 't͡sˠ', 'ᵻ', 'p', 't͡sˠ', 'ɨ', 'j', 'æ.', 'ᵿ', 'b',
 59 |                           'ɐ', 'lʲ', 'z', 'a', 'k', 'ʌ', 'nʲ', 'ɪ.', 'ə', 'r', 'ɡ', 'ɐ', 'nʲ', 'i', 't͡ɕ', 'n', 'ʌ'],
 60 |                          ru_transcript.get_allophones())
 61 | 
 62 |     def test_4(self):
 63 |         testing_text = 'Башкирия, Уфа, эвакуация, мне три недели.'
 64 |         testing_a_text = 'Башк+ирия, Уф+а, эваку+ация, мн+е тр+и нед+ели.'
 65 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 66 |         ru_transcript.transcribe()
 67 |         print(testing_text, ru_transcript.get_allophones())
 68 |         self.assertEqual(['b', 'ɐ', 'ʂ', 'kʲ', 'i', 'rʲ', 'j', 'æ.', 'ᵿ', 'f', 'a', 'ɪ.', 'v', 'ə', 'kʷ', 'ʊ', 'æ',
 69 |                           't͡sˠ', 'ɨ', 'j', 'æ.', 'mʲ', 'nʲ', 'e', 't', 'rʲ', 'i', 'nʲ', 'ɪ', 'dʲ', 'e', 'lʲ', 'ɪ'],
 70 |                          ru_transcript.get_allophones())
 71 | 
 72 |     def test_5(self):
 73 |         testing_text = 'Настоящие мужчины гибнут на передовой.'
 74 |         testing_a_text = 'Насто+ящие мужч+ины г+ибнут н+а передов+ой.'
 75 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 76 |         ru_transcript.transcribe()
 77 |         print(testing_text, ru_transcript.get_allophones())
 78 |         self.assertEqual(['n', 'ə', 's', 't', 'ɐ', 'j', 'æ', 'ɕː', 'ɪ', 'j', 'æ.', 'mʷ', 'ʊ', 'ɕː', 'i', 'nˠ', 'ᵻ',
 79 |                           'ɡʲ', 'i', 'b', 'nʷ', 'ʊ', 't', 'n', 'ə', 'pʲ', 'ɪ.', 'rʲ', 'ɪ.', 'd', 'ɐ', 'vʷ', 'o', 'j'],
 80 |                          ru_transcript.get_allophones())
 81 | 
 82 |     def test_6(self):
 83 |         testing_text = 'Неуклюжие эпиграммы.'
 84 |         testing_a_text = 'Неукл+южие эпигр+аммы.'
 85 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 86 |         ru_transcript.transcribe()
 87 |         print(testing_text, ru_transcript.get_allophones())
 88 |         self.assertEqual(['nʲ', 'ɪ.', 'ᵿ', 'k', 'lᶣ', 'ʉ', 'ʐ', 'j', 'æ.', 'ɪ.', 'pʲ', 'ɪ', 'ɡ', 'r', 'a', 'mːˠ', 'ᵻ'],
 89 |                          ru_transcript.get_allophones())
 90 | 
 91 |     def test_7(self):
 92 |         testing_text = 'Да и с Вольфом у меня хорошие отношения.'
 93 |         testing_a_text = 'Д+а и с В+ольфом у мен+я хор+ошие отнош+ения.'
 94 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
 95 |         ru_transcript.transcribe()
 96 |         self.assertEqual(['d', 'a', 'i', 's', 'vʷ', 'o', 'lʲ', 'f', 'ə', 'm', 'ᵿ', 'mʲ', 'ɪ', 'nʲ', 'æ', 'x', 'ɐ',
 97 |                           'rʷ', 'o', 'ʂ', 'j', 'æ.', 'ə', 't', 'n', 'ɐ', 'ʂˠ', 'ᵻ', 'nʲ', 'j', 'æ.'],
 98 |                          ru_transcript.get_allophones())
 99 | 
100 |     def test_8(self):
101 |         testing_text = 'Хотя наиболее чудовищные эпатирующие подробности лагерной жизни, я как говорится опустил.'
102 |         testing_a_text = 'Хот+я наиб+олее чуд+овищные эпат+ирующие подр+обности л+агерной ж+изни, я к+ак говор+ится ' \
103 |                          'опуст+ил.'
104 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
105 |         ru_transcript.transcribe()
106 |         print(testing_text, ru_transcript.get_allophones())
107 |         self.assertEqual(['x', 'ɐ', 'tʲ', 'æ', 'n', 'ə', 'i', 'bʷ', 'o', 'lʲ', 'ɪ.', 'j', 'æ.', 't͡ɕᶣ', 'ᵿ', 'dʷ', 'o',
108 |                           'vʲ', 'ɪ', 'ɕː', 'nˠ', 'ᵻ', 'j', 'æ.', 'ɪ.', 'p', 'ɐ', 'tʲ', 'i', 'rʷ', 'ʊ', 'jᶣ', 'ᵿ', 'ɕː',
109 |                           'ɪ', 'j', 'æ.', 'p', 'ɐ', 'd', 'rʷ', 'o', 'b', 'n', 'ə', 'sʲ', 'tʲ', 'ɪ', 'l', 'a', 'ɡʲ',
110 |                           'ɪ.', 'r', 'n', 'ɐ', 'j', 'ʐˠ', 'ɨ', 'zʲ', 'nʲ', 'ɪ', 'ʝ', 'æ', 'k', 'a', 'k', 'ɡ', 'ə', 'v',
111 |                           'ɐ', 'rʲ', 'i', 't͡s', 'ə', 'ə', 'pʷ', 'ʊ',
112 |                           'sʲ', 'tʲ', 'i', 'l'], ru_transcript.get_allophones())
113 | 
114 |     def test_yo(self):
115 |         testing_text = 'елка для ее ежика перышка подвел конек мед.'
116 |         ru_transcript = RuTranscript(testing_text)
117 |         ru_transcript.transcribe()
118 |         print(testing_text, ru_transcript.get_allophones())
119 |         self.assertEqual(['ʝᶣ', 'ɵ', 'l', 'k', 'ʌ', 'd', 'lʲ', 'æ', 'j', 'ɪ', 'jᶣ', 'ɵ', 'jᶣ', 'ɵ', 'ʐˠ', 'ɨ', 'k', 'ʌ',
120 |                           'pᶣ', 'ɵ', 'rˠ', 'ᵻ', 'ʂ', 'k', 'ʌ', 'p', 'ɐ', 'd', 'vᶣ', 'ɵ', 'l', 'k', 'ɐ', 'nᶣ', 'ɵ', 'kʲ',
121 |                           'mᶣ', 'ɵ', 't'], ru_transcript.get_allophones())
122 | 
123 |     def test_dashes(self):
124 |         testing_text = 'Синтез речи - это что-то увлекательное!'
125 |         ru_transcript = RuTranscript(testing_text)
126 |         ru_transcript.transcribe()
127 |         print(testing_text, ru_transcript.get_allophones())
128 |         self.assertEqual(['sʲ', 'i', 'n', 'tˠ', 'ᵻ', 'z', 'rʲ', 'e', 't͡ɕ', 'ɪ', 'ɛ', 't', 'ʌ', 'ʂ', 'tʷ', 'o', 't',
129 |                           'ʌ', 'ᵿ', 'v', 'lʲ', 'ɪ', 'k', 'a', 'tʲ', 'ɪ.', 'lʲ', 'n', 'ə', 'j', 'æ.'],
130 |                          ru_transcript.get_allophones())
131 | 
132 |     def test_spaces(self):
133 |         testing_text = 'Синтез речи     - это что-то        увлекательное!\n'
134 |         ru_transcript = RuTranscript(testing_text)
135 |         ru_transcript.transcribe()
136 |         print(testing_text, ru_transcript.get_allophones())
137 |         self.assertEqual(['sʲ', 'i', 'n', 'tˠ', 'ᵻ', 'z', 'rʲ', 'e', 't͡ɕ', 'ɪ', 'ɛ', 't', 'ʌ', 'ʂ', 'tʷ', 'o', 't',
138 |                           'ʌ', 'ᵿ', 'v', 'lʲ', 'ɪ', 'k', 'a', 'tʲ', 'ɪ.', 'lʲ', 'n', 'ə', 'j', 'æ.'],
139 |                          ru_transcript.get_allophones())
140 | 
141 |     def test_skipping_proclitic(self):
142 |         testing_text = 'Они расцветают и становятся заметными лишь на фоне какого-нибудь безобразия.'
143 |         testing_a_text = 'Он+и расцвет+ают и стан+овятся зам+етными л+ишь н+а ф+оне какого-нибудь безобр+азия.'
144 |         ru_transcript = RuTranscript(testing_text, testing_a_text, stress_place='before')
145 |         ru_transcript.transcribe()
146 |         print(testing_text, ru_transcript.get_allophones())
147 |         self.assertEqual(['ɐ', 'nʲ', 'i', 'r', 'ə', 's', 'd̻͡z̪', 'vʲ', 'ɪ', 't', 'a', 'jᶣ', 'ᵿ', 't', 'ᵻ', 's', 't',
148 |                           'ɐ', 'nʷ', 'o', 'vʲ', 'ɪ.', 't͡s', 'ə', 'z', 'ɐ', 'mʲ', 'e', 't', 'nˠ', 'ᵻ', 'mʲ', 'ɪ',
149 |                           'lʲ', 'ɪ', 'ʂ', 'n', 'ɐ', 'fʷ', 'o', 'nʲ', 'æ.', 'k', 'ɐ', 'kʷ', 'o', 'v', 'ə', 'nʲ', 'ɪ',
150 |                           'bʷ', 'ʊ', 'dʲ', 'bʲ', 'ɪ.', 'z', 'ɐ', 'b', 'r', 'a', 'zʲ', 'j', 'æ.'],
151 |                          ru_transcript.get_allophones())
152 | 
153 |     def test_skipping_enclitic(self):
154 |         testing_text = 'Да это же писатель!'
155 |         ru_transcript = RuTranscript(testing_text)
156 |         ru_transcript.transcribe()
157 |         print(testing_text, ru_transcript.get_allophones())
158 |         self.assertEqual(['d', 'ɐ', 'e', 't', 'ə', 'ʐ', 'ə', 'pʲ', 'ɪ', 's', 'a', 'tʲ', 'ɪ.', 'lʲ'],
159 |                          ru_transcript.get_allophones())
160 | 
161 | 
162 | if __name__ == '__main__':
163 |     unittest.main()
164 | 
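
Note on `test_dashes` and `test_spaces` above: both expect the same allophone list, so the repeated spaces and the trailing newline evidently do not change the result. A minimal sketch of one way such input could be normalized, assuming (this is not taken from the library's source) that collapsing whitespace runs is sufficient. `normalize_spaces` is a hypothetical helper name:

```python
import re


def normalize_spaces(text: str) -> str:
    # Hypothetical helper, not part of RuTranscript.
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single
    # space, then trim leading and trailing whitespace.
    return re.sub(r'\s+', ' ', text).strip()


messy = 'Синтез речи     - это что-то        увлекательное!\n'
print(normalize_spaces(messy))
```

With this sketch the `test_spaces` input reduces to the `test_dashes` input, which is consistent with the two tests sharing one expected list.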


--------------------------------------------------------------------------------
/src/tests/test_vowels.py:
--------------------------------------------------------------------------------
  1 | import unittest
  2 | 
  3 | from .. import RuTranscript
  4 | 
  5 | 
  6 | class TestVowels(unittest.TestCase):
  7 | 
  8 |     def test_vowel_a_1(self):  # stressed, after a hard consonant
  9 |         testing_text = 'трава'
 10 |         testing_a_text = 'трава+'
 11 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 12 |         ru_transcript.transcribe()
 13 |         print(testing_text, ru_transcript.get_allophones())
 14 |         self.assertEqual(['t', 'r', 'ɐ', 'v', 'a'], ru_transcript.get_allophones())
 15 | 
 16 |     def test_vowel_a_2(self):  # stressed, after a hard consonant, before 'л'
 17 |         testing_text = 'палка'
 18 |         testing_a_text = 'па+лка'
 19 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 20 |         ru_transcript.transcribe()
 21 |         print(testing_text, ru_transcript.get_allophones())
 22 |         self.assertEqual(['p', 'ɑ', 'l', 'k', 'ʌ'], ru_transcript.get_allophones())
 23 | 
 24 |     def test_vowel_a_3(self):  # stressed, not after a hard consonant
 25 |         testing_text = 'пять'
 26 |         testing_a_text = 'пя+ть'
 27 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 28 |         ru_transcript.transcribe()
 29 |         print(testing_text, ru_transcript.get_allophones())
 30 |         self.assertEqual(['pʲ', 'æ', 'tʲ'], ru_transcript.get_allophones())
 31 | 
 32 |     def test_vowel_a_4(self):  # stressed, after hushing sibilants and 'ц'
 33 |         testing_text = 'цапнуть'
 34 |         testing_a_text = 'ца+пнуть'
 35 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 36 |         ru_transcript.transcribe()
 37 |         print(testing_text, ru_transcript.get_allophones())
 38 |         self.assertEqual(['t͡s', 'ɐ.', 'p', 'nʷ', 'ʊ', 'tʲ'], ru_transcript.get_allophones())
 39 | 
 40 |     def test_vowel_a_5(self):  # pretonic, after a hard consonant or word-initial
 41 |         testing_text = 'паром'
 42 |         testing_a_text = 'паро+м'
 43 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 44 |         ru_transcript.transcribe()
 45 |         print(testing_text, ru_transcript.get_allophones())
 46 |         self.assertEqual(['p', 'ɐ', 'rʷ', 'o', 'm'], ru_transcript.get_allophones())
 47 | 
 48 |     def test_vowel_a_6(self):  # pretonic, not after a hard consonant
 49 |         testing_text = 'тяжёлый'
 50 |         testing_a_text = 'тяжё+лый'
 51 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 52 |         ru_transcript.transcribe()
 53 |         print(testing_text, ru_transcript.get_allophones())
 54 |         self.assertEqual(['tʲ', 'ɪ', 'ʐ', 'ɐ.', 'lˠ', 'ᵻ', 'j'], ru_transcript.get_allophones())
 55 | 
 56 |     def test_vowel_a_7(self):  # pretonic, after hushing sibilants and 'ц'
 57 |         testing_text = 'жалеть'
 58 |         testing_a_text = 'жале+ть'
 59 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 60 |         ru_transcript.transcribe()
 61 |         print(testing_text, ru_transcript.get_allophones())
 62 |         self.assertEqual(['ʐˠ', 'ᵻ', 'lʲ', 'e', 'tʲ'], ru_transcript.get_allophones())
 63 | 
 64 |     def test_vowel_a_8(self):  # 2nd pretonic or post-tonic, after a hard consonant or word-initial
 65 |         testing_text = 'акварель'
 66 |         testing_a_text = 'акваре+ль'
 67 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 68 |         ru_transcript.transcribe()
 69 |         print(testing_text, ru_transcript.get_allophones())
 70 |         self.assertEqual(['ə', 'k', 'v', 'ɐ', 'rʲ', 'e', 'lʲ'], ru_transcript.get_allophones())
 71 | 
 72 |     def test_vowel_a_9(self):  # 2nd pretonic or post-tonic, after a hard consonant, in the final syllable
 73 |         testing_text = 'собака'
 74 |         testing_a_text = 'соба+ка'
 75 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 76 |         ru_transcript.transcribe()
 77 |         print(testing_text, ru_transcript.get_allophones())
 78 |         self.assertEqual(['s', 'ɐ', 'b', 'a', 'k', 'ʌ'], ru_transcript.get_allophones())
 79 | 
 80 |     def test_vowel_a_10(self):  # 2nd pretonic or post-tonic, not after a hard consonant, not in the ending
 81 |         testing_text = 'тяжеленный'
 82 |         testing_a_text = 'тяжеле+нный'
 83 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 84 |         ru_transcript.transcribe()
 85 |         print(testing_text, ru_transcript.get_allophones())
 86 |         self.assertEqual(['tʲ', 'ɪ.', 'ʐ', 'ə', 'lʲ', 'e', 'nːˠ', 'ᵻ', 'j'], ru_transcript.get_allophones())
 87 | 
 88 |     def test_vowel_a_11(self):  # 2nd pretonic or post-tonic, not after a hard consonant (only in the ending)
 89 |         testing_text = 'гуляя'
 90 |         testing_a_text = 'гуля+я'
 91 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
 92 |         ru_transcript.transcribe()
 93 |         print(testing_text, ru_transcript.get_allophones())
 94 |         self.assertEqual(['ɡʷ', 'ʊ', 'lʲ', 'æ', 'j', 'æ.'], ru_transcript.get_allophones())
 95 | 
 96 |     def test_vowel_a_12(self):  # 2nd pretonic or post-tonic, after hushing sibilants and 'ц'
 97 |         testing_text = 'дача'
 98 |         testing_a_text = 'да+ча'
 99 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
100 |         ru_transcript.transcribe()
101 |         print(testing_text, ru_transcript.get_allophones())
102 |         self.assertEqual(['d', 'a', 't͡ɕ', 'ə'], ru_transcript.get_allophones())
103 | 
104 |     def test_vowel_o_1(self):  # stressed, after a hard consonant or word-initial
105 |         testing_text = 'облако'
106 |         testing_a_text = 'о+блако'
107 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
108 |         ru_transcript.transcribe()
109 |         print(testing_text, ru_transcript.get_allophones())
110 |         self.assertEqual(['o', 'b', 'l', 'ə', 'k', 'ʌ'], ru_transcript.get_allophones())
111 | 
112 |     def test_vowel_o_2(self):  # stressed, not after a hard consonant
113 |         testing_text = 'тётя'
114 |         testing_a_text = 'тё+тя'
115 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
116 |         ru_transcript.transcribe()
117 |         print(testing_text, ru_transcript.get_allophones())
118 |         self.assertEqual(['tᶣ', 'ɵ', 'tʲ', 'æ.'], ru_transcript.get_allophones())
119 | 
120 |     def test_vowel_o_3(self):  # stressed, after hushing sibilants and 'ц'
121 |         testing_text = 'цокать'
122 |         testing_a_text = 'цо+кать'
123 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
124 |         ru_transcript.transcribe()
125 |         print(testing_text, ru_transcript.get_allophones())
126 |         self.assertEqual(['t͡s', 'ɐ.', 'k', 'ə', 'tʲ'], ru_transcript.get_allophones())
127 | 
128 |     def test_vowel_o_4(self):  # pretonic, after a hard consonant or word-initial
129 |         testing_text = 'стопа'
130 |         testing_a_text = 'стопа+'
131 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
132 |         ru_transcript.transcribe()
133 |         print(testing_text, ru_transcript.get_allophones())
134 |         self.assertEqual(['s', 't', 'ɐ', 'p', 'a'], ru_transcript.get_allophones())
135 | 
136 |     def test_vowel_o_5(self):  # pretonic, not after a hard consonant
137 |         testing_text = 'йодированный'
138 |         testing_a_text = 'йоди+рованный'
139 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
140 |         ru_transcript.transcribe()
141 |         print(testing_text, ru_transcript.get_allophones())
142 |         self.assertEqual(['ʝ', 'ɪ', 'dʲ', 'i', 'r', 'ə', 'v', 'ə', 'nːˠ', 'ᵻ', 'j'], ru_transcript.get_allophones())
143 | 
144 |     def test_vowel_o_6(self):  # pretonic, after hushing sibilants and 'ц'
145 |         testing_text = 'шокировать'
146 |         testing_a_text = 'шоки+ровать'
147 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
148 |         ru_transcript.transcribe()
149 |         print(testing_text, ru_transcript.get_allophones())
150 |         self.assertEqual(['ʂˠ', 'ᵻ', 'kʲ', 'i', 'r', 'ə', 'v', 'ə', 'tʲ'], ru_transcript.get_allophones())
151 | 
152 |     def test_vowel_o_7(self):  # 2nd pretonic or post-tonic, after a hard consonant or syllable-initial
153 |         testing_text = 'молоко'
154 |         testing_a_text = 'молоко+'
155 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
156 |         ru_transcript.transcribe()
157 |         print(testing_text, ru_transcript.get_allophones())
158 |         self.assertEqual(['m', 'ə', 'l', 'ɐ', 'kʷ', 'o'], ru_transcript.get_allophones())
159 | 
160 |     def test_vowel_o_8(self):  # 2nd pretonic or post-tonic, after a hard consonant, in the final syllable
161 |         testing_text = 'озеро'
162 |         testing_a_text = 'о+зеро'
163 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
164 |         ru_transcript.transcribe()
165 |         print(testing_text, ru_transcript.get_allophones())
166 |         self.assertEqual(['o', 'zʲ', 'ɪ.', 'r', 'ʌ'], ru_transcript.get_allophones())
167 | 
168 |     def test_vowel_o_9(self):  # 2nd pretonic or post-tonic, not after a hard consonant
169 |         testing_text = 'огайо'
170 |         testing_a_text = 'ога+йо'
171 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
172 |         ru_transcript.transcribe()
173 |         print(testing_text, ru_transcript.get_allophones())
174 |         self.assertEqual(['ɐ', 'ɡ', 'a', 'j', 'æ.'], ru_transcript.get_allophones())
175 | 
176 |     def test_vowel_o_10(self):  # 2nd pretonic or post-tonic, after hushing sibilants and 'ц'
177 |         testing_text = 'шоколад'
178 |         testing_a_text = 'шокола+д'
179 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
180 |         ru_transcript.transcribe()
181 |         print(testing_text, ru_transcript.get_allophones())
182 |         self.assertEqual(['ʂ', 'ə', 'k', 'ɐ', 'l', 'a', 't'], ru_transcript.get_allophones())
183 | 
184 |     def test_vowel_e_1(self):  # stressed, after a hard consonant or word-initial
185 |         testing_text = 'это'
186 |         testing_a_text = 'э+то'
187 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
188 |         ru_transcript.transcribe()
189 |         print(testing_text, ru_transcript.get_allophones())
190 |         self.assertEqual(['ɛ', 't', 'ʌ'], ru_transcript.get_allophones())
191 | 
192 |     def test_vowel_e_2(self):  # stressed, not after a hard consonant
193 |         testing_text = 'пень'
194 |         testing_a_text = 'пе+нь'
195 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
196 |         ru_transcript.transcribe()
197 |         print(testing_text, ru_transcript.get_allophones())
198 |         self.assertEqual(['pʲ', 'e', 'nʲ'], ru_transcript.get_allophones())
199 | 
200 |     def test_vowel_e_3(self):  # stressed, after hushing sibilants and 'ц'
201 |         testing_text = 'шест'
202 |         testing_a_text = 'ше+ст'
203 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
204 |         ru_transcript.transcribe()
205 |         print(testing_text, ru_transcript.get_allophones())
206 |         self.assertEqual(['ʂˠ', 'ᵻ', 's', 't'], ru_transcript.get_allophones())
207 | 
208 |     def test_vowel_e_4(self):  # pretonic, after a hard consonant or word-initial (ыэ)
209 |         testing_text = 'этап'
210 |         testing_a_text = 'эта+п'
211 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
212 |         ru_transcript.transcribe()
213 |         print(testing_text, ru_transcript.get_allophones())
214 |         self.assertEqual(['ᵻ', 't', 'a', 'p'], ru_transcript.get_allophones())
215 | 
216 |     def test_vowel_e_5(self):  # pretonic, not after a hard consonant and not word-initial
217 |         testing_text = 'велюр'
218 |         testing_a_text = 'велю+р'
219 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
220 |         ru_transcript.transcribe()
221 |         print(testing_text, ru_transcript.get_allophones())
222 |         self.assertEqual(['vʲ', 'ɪ', 'lᶣ', 'ʉ', 'r̥'], ru_transcript.get_allophones())
223 | 
224 |     def test_vowel_e_6(self):  # 2nd pretonic or post-tonic, not after a hard consonant or word-initial
225 |         testing_text = 'пепел'
226 |         testing_a_text = 'пе+пел'
227 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
228 |         ru_transcript.transcribe()
229 |         print(testing_text, ru_transcript.get_allophones())
230 |         self.assertEqual(['pʲ', 'e', 'pʲ', 'ɪ.', 'l'], ru_transcript.get_allophones())
231 | 
232 |     def test_vowel_e_7(self):  # pretonic, 2nd pretonic or post-tonic, after hushing sibilants and 'ц'
233 |         testing_text = 'шелестеть'
234 |         testing_a_text = 'шелесте+ть'
235 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
236 |         ru_transcript.transcribe()
237 |         print(testing_text, ru_transcript.get_allophones())
238 |         self.assertEqual(['ʂ', 'ə', 'lʲ', 'ɪ', 'sʲ', 'tʲ', 'e', 'tʲ'], ru_transcript.get_allophones())
239 | 
240 |     def test_vowel_u_1(self):  # stressed, after a hard consonant
241 |         testing_text = 'пуля'
242 |         testing_a_text = 'пу+ля'
243 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
244 |         ru_transcript.transcribe()
245 |         print(testing_text, ru_transcript.get_allophones())
246 |         self.assertEqual(['pʷ', 'u', 'lʲ', 'æ.'], ru_transcript.get_allophones())
247 | 
248 |     def test_vowel_u_2(self):  # stressed, after a soft consonant
249 |         testing_text = 'чуть'
250 |         testing_a_text = 'чу+ть'
251 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
252 |         ru_transcript.transcribe()
253 |         print(testing_text, ru_transcript.get_allophones())
254 |         self.assertEqual(['t͡ɕᶣ', 'ʉ', 'tʲ'], ru_transcript.get_allophones())
255 | 
256 |     def test_vowel_u_3(self):  # pretonic, after a hard consonant
257 |         testing_text = 'мужчина'
258 |         testing_a_text = 'мужчи+на'
259 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
260 |         ru_transcript.transcribe()
261 |         print(testing_text, ru_transcript.get_allophones())
262 |         self.assertEqual(['mʷ', 'ʊ', 'ɕː', 'i', 'n', 'ʌ'], ru_transcript.get_allophones())
263 | 
264 |     def test_vowel_u_4(self):  # pretonic, not after a hard consonant
265 |         testing_text = 'ютиться'
266 |         testing_a_text = 'юти+ться'
267 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
268 |         ru_transcript.transcribe()
269 |         print(testing_text, ru_transcript.get_allophones())
270 |         self.assertEqual(['ʝᶣ', 'ᵿ', 'tʲ', 'i', 't͡s', 'ə'], ru_transcript.get_allophones())
271 | 
272 |     def test_vowel_u_5(self):  # 2nd pretonic or post-tonic, after a hard consonant
273 |         testing_text = 'музыкальный'
274 |         testing_a_text = 'музыка+льный'
275 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
276 |         ru_transcript.transcribe()
277 |         print(testing_text, ru_transcript.get_allophones())
278 |         self.assertEqual(['mʷ', 'ʊ', 'zˠ', 'ᵻ', 'k', 'a', 'lʲ', 'nˠ', 'ᵻ', 'j'], ru_transcript.get_allophones())
279 | 
280 |     def test_vowel_u_6(self):  # 2nd pretonic or post-tonic, not after a hard consonant
281 |         testing_text = 'чумовой'
282 |         testing_a_text = 'чумово+й'
283 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
284 |         ru_transcript.transcribe()
285 |         print(testing_text, ru_transcript.get_allophones())
286 |         self.assertEqual(['t͡ɕᶣ', 'ᵿ', 'm', 'ɐ', 'vʷ', 'o', 'j'], ru_transcript.get_allophones())
287 | 
288 |     def test_vowel_i_1(self):  # stressed, before a soft consonant
289 |         testing_text = 'синего'
290 |         testing_a_text = 'си+него'
291 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
292 |         ru_transcript.transcribe()
293 |         print(testing_text, ru_transcript.get_allophones())
294 |         self.assertEqual(['sʲ', 'i', 'nʲ', 'ɪ.', 'v', 'ʌ'], ru_transcript.get_allophones())
295 | 
296 |     def test_vowel_i_2(self):  # stressed, pretonic, 2nd pretonic or post-tonic, after hushing sibilants and 'ц'
297 |         testing_text = 'жизнь'
298 |         testing_a_text = 'жи+знь'
299 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
300 |         ru_transcript.transcribe()
301 |         print(testing_text, ru_transcript.get_allophones())
302 |         self.assertEqual(['ʐˠ', 'ɨ', 'zʲ', 'nʲ'], ru_transcript.get_allophones())
303 | 
304 |     def test_vowel_i_3(self):  # pretonic, 2nd pretonic or post-tonic, not after a vowel or word-initial
305 |         testing_text = 'синица'
306 |         testing_a_text = 'сини+ца'
307 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
308 |         ru_transcript.transcribe()
309 |         print(testing_text, ru_transcript.get_allophones())
310 |         self.assertEqual(['sʲ', 'ɪ', 'nʲ', 'i', 't͡s', 'ə'], ru_transcript.get_allophones())
311 | 
312 |     def test_vowel_ii_1(self):  # stressed, after a hard consonant
313 |         testing_text = 'ты'
314 |         testing_a_text = 'ты+'
315 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
316 |         ru_transcript.transcribe()
317 |         print(testing_text, ru_transcript.get_allophones())
318 |         self.assertEqual(['tˠ', 'ɨ'], ru_transcript.get_allophones())
319 | 
320 |     def test_vowel_ii_2(self):  # stressed, between coronal and velar consonants
321 |         testing_text = 'тыкать'
322 |         testing_a_text = 'ты+кать'
323 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
324 |         ru_transcript.transcribe()
325 |         print(testing_text, ru_transcript.get_allophones())
326 |         self.assertEqual(['tˠ', 'ɨ̟', 'k', 'ə', 'tʲ'], ru_transcript.get_allophones())
327 | 
328 |     def test_vowel_ii_3(self):  # stressed, after a labial consonant + 'л' cluster
329 |         testing_text = 'плыть'
330 |         testing_a_text = 'плы+ть'
331 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
332 |         ru_transcript.transcribe()
333 |         print(testing_text, ru_transcript.get_allophones())
334 |         self.assertEqual(['p', 'lˠ', 'ɯ̟ɨ̟', 'tʲ'], ru_transcript.get_allophones())
335 | 
336 |     def test_vowel_ii_4(self):  # pretonic, 2nd pretonic or post-tonic, not after 'ц'
337 |         testing_text = 'чтобы'
338 |         testing_a_text = 'что+бы'
339 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
340 |         ru_transcript.transcribe()
341 |         print(testing_text, ru_transcript.get_allophones())
342 |         self.assertEqual(['ʂ', 'tʷ', 'o', 'bˠ', 'ᵻ'], ru_transcript.get_allophones())
343 | 
344 |     def test_vowel_ii_5(self):  # pretonic, 2nd pretonic or post-tonic, after 'ц'
345 |         testing_text = 'танцы'
346 |         testing_a_text = 'та+нцы'
347 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
348 |         ru_transcript.transcribe()
349 |         print(testing_text, ru_transcript.get_allophones())
350 |         self.assertEqual(['t', 'a', 'n', 't͡s', 'ə'], ru_transcript.get_allophones())
351 | 
352 |     def test_jotised_1(self):
353 |         testing_text = 'бульон'
354 |         testing_a_text = 'бульо+н'
355 |         ru_transcript = RuTranscript(testing_text, testing_a_text)
356 |         ru_transcript.transcribe()
357 |         print(testing_text, ru_transcript.get_allophones())
358 |         self.assertEqual(['bʷ', 'ʊ', 'lʲ', 'jᶣ', 'ɵ', 'n'], ru_transcript.get_allophones())
359 | 
360 | 
361 | if __name__ == '__main__':
362 |     unittest.main()
363 | 
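
Note on stress marking: the tests in test_vowels.py put `+` after the stressed vowel (`'трава+'`), while the phrase tests in test_phrases.py put it before the vowel (`'Башк+ирия'`) and pass `stress_place='before'`. A rough sketch of converting the second notation into the first, assuming `+` always directly precedes a vowel letter. `stress_before_to_after` and `RUS_VOWELS` are hypothetical names, not library API:

```python
# Hypothetical helper for illustration only; not part of RuTranscript.
RUS_VOWELS = set('аеёиоуыэюя')


def stress_before_to_after(text: str, mark: str = '+') -> str:
    """Move each stress mark from before its vowel to after it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # A mark followed by a vowel letter: emit the vowel first, then the mark.
        if ch == mark and i + 1 < len(text) and text[i + 1].lower() in RUS_VOWELS:
            out.append(text[i + 1])
            out.append(mark)
            i += 2
        else:
            # Any other character (including a stray mark) is copied unchanged.
            out.append(ch)
            i += 1
    return ''.join(out)


print(stress_before_to_after('Башк+ирия, Уф+а'))
```

A mark that is not followed by a vowel is left where it is, so malformed input passes through unchanged rather than raising an error.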


--------------------------------------------------------------------------------
/src/tools/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/suralmasha/RuTranscript/30cbc40c5ac368021bcc8a05002fa33cc50ee9b6/src/tools/__init__.py


--------------------------------------------------------------------------------
/src/tools/allophones_tools.py:
--------------------------------------------------------------------------------
  1 | import spacy
  2 | 
  3 | from .sounds import allophones, rus_v
  4 | 
  5 | nlp = spacy.load('ru_core_news_sm', disable=["tagger", "morphologizer", "attribute_ruler"])
  6 | 
  7 | 
  8 | def get_allophone_info(allophone):
  9 |     return allophones[allophone]
 10 | 
 11 | 
 12 | def shch(section: list):
 13 |     section_copy = section.copy()
 14 |     for i, current_phon in enumerate(section_copy[:-1]):
 15 |         try:
 16 |             next_phon = section_copy[i + 1]
 17 |         except IndexError:
 18 |             next_phon = ''
 19 |         try:
 20 |             two_current = (section_copy[i], section_copy[i + 1])
 21 |         except IndexError:
 22 |             two_current = ''
 23 | 
 24 |         next_allophone = allophones[next_phon]
 25 |         if ((current_phon == 'ʐ') and (next_allophone.get('voice', '') == 'voiceless') and (next_phon != 's')) \
 26 |                 or (two_current in {('s', 't͡ɕ'), ('z', 't͡ɕ'), ('ʐ', 't͡ɕ')}):
 27 |             section_copy[i] = 'ɕː'
 28 |             del section_copy[i + 1]
 29 | 
 30 |     return section_copy
 31 | 
 32 | 
 33 | def long_ge(section: list):
 34 |     section_copy = section.copy()
 35 |     for i, current_phon in enumerate(section_copy[:-1]):
 36 |         try:
 37 |             next_phon = section_copy[i + 1]
 38 |         except IndexError:
 39 |             next_phon = ''
 40 |         try:
 41 |             two_current = (section_copy[i], section_copy[i + 1])
 42 |         except IndexError:
 43 |             two_current = ''
 44 | 
 45 |         next_allophone = allophones[next_phon]
 46 |         if two_current in [('ʐ', 'ʐ'), ('z', 'ʐ')]:
 47 |             section_copy[i] = 'ʑː'
 48 |             del section_copy[i + 1]
 49 |         elif (current_phon == 'ɕː') and (next_allophone.get('voice', '') == 'voiced') \
 50 |                 and ('nasal' not in next_allophone.get('manner', '')):
 51 |             section_copy[i] = 'ʑː'
 52 | 
 53 |     return section_copy
 54 | 
 55 | 
 56 | def nasal_m_n(section: list):
 57 |     section_copy = section.copy()
 58 |     for i, current_phon in enumerate(section_copy[:-1]):
 59 |         try:
 60 |             if allophones[section_copy[i + 1]].get('place', '') != 'labial, labiodental':
 61 |                 continue
 62 |         except IndexError:
 63 |             break
 64 | 
 65 |         if current_phon in ['m', 'n']:
 66 |             section_copy[i] = 'ɱ'
 67 |         elif current_phon in ['mʲ', 'nʲ']:
 68 |             section_copy[i] = 'ɱʲ'
 69 | 
 70 |     return section_copy
 71 | 
 72 | 
 73 | def silent_r(section: list):
 74 |     section_copy = section.copy()
 75 |     for i, current_phon in enumerate(section_copy):
 76 |         try:
 77 |             if (i < len(section_copy) - 1) and (allophones[section_copy[i + 1]].get('voice', '') != 'voiceless'):
 78 |                 continue
 79 |         except IndexError:
 80 |             break
 81 | 
 82 |         if current_phon == 'r':
 83 |             section_copy[i] = 'r̥'
 84 |         elif current_phon == 'rʲ':
 85 |             section_copy[i] = 'r̥ʲ'
 86 | 
 87 |     return section_copy
 88 | 
 89 | 
 90 | def voiced_ts(section: list):
 91 |     section_copy = section.copy()
 92 |     for i, current_phon in enumerate(section_copy):
 93 |         try:
 94 |             if allophones[section_copy[i + 1]].get('voice', '') != 'voiced':
 95 |                 continue
 96 |         except IndexError:
 97 |             break
 98 | 
 99 |         if current_phon == 't͡s':
100 |             section_copy[i] = 'd̻͡z̪'
101 | 
102 |     return section_copy
103 | 
104 | 
105 | def first_jot(phonemes_list_section):
106 |     phonemes_list_section_copy = phonemes_list_section.copy()
107 |     if phonemes_list_section_copy[0] == 'j':
108 |         phonemes_list_section_copy[0] = 'ʝ'
109 | 
110 |     return phonemes_list_section_copy
111 | 
112 | 
113 | def fix_jotised(phonemes_list_section, letters_list_section):
114 |     phonemes_list_section_copy = phonemes_list_section.copy()
115 |     # ---- jotised vowels and i ----
116 |     phonemes_list_to_iterate = phonemes_list_section_copy[:]
117 |     letters_list_to_iterate = letters_list_section[:]
118 | 
119 |     for i, let in enumerate(letters_list_to_iterate):
120 |         try:
121 |             next_let = letters_list_to_iterate[i + 1]
122 |         except IndexError:
123 |             next_let = ''
124 |         if (let == 'д') and (next_let == 'ж'):
125 |             del letters_list_to_iterate[i + 1]
126 |             letters_list_to_iterate[i] = 'дж'
127 |         elif next_let in ['ь', 'ъ']:
128 |             del letters_list_to_iterate[i + 1]
129 |             letters_list_to_iterate[i] = letters_list_to_iterate[i] + next_let
130 | 
131 |     n = 0
132 |     for i, current_phon in enumerate(phonemes_list_to_iterate):
133 |         sub_symb = False
134 |         current_allophone = allophones[current_phon]
135 |         if current_allophone['phon'] == 'symb':
136 |             continue
137 |         if current_phon == 'j' and letters_list_to_iterate[i] != 'й':
138 |             letters_list_to_iterate.insert(i, 'й')
139 |         current_let = letters_list_to_iterate[i]
140 |         try:
141 |             if allophones[phonemes_list_to_iterate[i - 1]]['phon'] != 'symb':
142 |                 previous_let = letters_list_to_iterate[i - 1]
143 |                 previous_phon = phonemes_list_to_iterate[i - 1]
144 |             else:
145 |                 previous_let = letters_list_to_iterate[i - 2]
146 |                 previous_phon = phonemes_list_to_iterate[i - 2]
147 |                 sub_symb = True
148 |         except IndexError:
149 |             previous_let = ''
150 |             previous_phon = ''
151 |         try:
152 |             next_let = letters_list_to_iterate[i + 1]
153 |         except IndexError:
154 |             next_let = ''
155 |         try:
156 |             after_next_let = letters_list_to_iterate[i + 2]
157 |         except IndexError:
158 |             after_next_let = ''
159 | 
160 |         previous_allophone = allophones[previous_phon]
161 |         if (current_let == 'о') and (previous_let[-1] == 'ь') and (next_let == '+'):
162 |             phonemes_list_section_copy.insert(i + n, 'j')
163 |             n += 1
164 | 
165 |         elif current_let in 'ё е я ю'.split():
166 |             if previous_let[-1] in ['ь', 'ъ']:
167 |                 if (previous_allophone['phon'] == 'C')\
168 |                         and ('ʲ' not in previous_phon)\
169 |                         and (previous_allophone['palatalization'][0] != 'a'):
170 |                     phonemes_list_section_copy[i + n - 1 - sub_symb] = previous_phon + 'ʲ'
171 |                 phonemes_list_section_copy.insert(i + n, 'j')
172 |                 n += 1
173 | 
174 |             elif previous_let in rus_v:
175 |                 phonemes_list_section_copy.insert(i + n, 'j')
176 |                 n += 1
177 | 
178 |             elif (current_let != 'э') \
179 |                     and (previous_allophone['phon'] == 'C') \
180 |                     and ('ʲ' not in previous_phon) \
181 |                     and ('a' not in previous_allophone['palatalization'][0]):
182 |                 phonemes_list_section_copy[i + n - 1 - sub_symb] = previous_phon + 'ʲ'
183 | 
184 |             elif (after_next_let == '+') and (previous_phon != 'j'):
185 |                 phonemes_list_section_copy.insert(i + n, 'j')
186 |                 n += 1
187 | 
188 |         elif current_let == 'и':
189 |             if sub_symb and (phonemes_list_to_iterate[i - 1] == '_') and (previous_allophone['phon'] == 'C'):
190 |                 phonemes_list_section_copy[i + n] = 'ɨ'
191 | 
192 |             elif previous_let[-1] in {'ь', 'ъ'}:
193 |                 phonemes_list_section_copy.insert(i + n, 'j')
194 |                 n += 1
195 | 
196 |             elif (previous_allophone['phon'] == 'C') \
197 |                     and ('ʲ' not in previous_phon) \
198 |                     and (previous_allophone['palatalization'][0] != 'a'):
199 |                 phonemes_list_section_copy[i + n - 1 - sub_symb] = previous_phon + 'ʲ'
200 | 
201 |     return phonemes_list_section_copy
202 | 
203 | 
204 | def assimilative_palatalization(tokens_section, phonemes_list_section):
205 |     phonemes_list_section_copy = phonemes_list_section.copy()
206 |     exceptions = 'сосиска злить после ёлка день транскрипция джаз неуклюжий шахтёр'.split()
207 | 
208 |     token_index = 0
209 |     token = tokens_section[token_index]
210 |     nlp_token = nlp(token)[0]
211 |     lemma = nlp_token.lemma_
212 | 
213 |     for i, current_phon in enumerate(phonemes_list_section_copy):
214 |         if current_phon == '_':
215 |             token_index += 1
216 |             token = tokens_section[token_index]
217 |             nlp_token = nlp(token)[0]
218 |             lemma = nlp_token.lemma_
219 | 
220 |         current_allophone = allophones[current_phon]
221 |         if (lemma not in exceptions) and ('i+zm' not in token):
222 |             try:
223 |                 n = 1
224 |                 next_phon = phonemes_list_section_copy[i + n]
225 |                 next_allophone = allophones[next_phon]
226 |                 while next_allophone['phon'] == 'symb':
227 |                     n += 1
228 |                     next_phon = phonemes_list_section_copy[i + n]
229 |                     next_allophone = allophones[next_phon]
230 |             except IndexError:
231 |                 next_phon = ''
232 |                 next_allophone = allophones[next_phon]
233 | 
234 |             # no softening before [л] (для, глина, длинный, блин, злиться, влить, тлеть)
235 |             if 'l' in next_phon:
236 |                 continue
237 | 
238 |             # dentals are predominantly not softened before soft labiodentals ([д’в’]е́рь - [дв’]е́рь)
239 |             elif (current_allophone.get('place', '') == 'lingual, dental') \
240 |                     and (next_allophone.get('place', '') == 'labial, labiodental'):
241 |                 continue
242 | 
243 |             # labials are predominantly not softened before soft labials (лю[б’в’]и́ - лю[бв’]и́)
244 |             elif (current_allophone.get('place', '') == 'labial, bilabial')\
245 |                     and (next_allophone.get('place', '') == 'labial, bilabial') and lemma != 'лобби':
246 |                 continue
247 | 
248 |             # labials and dentals are not softened before soft velars (гри[пк’]и́; ко́[фт’]е)
249 |             elif (current_allophone.get('place', '') in ['lingual, dental', 'labial, bilabial'])\
250 |                     and (next_allophone.get('place', '') == 'lingual, velar'):
251 |                 continue
252 | 
253 |             # [р] and [г] are not softened before soft consonants (а[рт’]и́ст, а[гн’]ия)
254 |             elif ('r' in current_phon) or ('ɡ' in current_phon):
255 |                 continue
256 | 
257 |             # [т], [з], [к] are not softened before [р] ([тр’]и́, тряска, зрелый, транскрипция)
258 |             elif ('t' in current_phon or 'z' in current_phon or 'k' in current_phon)\
259 |                     and (next_phon in {'rʲ', 'rʲː', 'r̥ʲ'}):
260 |                 continue
261 | 
262 |             elif (current_allophone['phon'] == 'C') and (current_allophone.get('palatalization', ' ')[0] != 'a')\
263 |                     and ('ʲ' not in current_phon) and ('soft' in next_allophone.get('palatalization', '')):
264 |                 phonemes_list_section_copy[i] = current_phon + 'ʲ'
265 | 
266 |     return phonemes_list_section_copy
267 | 
268 | 
269 | def long_consonants(phonemes_list_section):
270 |     n = 0
271 |     phonemes_list_to_iterate = phonemes_list_section[:]
272 |     for i, current_phon in enumerate(phonemes_list_to_iterate):
273 |         add_symb = False
274 |         try:
275 |             if allophones[phonemes_list_to_iterate[i + 1]]['phon'] != 'symb':
276 |                 next_phon = phonemes_list_to_iterate[i + 1]
277 |             else:
278 |                 next_phon = phonemes_list_to_iterate[i + 2]
279 |                 add_symb = True
280 |         except IndexError:
281 |             next_phon = ''
282 | 
283 |         if (current_phon[0] in 'ʂbpfkstrlmngdz') and (current_phon == next_phon):
284 |             del phonemes_list_section[i + n]
285 |             del phonemes_list_section[i + n + add_symb]
286 |             phonemes_list_section.insert(i + n, current_phon + 'ː')
287 |             n -= 1
288 | 
289 |     return phonemes_list_section
290 | 
291 | 
292 | ts = {'t͡s', 't͡sʷ', 't͡sˠ', 'd͡ʒᶣ', 'd͡ʒˠ', 'd̻͡z̪', 'd͡ʒ'}
293 | zh_sh_ts = {'ʐ', 'ʐʷ', 'ʐˠ', 'ʑː', 'ʑːʷ', 'ʑːˠ', 'ʑʲː', 'ʑːᶣ',
294 |             'ʂ', 'ʂʷ', 'ʂˠ', 'ʂː', 'ʂːʷ', 'ʂːˠ',
295 |             't͡s', 't͡sʷ', 't͡sˠ', 'd͡ʒᶣ', 'd͡ʒˠ', 'd̻͡z̪', 'd͡ʒ'}
296 | 
297 | 
298 | def stunning(segment: list):
299 |     segment_copy = segment.copy()
300 |     for i, current_phon in enumerate(segment_copy):
301 |         try:
302 |             if (i < len(segment_copy) - 1) and (segment_copy[i + 1] != '_'):
303 |                 continue
304 |         except IndexError:
305 |             break
306 |         try:
307 |             if (i < len(segment_copy) - 1) and ((allophones[segment_copy[i + 2]].get('voice', '') == 'voiced')
308 |                                            or (allophones[segment_copy[i + 2]]['phon'] == 'V')):
309 |                 continue
310 |         except IndexError:
311 |             break
312 | 
313 |         allophone_info = allophones[current_phon]
314 |         pair = allophone_info.get('pair', None)
315 |         if (allophone_info.get('voice', '') == 'voiced') and (pair is not None):
316 |             segment_copy[i] = pair
317 | 
318 |     return segment_copy
319 | 
320 | 
321 | def vowels(section: list):
322 |     section_copy = section.copy()
323 |     for i, current_phon in enumerate(section_copy):
324 |         try:
325 |             next_phon = section_copy[i + 1]
326 |         except IndexError:
327 |             next_phon = ''
328 |         try:
329 |             after_next_phon = section_copy[i + 2]
330 |         except IndexError:
331 |             after_next_phon = ''
332 |         try:
333 |             previous_phon = section_copy[i - 1]
334 |         except IndexError:
335 |             previous_phon = ''
336 |         try:
337 |             after_previous_phon = section_copy[i - 2]
338 |         except IndexError:
339 |             after_previous_phon = ''
340 | 
341 |         previous_allophone = allophones[previous_phon]
342 |         after_next_allophone = allophones[after_next_phon]
343 |         after_previous_allophone = allophones[after_previous_phon]
344 |         if current_phon == 'a':
345 |             if (i != len(section_copy) - 1) and (next_phon != '_') \
346 |                     and (i != 0) and (previous_phon != '_'):  # not last, not first
347 | 
348 |                 if next_phon == '+':  # stressed (not last, not first)
349 |                     if previous_phon in zh_sh_ts:
350 |                         section_copy[i] = 'ɐ.'
351 |                     elif ('hard' in previous_allophone.get('palatalization', '')) and (after_next_phon == 'l'):
352 |                         section_copy[i] = 'ɑ'
353 |                     elif 'hard' in previous_allophone.get('palatalization', ''):
354 |                         section_copy[i] = 'a'
355 |                     else:
356 |                         section_copy[i] = 'æ'
357 | 
358 |                 elif next_phon == '-':  # first pretonic (not last, not first)
359 |                     if previous_phon in zh_sh_ts:
360 |                         section_copy[i] = 'ᵻ'
361 |                     elif (previous_allophone['phon'] == 'C') and ('hard' in previous_allophone['palatalization']):
362 |                         section_copy[i] = 'ɐ'
363 |                     else:
364 |                         section_copy[i] = 'ɪ'
365 | 
366 |                 else:  # post-tonic / second pretonic (not last, not first)
367 |                     if ((previous_allophone.get('hissing', '')) == 'hissing' or (previous_phon in ts)
368 |                         or ('hard' in previous_allophone.get('palatalization', ''))) \
369 |                             or (previous_allophone['phon'] == 'V'):
370 |                         section_copy[i] = 'ə'
371 |                     else:
372 |                         section_copy[i] = 'ɪ.'
373 | 
374 |             elif (i == len(section_copy) - 1) or (next_phon == '_'):  # post-tonic (last)
375 |                 if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):
376 |                     section_copy[i] = 'ə'
377 |                 elif 'hard' in previous_allophone.get('palatalization', ''):
378 |                     section_copy[i] = 'ʌ'
379 |                 else:
380 |                     section_copy[i] = 'æ.'
381 | 
382 |             else:
383 |                 if next_phon == '-':
384 |                     section_copy[i] = 'ɐ'  # first pretonic (first)
385 |                 elif next_phon != '+':
386 |                     section_copy[i] = 'ə'  # post-tonic / second pretonic (first)
387 | 
388 |         elif current_phon == 'o':
389 |             if (i != len(section_copy) - 1) and (next_phon != '_') \
390 |                     and (i != 0) and (previous_phon != '_'):  # not last, not first
391 | 
392 |                 if next_phon == '+':  # stressed (not last, not first)
393 |                     if previous_phon in zh_sh_ts:
394 |                         section_copy[i] = 'ɐ.'
395 |                     elif ('soft' in previous_allophone.get('palatalization', '')) \
396 |                             or (previous_allophone['phon'] == 'V'):
397 |                         section_copy[i] = 'ɵ'
398 | 
399 |                 elif next_phon == '-':  # first pretonic (not last, not first)
400 |                     if previous_phon in zh_sh_ts:
401 |                         section_copy[i] = 'ᵻ'
402 |                     elif 'hard' in previous_allophone.get('palatalization', ''):
403 |                         section_copy[i] = 'ɐ'
404 |                     else:
405 |                         section_copy[i] = 'ɪ'
406 | 
407 |                 else:  # post-tonic / second pretonic (not last, not first)
408 |                     if ((previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts)
409 |                         or ('hard' in previous_allophone.get('palatalization', ''))) \
410 |                             or (previous_allophone['phon'] == 'V'):
411 |                         section_copy[i] = 'ə'
412 |                     else:
413 |                         section_copy[i] = 'ɪ.'
414 | 
415 |             elif (i == len(section_copy) - 1) or (next_phon == '_'):  # post-tonic (last)
416 |                 if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):
417 |                     section_copy[i] = 'ə'
418 |                 elif 'hard' in previous_allophone.get('palatalization', ''):
419 |                     section_copy[i] = 'ʌ'
420 |                 else:
421 |                     section_copy[i] = 'æ.'
422 | 
423 |             else:
424 |                 if next_phon == '-':
425 |                     section_copy[i] = 'ɐ'  # first pretonic (first)
426 |                 elif next_phon != '+':
427 |                     section_copy[i] = 'ə'  # post-tonic / second pretonic (first)
428 | 
429 |         elif current_phon == 'e':
430 |             if (i != len(section_copy) - 1) and (next_phon != '_') \
431 |                     and (i != 0) and (previous_phon != '_'):  # not last, not first
432 | 
433 |                 if next_phon == '+':  # stressed (not last, not first)
434 |                     if previous_phon in zh_sh_ts:
435 |                         section_copy[i] = 'ᵻ'
436 |                     elif 'hard' in previous_allophone.get('palatalization', ''):
437 |                         section_copy[i] = 'ɛ'
438 | 
439 |                 elif next_phon == '-':  # first pretonic (not last, not first)
440 |                     if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):
441 |                         section_copy[i] = 'ə'
442 |                     elif 'hard' in previous_allophone.get('palatalization', ''):
443 |                         section_copy[i] = 'ᵻ'
444 |                     else:
445 |                         section_copy[i] = 'ɪ'
446 | 
447 |                 else:  # post-tonic / second pretonic (not last, not first)
448 |                     if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):
449 |                         section_copy[i] = 'ə'
450 |                     elif 'hard' in previous_allophone.get('palatalization', ''):
451 |                         section_copy[i] = 'ᵻ'
452 |                     else:
453 |                         section_copy[i] = 'ɪ.'
454 | 
455 |             elif (i == len(section_copy) - 1) or (next_phon == '_'):  # post-tonic (last)
456 |                 if (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):
457 |                     section_copy[i] = 'ə'
458 |                 elif 'hard' in previous_allophone.get('palatalization', ''):
459 |                     section_copy[i] = 'ᵻ'
460 |                 else:
461 |                     section_copy[i] = 'æ.'
462 | 
463 |             else:
464 |                 if next_phon == '+':
465 |                     section_copy[i] = 'ɛ'  # stressed (first)
466 |                 elif next_phon == '-':
467 |                     section_copy[i] = 'ᵻ'  # first pretonic (first)
468 |                 else:
469 |                     section_copy[i] = 'ɪ.'  # post-tonic / second pretonic (first)
470 | 
471 |         elif current_phon == 'u':
472 |             if (i != len(section_copy) - 1) and (next_phon != '_'):  # not last
473 | 
474 |                 if next_phon == '+':  # stressed (not last)
475 |                     if 'soft' in previous_allophone.get('palatalization', ''):
476 |                         section_copy[i] = 'ʉ'
477 | 
478 |                 else:  # first / second pretonic / post-tonic (not last)
479 |                     if 'hard' in previous_allophone.get('palatalization', ''):
480 |                         section_copy[i] = 'ʊ'
481 |                     else:
482 |                         section_copy[i] = 'ᵿ'
483 | 
484 |             else:  # first / second pretonic / post-tonic (last)
485 |                 if 'hard' in previous_allophone.get('palatalization', ''):
486 |                     section_copy[i] = 'ʊ'
487 |                 else:
488 |                     section_copy[i] = 'ᵿ'
489 | 
490 |         elif (current_phon == 'i') and (previous_allophone['phon'] == 'C'):
491 |             # after ж, ш, ц
492 |             if previous_phon in zh_sh_ts:
493 |                 section_copy[i] = 'ɨ'
494 |             elif next_phon != '+':  # unstressed
495 |                 section_copy[i] = 'ɪ'
496 | 
497 |         elif current_phon == 'ɨ':
498 |             if (i != len(section_copy) - 1) and (next_phon != '_'):  # not last
499 | 
500 |                 if next_phon == '+':  # stressed (not last)
501 |                     if (previous_phon == 'l') and (len(section_copy) > 4) \
502 |                             and ('lab' in after_previous_allophone.get('place', '')):
503 |                         section_copy[i] = 'ɯ̟ɨ̟'
504 |                     elif (previous_allophone.get('place', '') == 'lingual, dental'
505 |                           and after_next_allophone.get('place', '') == 'lingual, velar')\
506 |                             or (previous_allophone.get('place', '') == 'lingual, palatinоdental'
507 |                                 and after_next_allophone.get('place', '') == 'lingual, velar'):
508 |                         section_copy[i] = 'ɨ̟'
509 | 
510 |                 # pretonic / post-tonic (not last)
511 |                 elif (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):
512 |                     section_copy[i] = 'ə'
513 |                 else:
514 |                     section_copy[i] = 'ᵻ'
515 | 
516 |             elif (previous_allophone.get('hissing', '') == 'hissing') or (previous_phon in ts):  # post-tonic (last)
517 |                 section_copy[i] = 'ə'
518 |             else:
519 |                 section_copy[i] = 'ᵻ'
520 | 
521 |     return section_copy
522 | 
523 | 
524 | def labia_velar(segment: list):
525 |     result_segment = []
526 |     for i, current_phon in enumerate(segment):
527 |         if i != 0:
528 |             previous_phon = segment[i - 1]
529 |         else:
530 |             previous_phon = ''
531 | 
532 |         current_allophone = allophones[current_phon]
533 |         previous_allophone = allophones[previous_phon]
534 |         if (i != 0) and (current_allophone.get('round', '') == 'round') and (previous_phon != '_')\
535 |                 and (previous_allophone['phon'] == 'C') and ('ʷ' not in previous_phon) and ('ᶣ' not in previous_phon):
536 |             if 'ʲ' in previous_phon:
537 |                 new = previous_phon.replace('ʲ', '') + 'ᶣ'
538 |                 if new in allophones.keys():
539 |                     del result_segment[-1]
540 |                     result_segment.append(new)
541 |                     result_segment.append(current_phon)
542 |             elif previous_allophone.get('palatalization', '') == 'asoft':
543 |                 new = previous_phon + 'ᶣ'
544 |                 if new in allophones.keys():
545 |                     del result_segment[-1]
546 |                     result_segment.append(new)
547 |                     result_segment.append(current_phon)
548 |             else:
549 |                 new = previous_phon + 'ʷ'
550 |                 if new in allophones.keys():
551 |                     del result_segment[-1]
552 |                     result_segment.append(new)
553 |                     result_segment.append(current_phon)
554 | 
555 |         elif (i != 0) and (current_allophone.get('round', '') == 'velarize') and (previous_phon != '_')\
556 |                 and (previous_allophone['phon'] == 'C') and ('ˠ' not in previous_phon)\
557 |                 and ('soft' not in previous_allophone.get('palatalization', '')):
558 |             # Russian has no words beginning with ы
559 |             new = previous_phon + 'ˠ'
560 |             if new in allophones.keys():
561 |                 del result_segment[-1]
562 |                 result_segment.append(new)
563 |                 result_segment.append(current_phon)
564 | 
565 |         else:
566 |             result_segment.append(current_phon)
567 | 
568 |     return result_segment
569 | 
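
The `stunning` function above appears to implement word-final devoicing: a voiced paired consonant at the end of a word is replaced by its voiceless partner unless the next word starts with a vowel or a voiced consonant. The following is a minimal standalone sketch of that rule. `PAIRS`, `VOWELS`, `VOICED`, and `devoice_word_final` are hypothetical names introduced for illustration; the library itself reads these properties from the `allophones` dictionary, and its handling of edge cases may differ.

```python
# Hypothetical voiced -> voiceless table; the library builds its own
# mapping from data/paired_consonants.txt.
PAIRS = {'b': 'p', 'v': 'f', 'ɡ': 'k', 'd': 't', 'z': 's', 'ʐ': 'ʂ'}
VOWELS = set('aeiouɨ')
# Assumption: sonorants count as voiced for the purpose of this check.
VOICED = set(PAIRS) | {'m', 'n', 'l', 'r', 'j'}


def devoice_word_final(segment):
    """Devoice paired consonants at word ends ('_' separates words)."""
    result = list(segment)
    for i, phon in enumerate(segment):
        is_last = i == len(segment) - 1
        if not is_last and segment[i + 1] != '_':
            continue  # not at the end of a word
        if not is_last:
            # Look past the '_' separator at the first sound of the next word.
            following = segment[i + 2] if i + 2 < len(segment) else ''
            if following in VOICED or following in VOWELS:
                continue  # voicing is kept before a voiced sound or a vowel
        if phon in PAIRS:
            result[i] = PAIRS[phon]
    return result


print(devoice_word_final(['s', 'a', 'd']))
print(devoice_word_final(['s', 'a', 'd', '_', 't', 'a']))
print(devoice_word_final(['s', 'a', 'd', '_', 'i']))
```

The sketch reads from the input list and writes to a copy, so the look-ahead never sees a sound that was already changed earlier in the loop.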


--------------------------------------------------------------------------------
/src/tools/main_tools.py:
--------------------------------------------------------------------------------
  1 | import re
  2 | 
  3 | import nltk
  4 | from num2t4ru import num2text
  5 | 
  6 | # nltk.download('punkt')
  7 | # nltk.download('averaged_perceptron_tagger_ru')
  8 | 
  9 | 
 10 | def apply_differences(words):
 11 |     differences = {}
 12 |     for i, (char1, char2) in enumerate(zip(words[0], words[1].replace('+', ''))):
 13 |         if char1 != char2:
 14 |             differences[i + 1] = char2
 15 | 
 16 |     original_word, changed_word = words
 17 |     new_word = []
 18 |     n = 0
 19 |     for i, char in enumerate(changed_word):
 20 |         if char == '+':
 21 |             n += 1
 22 |             continue
 23 |         elif i + n + 1 in differences:
 24 |             new_word.append(differences[i + n + 1])
 25 |         else:
 26 |             new_word.append(char)
 27 | 
 28 |     return ''.join(new_word)
 29 | 
 30 | 
 31 | def get_punctuation_dict(text):
 32 |     """
 33 |     Returns a dictionary mapping the 1-based ordinal of each punctuation mark in the text
 34 |     to its pause symbol: '||' for sentence-final marks (. ? ! …) and '|' otherwise.
 35 |     """
 36 |     punctuation = r'.,:;()\—\|\?\!…'
 37 |     pause_dict = {}
 38 | 
 39 |     i = 1
 40 |     for char in text:
 41 |         if char in punctuation:
 42 |             pause_type = '||' if char in '.?!…' else '|'
 43 |             pause_dict[i] = pause_type
 44 |             i += 1
 45 | 
 46 |     return pause_dict
 47 | 
 48 | 
 49 | def custom_num2text(tokens: list):
 50 |     """
 51 |     Turns digits to words.
 52 |     """
 53 |     tokens_normal = []
 54 |     cache = {}
 55 | 
 56 |     for section_tokens in tokens:
 57 |         section_normal = []
 58 |         for word in section_tokens:
 59 |             if word.isnumeric():
 60 |                 if word not in cache:
 61 |                     cache[word] = num2text(int(word))
 62 |                 word_normal = cache[word]
 63 |                 section_normal.extend(word_normal.split(' '))
 64 |             else:
 65 |                 section_normal.append(word)
 66 |         tokens_normal.append(section_normal)
 67 | 
 68 |     return tokens_normal
 69 | 
 70 | 
 71 | def text_norm_tok(text: str):
 72 |     """
 73 |     Splits text into sections by punctuation (not including ' and ") and then tokenizes each section.
 74 |     """
 75 |     sections = re.split(r'[.?!,:;()—…]', text)
 76 |     sections = [re.sub(r'\s+', ' ', w) for w in sections if w != '']
 77 |     sections = [re.sub(r'\s$', '', w) for w in sections if w != '']
 78 |     sections = [re.sub(r'^\s', '', w) for w in sections if w != '']
 79 | 
 80 |     tokens = [[re.sub(r"[,.\\|/;:()*&^%$#@?!\[\]{}\"—…«»]", '', word) for word in section.split()]
 81 |               for section in sections]
 82 | 
 83 |     return custom_num2text(tokens)
 84 | 
 85 | 
 86 | adverb_adp = {'после', 'кругом', 'мимо', 'около', 'вокруг', 'напротив', 'поперёк'}
 87 | 
 88 | 
 89 | def find_clitics(dep, text, indexes=None):
 90 |     """
 91 |     Finds proclitics and enclitics in text by using dependency trees.
 92 |     Args:
 93 |       dep (class 'nltk.tree.tree.Tree'): dependency tree
 94 |       text (list): list of tokens in the text
 95 |       indexes (set[tuple]): set of (main word index, clitic index) tuples collected so far.
 96 |     """
 97 |     if indexes is None:
 98 |         indexes = set()
 99 |     functors_pos = {'CCONJ', 'PART', 'ADP'}
100 |     str_dep = str(dep)
101 | 
102 |     if len(str_dep.split(' ')) > 1:
103 |         for token in dep:
104 |             if isinstance(token, nltk.tree.Tree):
105 |                 indexes = find_clitics(token, text, indexes)
106 | 
107 |             elif (token.pos_ in functors_pos) and (token.text not in adverb_adp):
108 |                 clitic_index = token.i
109 |                 main_word_index = None
110 | 
111 |                 if (token.i < len(text) - 1) and (text[token.i + 1] in str_dep) \
112 |                         and (text[token.i + 1][0] not in 'еёюяи'):  # proclitic
113 |                     main_word_index = token.i + 1
114 | 
115 |                 elif (token.i > 0) and (text[token.i - 1] in str_dep):  # enclitic
116 |                     main_word_index = token.i - 1
117 | 
118 |                 if main_word_index is not None:
119 |                     indexes.add((main_word_index, clitic_index))
120 | 
121 |     return indexes
122 | 
123 | 
124 | def extract_phrasal_words(phonemes, indexes):
125 |     """
126 |     Joins clitics with main words.
127 |     Args:
128 |         phonemes (list): list of phonemes with '_' for spaces;
129 |         indexes (set[tuple]): set of (main word index, clitic index) tuples.
130 |     """
131 |     tokens_list = []
132 |     start_token_index = 0
133 | 
134 |     for i, current_phon in enumerate(phonemes):
135 |         if current_phon == '_':
136 |             tokens_list.append(phonemes[start_token_index:i])
137 |             start_token_index = i + 1
138 | 
139 |     tokens_list.append(phonemes[start_token_index:])
140 | 
141 |     phrasal_words = tokens_list[:]
142 |     n = 0
143 |     main_word_cache = []
144 |     enclitic_cache = []
145 | 
146 |     for tuple_indexes in indexes:
147 |         try:
148 |             main_word_index = tuple_indexes[0]
149 | 
150 |             if tuple_indexes[1] > main_word_index:  # proclitic
151 |                 main_word = phrasal_words[main_word_index + n] if main_word_index in main_word_cache\
152 |                     else tokens_list[main_word_index]
153 |                 main_word_cache.append(main_word_index)
154 |                 proclitic_index = tuple_indexes[1]
155 | 
156 |                 proclitic = [x for x in tokens_list[proclitic_index] if x != '+']
157 |                 phrasal_words.remove(tokens_list[proclitic_index])
158 |                 phrasal_words.remove(main_word)
159 |                 if proclitic_index == 1:
160 |                     phrasal_words.insert(0, main_word + proclitic)
161 |                 else:
162 |                     phrasal_words.insert(proclitic_index - main_word_cache.count(main_word_index),
163 |                                          main_word + proclitic)
164 |                 n -= 1
165 | 
166 |             else:  # enclitic
167 |                 main_word = phrasal_words[main_word_index - enclitic_cache.count(main_word_index)] \
168 |                     if main_word_index in enclitic_cache \
169 |                     else tokens_list[main_word_index]
170 |                 main_word_cache.append(main_word_index)
171 |                 enclitic_index = tuple_indexes[1]
172 |                 enclitic_cache.append(enclitic_index)
173 | 
174 |                 enclitic = [x for x in tokens_list[enclitic_index] if x != '+']
175 |                 phrasal_words.remove(tokens_list[enclitic_index])
176 |                 phrasal_words.remove(main_word)
177 |                 phrasal_words.insert(enclitic_index + n + enclitic_cache.count(main_word_index), enclitic + main_word)
178 |                 n -= 1
179 | 
180 |         except Exception:
181 |             continue
182 | 
183 |     phrasal_words_result = []
184 |     for token in phrasal_words:
185 |         phrasal_words_result.extend(token + ['_'])
186 |     del phrasal_words_result[-1]
187 | 
188 |     return phrasal_words_result
189 | 
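
To illustrate how `text_norm_tok` and `get_punctuation_dict` relate, here is a minimal standalone sketch: the text is split into sections at punctuation, and each punctuation mark gets a pause symbol keyed by its ordinal. `split_sections` and `pause_marks` are hypothetical names for this example only. The sketch leaves out the number-to-words step and the per-word symbol stripping that the functions above perform.

```python
import re

# The marks text_norm_tok splits on; \u2014 is the long dash.
SECTION_BREAKS = r'[.?!,:;()\u2014…]'


def split_sections(text):
    """Split text on punctuation, then split each section on whitespace."""
    sections = (part.strip() for part in re.split(SECTION_BREAKS, text))
    return [section.split() for section in sections if section]


def pause_marks(text):
    """Map the 1-based ordinal of each punctuation mark to '|' or '||'."""
    marks = {}
    ordinal = 1
    for char in text:
        if char in '.,:;()\u2014|?!…':
            # Sentence-final marks get a long pause, everything else a short one.
            marks[ordinal] = '||' if char in '.?!…' else '|'
            ordinal += 1
    return marks


text = 'Привет, мир! Как дела?'
print(split_sections(text))
print(pause_marks(text))
```

With one punctuation mark after each section, the n-th section is followed by the pause stored under key n, which is presumably how the two results are paired up by the caller.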


--------------------------------------------------------------------------------
/src/tools/sounds.py:
--------------------------------------------------------------------------------
  1 | from os.path import join, dirname, abspath
  2 | from collections import defaultdict
  3 | 
  4 | epi_starterpack = 'a b bʲ v vʲ ɡ ɡʲ d dʲ e ʒ z zʲ i j k kʲ l lʲ m mʲ n nʲ o '\
  5 |                   'p pʲ r rʲ s sʲ t tʲ u f fʲ x xʲ t͡s t͡ɕ ʂ ɕː ɨ d͡ʒ'.split()
  6 | ru_starterpack = 'ё й ц у к е н г ш щ з х ъ ф ы в а п р о л д ж э я ч с м и т ь б ю'.split()
  7 | rus_v = 'а е ё и о у э ю я ы'.split()  # russian vowels
  8 | 
  9 | ROOT_DIR = dirname(abspath(__file__))
 10 | 
 11 | with open(join(ROOT_DIR, '../data/alphabet.txt'), encoding='utf-8') as f:
 12 |     alphabet = f.read().split(', ')
 13 | 
 14 | with open(join(ROOT_DIR, '../data/sorted_allophones.txt'), encoding='utf-8') as f:
 15 |     sorted_phonemes_txt = (line.replace('\n', '') for line in f)
 16 |     sorted_phonemes_1 = {}
 17 |     for group in sorted_phonemes_txt:
 18 |         group_name, phonemes = group.split(' = ')
 19 |         sorted_phonemes_1[group_name] = phonemes.split(', ')
 20 | 
 21 | sorted_phonemes = defaultdict(list)
 22 | for key, value in sorted_phonemes_1.items():
 23 |     for element in value:
 24 |         sorted_phonemes[element].append(key)
 25 | 
 26 | with open(join(ROOT_DIR, '../data/paired_consonants.txt'), encoding='utf-8') as f:
 27 |     paired_c_txt = f.read().replace(')', ')_').split('_, ')
 28 |     paired_c = {voiced.replace('(', ''): silent.replace(')', '')
 29 |                 for voiced, silent in (pair.split(', ') for pair in paired_c_txt)}
 30 | 
 31 | # creating a dictionary with all allophones
 32 | allophones = {key: {'phon': 'V', 'row': None, 'rise': None, 'round': None, 'class': 'vowel'} if 'total_v' in sorted_phonemes[key]
 33 |               else {'phon': 'C', 'place': None, 'manner': None, 'palatalization': None, 'voice': None, 'pair': None,
 34 |                     'hissing': None, 'class': None}
 35 |               for key in alphabet}
 36 | # vowels
 37 | # row
 38 | row_map = {'front_v': 'front', 'near_front_v': 'near front', 'central_v': 'central', 'near_back_v': 'near back',
 39 |            'back_v': 'back'}
 40 | # rise
 41 | rise_map = {'close_v': 'close', 'near_close_v': 'near close', 'close_mid_v': 'close mid', 'mid_v': 'mid',
 42 |             'open_mid_v': 'open mid', 'near_open_v': 'near open', 'open_v': 'open'}
 43 | # round / velarize
 44 | round_map = {'rounded_v': 'round', 'velarize_v': 'velarize'}
 45 | # consonants
 46 | # place
 47 | place_map = {'bilabial_c': 'labial, bilabial', 'labiodental_c': 'labial, labiodental', 'dental_c': 'lingual, dental',
 48 |              'palatinodental_c': 'lingual, palatinоdental', 'palatal_c': 'lingual, palatal',
 49 |              'velar_c': 'lingual, velar', 'glottal_c': 'glottal'}
 50 | # manner
 51 | manner_map = {'explosive_c': 'obstruent, explosive', 'affricate_c': 'obstruent, affricate',
 52 |               'fricative_c': 'obstruent, fricative', 'nasal_c': 'sonorant, nasal',
 53 |               'lateral_c': 'sonorant, lateral', 'vibrant_c': 'sonorant, vibrant'}
 54 | # hard / soft
 55 | palatalization_map = {'hard_c': 'hard', 'always_hard_c': 'ahard', 'soft_c': 'soft', 'always_soft_c': 'asoft'}
 56 | # voice / silent
 57 | voice_map = {'voiced_c': 'voiced', 'voiceless_c': 'voiceless'}
 58 | paired_c_inv = {v: k for k, v in paired_c.items()}
 59 | # hissing sounds
 60 | hissing_map = {'hissing_c': 'hissing'}
 61 | # class
 62 | class_map = {'sonorous_class': 'sonorous', 'voiced_class': 'voiced',
 63 |              'voiceless_class': 'voiceless', 'hissing_class': 'hissing'}
 64 | # experiments
 65 | # allophones = {key: {'phon': 'V', 'row': None, 'rise': None, 'round': None, 'class': 'vowel', 'experiment': None} if 'total_v' in sorted_phonemes[key] else {'phon': 'C', 'place': None, 'manner': None, 'palatalization': None, 'voice': None, 'pair': None, 'hissing': None, 'class': None, 'experiment': None} for key in alphabet}
 66 | # experiment_map = {'complex_experiment': 'complex', 'rare_experiment': 'rare', 'random_vowels_experiment': 'random_vowel', 'long_consonants_experiment': 'long_consonant'}
 67 | 
 68 | for key in allophones.keys():
 69 |     for group in sorted_phonemes[key]:
 70 |         # experiments
 71 |         # experiment = experiment_map.get(group, None)
 72 |         # allophones[key]['experiment'] = experiment if experiment is not None else allophones[key]['experiment']
 73 | 
 74 |         # vowels
 75 |         if allophones[key]['phon'] == 'V':
 76 |             row = row_map.get(group, None)
 77 |             allophones[key]['row'] = row if row is not None else allophones[key]['row']
 78 |             rise = rise_map.get(group, None)
 79 |             allophones[key]['rise'] = rise if rise is not None else allophones[key]['rise']
 80 |             round_ph = round_map.get(group, None)
 81 |             allophones[key]['round'] = round_ph if round_ph is not None else allophones[key]['round']
 82 | 
 83 |         # consonants
 84 |         if allophones[key]['phon'] == 'C':
 85 |             place = place_map.get(group, None)
 86 |             allophones[key]['place'] = place if place is not None else allophones[key]['place']
 87 |             manner = manner_map.get(group, None)
 88 |             allophones[key]['manner'] = manner if manner is not None else allophones[key]['manner']
 89 |             palatalization = palatalization_map.get(group, None)
 90 |             allophones[key]['palatalization'] = palatalization if palatalization is not None \
 91 |                 else allophones[key]['palatalization']
 92 |             hissing = hissing_map.get(group, None)
 93 |             allophones[key]['hissing'] = hissing if hissing is not None else allophones[key]['hissing']
 94 |             class_consonants = class_map.get(group, None)
 95 |             allophones[key]['class'] = class_consonants if class_consonants is not None else allophones[key]['class']
 96 |             voice = voice_map.get(group, None)
 97 |             allophones[key]['voice'] = voice if voice is not None else allophones[key]['voice']
 98 |             if (allophones[key]['voice'] == 'voiced') and (key in paired_c.keys()):
 99 |                 allophones[key]['pair'] = paired_c[key]
100 |             elif (allophones[key]['voice'] == 'voiceless') and (key in paired_c.values()):
101 |                 allophones[key]['pair'] = paired_c_inv[key]
102 | 
103 | # symbols
104 | allophones.update({symbol: {'phon': 'symb'} for symbol in ['+', '-', '|', '||', '_', '']})
105 | 
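
Note on the loop above: each allophone's feature dictionary is filled by looking up every group tag in the maps and keeping the previous value when a tag has no entry in a given map. The following is a minimal, self-contained sketch of that pattern. `sample_groups` and `paired` are made-up stand-ins for `sorted_phonemes` and `paired_c`; they are assumptions for illustration, not the package's real data.

```python
# Reduced copies of two maps from sounds.py.
place_map = {'bilabial_c': 'labial, bilabial', 'dental_c': 'lingual, dental'}
voice_map = {'voiced_c': 'voiced', 'voiceless_c': 'voiceless'}

# Hypothetical sample data: allophone -> group tags it belongs to.
sample_groups = {
    'b': ['bilabial_c', 'voiced_c'],
    't': ['dental_c', 'voiceless_c'],
}
# Hypothetical voiced -> voiceless pairs and the inverse lookup.
paired = {'b': 'p', 'd': 't'}
paired_inv = {v: k for k, v in paired.items()}

features = {key: {'phon': 'C', 'place': None, 'voice': None, 'pair': None}
            for key in sample_groups}

for key, groups in sample_groups.items():
    for group in groups:
        # dict.get with the current value as default keeps earlier matches.
        features[key]['place'] = place_map.get(group, features[key]['place'])
        features[key]['voice'] = voice_map.get(group, features[key]['voice'])
    # Resolve the voiced/voiceless counterpart once the voice is known.
    if features[key]['voice'] == 'voiced' and key in paired:
        features[key]['pair'] = paired[key]
    elif features[key]['voice'] == 'voiceless' and key in paired_inv:
        features[key]['pair'] = paired_inv[key]

print(features['b'])
print(features['t'])
```

Passing the current value as the default of `dict.get` is equivalent to the `x if x is not None else old` form used in the module, as long as the maps never store `None`.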


--------------------------------------------------------------------------------
/src/tools/stress_tools.py:
--------------------------------------------------------------------------------
  1 | from os.path import join, dirname, abspath
  2 | 
  3 | from stressrnn import StressRNN
  4 | 
  5 | from .sounds import rus_v
  6 | 
  7 | ROOT_DIR = dirname(abspath(__file__))
  8 | 
  9 | with open(join(ROOT_DIR, '../data/error_words_stresses_default.txt'), encoding='utf-8') as file:
 10 |     error_words_stresses = file.readlines()
 11 | stress_default_dict = {}
 12 | for word in error_words_stresses:
 13 |     stress_default_dict[word.replace('+', '').replace('\n', '')] = word.replace('\n', '')
 14 | 
 15 | stress_rnn = StressRNN()
 16 | 
 17 | 
 18 | def place_stress(token: str, stress_accuracy_threshold: float):
 19 |     """
 20 |     Places a stress mark ('+') in a token that has none.
 21 |     Args:
 22 |       :param token: token without a stress mark.
 23 |       :param stress_accuracy_threshold: accuracy threshold passed to StressRNN.
 24 |     """
 25 |     if token in stress_default_dict.keys():
 26 |         return stress_default_dict[token]
 27 | 
 28 |     token_list = list(token)
 29 | 
 30 |     if 'ё' in token:
 31 |         token_list.insert(token.index('ё') + 1, '+')
 32 |         return ''.join(token_list)
 33 | 
 34 |     vowels_count = sum(1 for let in token if let in rus_v)
 35 | 
 36 |     if vowels_count == 1:
 37 |         for i, let in enumerate(token):
 38 |             if let in rus_v:
 39 |                 token_list.insert(i + 1, '+')
 40 |         return ''.join(token_list)
 41 | 
 42 |     if vowels_count == 0:
 43 |         return ''.join(token_list)
 44 | 
 45 |     # raise ValueError("Unfortunately, the automatic stress placement function is not yet available. "
 46 |     # f"Add stresses yourselves.\nThere is no stress for the word {token}")
 47 |     return stress_rnn.put_stress(token, accuracy_threshold=stress_accuracy_threshold)
 48 | 
 49 | 
 50 | def replace_stress(token):
 51 |     """
 52 |     Moves the first stress mark ('+') from before the stressed vowel to after it.
 53 |     Args:
 54 |       token (str): token with the stress mark placed before the stressed vowel.
 55 |     """
 56 |     plus_index = token.find('+')
 57 |     new_token_split = list(token)
 58 |     new_token_split.remove('+')
 59 |     new_token_split.insert(plus_index + 1, '+')
 60 |     return ''.join(new_token_split)
 61 | 
 62 | 
 63 | def remove_extra_stresses(string: str):  # keeps only the first '+' and drops the rest
 64 |     first_plus_index = string.find('+')
 65 |     return string[:first_plus_index + 1] + string[first_plus_index + 1:].replace('+', '')
 66 | 
 67 | 
 68 | def replace_stress_before(text):  # moves every '+' one position to the left; returns a list of characters
 69 |     if isinstance(text, str):
 70 |         text = list(text)
 71 | 
 72 |     text_copy = text.copy()
 73 |     for i, char in enumerate(text):
 74 |         if char == '+':
 75 |             text_copy.pop(i)
 76 |             text_copy.insert(i - 1, '+')
 77 |     return text_copy
 78 | 
 79 | 
 80 | def put_stresses(tokens_list: list, stress_place: str = 'after', stress_accuracy_threshold: float = 0.86):
 81 |     """
 82 |     Puts or replaces stresses.
 83 | 
 84 |     :param tokens_list: List of tokens.
 85 |     :param stress_place: position of the stress symbol in the input tokens: 'after' - the symbol follows
 86 |         the stressed vowel (kept as is), 'before' - the symbol precedes it (moved to after the vowel).
 87 |     :param stress_accuracy_threshold: A threshold for the accuracy of stress placement for StressRNN.
 88 |     :return: List of tokens.
 89 |     """
 90 |     res = []
 91 |     for token in tokens_list:
 92 |         if ('+' in token) and (stress_place == 'before'):  # need to replace
 93 |             res.append(replace_stress(token))
 94 |         elif '+' not in token:  # use StressRNN
 95 |             res.append(place_stress(token, stress_accuracy_threshold))
 96 |         else:
 97 |             res.append(token)
 98 | 
 99 |     return res
100 | 
101 | 
102 | """
103 | [
104 | replace_stress(token) if ('+' in token) and (stress_place == 'before')  # need to replace
105 | else place_stress(token, stress_accuracy_threshold) if ('+' not in token)  # use StressRNN
106 | else token
107 | for token in tokens_list
108 | ]
109 | """
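
Note on the stress-mark helpers above: the sketch below re-creates the two string manipulations from `replace_stress` and `remove_extra_stresses` so they can be tried without installing `stressrnn`. The function names `move_stress_after` and `keep_first_stress` are hypothetical, and the early return for tokens without `+` is an added guard that the original `replace_stress` does not have.

```python
def move_stress_after(token: str) -> str:
    """Move the first '+' from before the stressed vowel to after it."""
    plus_index = token.find('+')
    if plus_index == -1:
        return token  # added guard: nothing to move
    chars = list(token)
    chars.pop(plus_index)              # drop the mark standing before the vowel
    chars.insert(plus_index + 1, '+')  # re-insert it right after that vowel
    return ''.join(chars)


def keep_first_stress(string: str) -> str:
    """Keep only the first '+' in the string and remove all later ones."""
    first = string.find('+')
    return string[:first + 1] + string[first + 1:].replace('+', '')


print(move_stress_after('молок+о'))
print(keep_first_stress('ма+ма+'))
```

After the mark is popped, the stressed vowel sits at `plus_index`, so inserting at `plus_index + 1` places the mark directly behind it.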


--------------------------------------------------------------------------------
/src/tools/syntax_tree.py:
--------------------------------------------------------------------------------
 1 | import spacy
 2 | from nltk import Tree
 3 | 
 4 | 
 5 | class SyntaxTree:
 6 |     def __init__(self):
 7 |         self.dependency_tree = None
 8 |         self.nlp = spacy.load('ru_core_news_sm')
 9 | 
10 |     def to_nltk_tree(self, node):
11 |         if node.n_lefts + node.n_rights > 0:
12 |             return Tree(node, [self.to_nltk_tree(child) for child in node.children])
13 | 
14 |         return node
15 | 
16 |     def make_dependency_tree(self, text):
17 |         """
18 |         Makes a dependency tree. If the text contains several sentences, only the tree of the last one is kept.
19 |         Args:
20 |           text (str): original text
21 |         """
22 |         doc = self.nlp(text)
23 |         for sent in doc.sents:
24 |             self.dependency_tree = self.to_nltk_tree(sent.root)
25 | 
26 |         return self.dependency_tree
27 | 
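
Note on `to_nltk_tree` above: the method recurses over a token's children and returns the bare token for leaves. The sketch below shows the same recursion on a plain data structure so it runs without spaCy or NLTK. The `Node` class and `to_nested` function are hypothetical stand-ins for a spaCy token and the `Tree` constructor, and the sample sentence structure is written by hand rather than parsed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token with dependents."""
    text: str
    children: List["Node"] = field(default_factory=list)


def to_nested(node: Node):
    """Return (text, [subtrees]) for inner nodes and plain text for leaves."""
    if node.children:
        return (node.text, [to_nested(child) for child in node.children])
    return node.text


# Hand-built structure for "мама читает интересную книгу".
root = Node('читает', [
    Node('мама'),
    Node('книгу', [Node('интересную')]),
])

print(to_nested(root))
```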


--------------------------------------------------------------------------------