├── IPATranscriber
│   ├── README.md
│   └── ipatranscriber.py
├── README.md
├── .gitignore
├── LICENSE
├── UyghurTransliterator
│   ├── README.md
│   └── uyghurtransliterator.py
└── WiktionaryScraper
    ├── README.md
    └── wiktionaryscraper.py
/IPATranscriber/README.md:
--------------------------------------------------------------------------------
1 | # Uyghur IPA Transcriber
2 | 
3 | This script takes as input a list of Uyghur words in Latin orthography (one
4 | word per line) and returns a list of those words and their broad phonemic
5 | transcriptions (one word and its transcription per line).
6 | 
7 | Thus the input:
8 | 
9 | ```
10 | yéziliq
11 | zeple-
12 | yashliq
13 | a'ile
14 | ```
15 | 
16 | Returns the output:
17 | 
18 | ```
19 | yéziliq;jeziliqʰ
20 | zeple-;zɛplɛ-
21 | yashliq;jaʃliqʰ
22 | a'ile;aˀilɛ
23 | ```
24 | 
25 | The IPA output is largely a one-to-one substitution of graphs or digraphs for
26 | their phonemic values, with the exception that aspiration is blocked before consonants.
27 | Future versions of this script will take other orthographies (i.e., Uyghur
28 | Perso-Arabic script and Uyghur Cyrillic) as input.
29 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Uyghur-resources
2 | 
3 | This repository contains a collection of resources for Uyghur linguistics,
4 | mostly constructed to assist in the initial data-gathering and data-organizing
5 | stages of a larger project, the goal of which is the construction of an
6 | automated speech-recognition system for the Uyghur language.
7 | 
8 | * **IPATranscriber** contains a Python script which outputs broad phonemic
9 | transcriptions (in the International Phonetic Alphabet) of Uyghur words in
10 | Latin orthography.
11 | 
12 | * **UyghurTransliterator** contains a Python script which transliterates an
13 | input file (in Uyghur) from one writing system to another. Nine writing
14 | systems are currently supported.
15 | 
16 | * **WiktionaryScraper** contains a Python script which fetches English
17 | translations of Mandarin words from [Wiktionary](https://en.wiktionary.org).
18 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # miscellaneous input and output files
2 | *.txt
3 | *.xml
4 | *.ods
5 | *.xls
6 | 
7 | # Byte-compiled / optimized / DLL files
8 | __pycache__/
9 | *.py[cod]
10 | 
11 | # C extensions
12 | *.so
13 | 
14 | # Distribution / packaging
15 | .Python
16 | env/
17 | build/
18 | develop-eggs/
19 | dist/
20 | downloads/
21 | eggs/
22 | .eggs/
23 | lib/
24 | lib64/
25 | parts/
26 | sdist/
27 | var/
28 | *.egg-info/
29 | .installed.cfg
30 | *.egg
31 | 
32 | # PyInstaller
33 | # Usually these files are written by a python script from a template
34 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
35 | *.manifest
36 | *.spec
37 | 
38 | # Installer logs
39 | pip-log.txt
40 | pip-delete-this-directory.txt
41 | 
42 | # Unit test / coverage reports
43 | htmlcov/
44 | .tox/
45 | .coverage
46 | .coverage.*
47 | .cache
48 | nosetests.xml
49 | coverage.xml
50 | *.cover
51 | 
52 | # Translations
53 | *.mo
54 | *.pot
55 | 
56 | # Django stuff:
57 | *.log
58 | 
59 | # Sphinx documentation
60 | docs/_build/
61 | 
62 | # PyBuilder
63 | target/
64 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) 2015 Matt Menzenski
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
23 | 
--------------------------------------------------------------------------------
/UyghurTransliterator/README.md:
--------------------------------------------------------------------------------
1 | # Uyghur Transliterator
2 | 
3 | This script transliterates an input file (in Uyghur) from one writing system
4 | to another; nine writing systems are currently supported (they are listed
5 | under "Supported orthographies" below).
6 | 
7 | ## Usage
8 | 
9 | The script may be imported as a module, but was written to be called from the
10 | command line. The template is `python uyghurtransliterator.py inputfilename.txt inputOrthography outputOrthography (outputfilename.txt)`, where `inputfilename.txt` is the name of the input file, `inputOrthography` and
11 | `outputOrthography` are, respectively, the name of the orthography used in the
12 | input file and the desired output orthography, and `outputfilename.txt` is the
13 | name of the output file. The name of the output file is optional: if one is not
14 | supplied, the input file will be overwritten with the results of the
15 | transliteration.
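For example, to convert a word list from Uyghur Latin orthography to IPA while keeping the original file intact, a call might look like this (the file names here are placeholders, not files shipped with the repository):

```
python uyghurtransliterator.py wordlist.txt UyLatin IPA wordlist_ipa.txt
```

`UyLatin` and `IPA` are two of the orthography names listed in the next section; omitting the final argument would overwrite `wordlist.txt` in place.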
16 | 17 | ## Supported orthographies 18 | 19 | * `IPA` -- the International Phonetic Alphabet 20 | * `UyArabic` -- Uyghur Arabic 21 | * `UyLatin` -- Uyghur Latin 22 | * `UyCyrillic` -- Uyghur Cyrillic 23 | * `ChineseLatin` 24 | * `MengesLatin` 25 | * `JarringLatin` 26 | * `JarringArabic` 27 | * `MalovLatin` 28 | -------------------------------------------------------------------------------- /WiktionaryScraper/README.md: -------------------------------------------------------------------------------- 1 | # Wiktionary Scraper 2 | 3 | I have access to a Uyghur lexicon with 33,810 headwords, but only about a fifth 4 | of those entries have a translation provided in English. Most entries are 5 | glossed in Mandarin instead. I don't know Mandarin, and needed a way to get 6 | high-quality English translations of those Mandarin glosses. 7 | 8 | To that end, I threw together the script `wiktionaryscraper.py`, which does two 9 | things: 10 | 11 | 1. Fetches English translations (from [Wiktionary](https://en.wiktionary.org)) 12 | of those Mandarin glosses 13 | 2. Preserves the original order of the input (both the order of multiple 14 | Mandarin glosses in one line as well as the order of lines are respected) 15 | 16 | Each line of the input file takes the form `[numeral],[Mandarin gloss(es)]`, 17 | with the numeral and comma mandatory and the Mandarin gloss(es) optional. The 18 | following lines are all valid input: 19 | 20 | ``` 21 | 4745,小心;轻轻地 22 | 4746,响应 23 | 4747, 24 | 4748, 25 | ``` 26 | 27 | The output is formatted in much the same way, only with a semicolon rather than 28 | a comma separating the numeral and the gloss (for easier copy-pasting into a 29 | spreadsheet). The output corresponding to the above input example looks like 30 | this: 31 | 32 | ``` 33 | 4745;careful (小心), 34 | 4746;to respond, to answer (响应), 35 | 4747; 36 | 4748; 37 | ``` 38 | 39 | Note that only the first of the two Mandarin glosses in line 4745 has an entry 40 | on Wiktionary, so only that gloss returns an English translation in the output. 41 | The Mandarin gloss follows its English translation in parentheses. 42 | -------------------------------------------------------------------------------- /IPATranscriber/ipatranscriber.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | ########## 5 | ## ipatranscriber.py Version 1.0 (2015-07-21) 6 | ## 7 | ## Original author: Matthew Menzenski (menzenski@ku.edu) 8 | ## 9 | ## License: MIT ( http://opensource.org/licenses/MIT ) 10 | ## 11 | ## 12 | ### The MIT License (MIT) 13 | ### 14 | ### Copyright (c) 2015 Matt Menzenski 15 | ### 16 | ### Permission is hereby granted, free of charge, to any person obtaining a 17 | ### copy of this software and associated documentation files (the "Software"), 18 | ### to deal in the Software without restriction, including without limitation 19 | ### the rights to use, copy, modify, merge, publish, distribute, sublicense, 20 | ### and/or sell copies of the Software, and to permit persons to whom the 21 | ### Software is furnished to do so, subject to the following conditions: 22 | ### 23 | ### The above copyright notice and this permission notice shall be included in 24 | ### all copies or substantial portions of the Software. 
25 | ###
26 | ### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
27 | ### OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
28 | ### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
29 | ### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
30 | ### LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
31 | ### FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
32 | ### DEALINGS IN THE SOFTWARE.
33 | ##
34 | ##########
35 | 
36 | """
37 | Take a list of Uyghur words and add broad IPA transcription.
38 | 
39 | The input:
40 | 
41 | yéziliq
42 | zeple-
43 | yashliq
44 | a'ile
45 | 
46 | Returns the output:
47 | 
48 | yéziliq;jeziliqʰ
49 | zeple-;zɛplɛ-
50 | yashliq;jaʃliqʰ
51 | a'ile;aˀilɛ
52 | 
53 | The IPA output is largely a one-to-one substitution of graphs or digraphs for
54 | their phonemic values, with the exception that aspiration is blocked before consonants.
55 | """
56 | 
57 | from __future__ import unicode_literals
58 | import codecs
59 | 
60 | ## our Uyghur word list
61 | input_file = "uyghuritems.txt"
62 | 
63 | ## orthography/IPA pairs in which one or both members are digraphs
64 | uyghur_multiples = {
65 |     "ch": "ʧʰ",
66 |     "gh": "ɣ",
67 |     "ng": "ŋ",
68 |     "sh": "ʃ",
69 |     "zh": "ʒ",
70 |     "p": "pʰ",
71 |     "t": "tʰ",
72 |     "q": "qʰ",
73 |     "k": "kʰ",
74 |     ",": " | ",
75 |     ".": " | ",
76 |     ":": " | ",
77 |     ";": " | ",
78 |     "?": " | ",
79 |     "!": " | "
80 | }
81 | 
82 | ## orthography/IPA pairs in which both members are a single character
83 | ## (pairs in which IPA and orthography are equal don't need to be replaced.)
84 | uyghur_singles = {
85 |     "e": "ɛ",
86 |     "é": "e",
87 |     "x": "χ",
88 |     "j": "ʤ",
89 |     "ö": "ø",
90 |     "‘": "ˀ",
91 |     "'": "ˀ"
92 | }
93 | 
94 | ## y and ü get their own replacement steps, applied after the others, since the
95 | ## output of each is the input of another replacement (otherwise, we'd get ʤ as output when we want j)
96 | uyghur_y = {
97 |     "y": "j"
98 | }
99 | 
100 | uyghur_u = {
101 |     "ü": "y"
102 | }
103 | 
104 | def uyghur_latin_to_ipa(word):
105 |     """Return broad IPA transcription of a Uyghur word in Latin orthography."""
106 | 
107 |     ## new_word will be the output; start by setting it equal to the input
108 |     new_word = word
109 | 
110 |     ## first replace the digraphs
111 |     for char in uyghur_multiples.keys():
112 |         new_word = new_word.replace(char, uyghur_multiples[char])
113 | 
114 |     ## then the single characters
115 |     for char in uyghur_singles.keys():
116 |         new_word = new_word.replace(char, uyghur_singles[char])
117 | 
118 |     ## then y
119 |     for char in uyghur_y.keys():
120 |         new_word = new_word.replace(char, uyghur_y[char])
121 | 
122 |     ## then ü
123 |     for char in uyghur_u.keys():
124 |         new_word = new_word.replace(char, uyghur_u[char])
125 | 
126 |     ## list of Uyghur consonants (from "uigCLpixzd2ipa.xsl")
127 |     consonants = [
128 |         "b", "d", "g", "ɣ", "h", "χ", "ʤ", "k", "q", "l", "ɫ", "m", "n",
129 |         "ŋ", "p", "r", "s", "ʃ", "t", "ʧ", "w", "j", "z", "ʒ"
130 |     ]
131 | 
132 |     ## create a list of aspiration ("ʰ") + consonant sequences
133 |     sequences = ["ʰ" + consonant for consonant in consonants]
134 | 
135 |     ## replace sequences of ʰ + consonant with the plain consonant
136 |     for sequence in sequences:
137 |         new_word = new_word.replace(sequence, sequence[1:])
138 | 
139 |     ## output the word in IPA transcription
140 |     return new_word
141 | 
142 | def main():
143 |     ## open input file in read-only mode with utf-8 encoding
144 |     with codecs.open(input_file, mode='r', encoding='utf-8') as words:
145 |         for line in words:
146 |             ## replace spaces with an underscore---we'll undo this later
147 |             ## (lines get split on whitespace, so this keeps entries together)
148 |             line = line.replace(" ", "_")
149 |             for word in line.split():
150 |                 myipa = uyghur_latin_to_ipa(word)
151 |                 ## output with a semicolon between for straightforward
152 |                 ## pasting into a spreadsheet
153 |                 print "{};{}".format(
154 |                     word.replace("_"," "), myipa.replace("_", " "))
155 | 
156 | ## run the main() function if this script is called as a standalone script
157 | ## (but if imported, e.g., using the call
158 | ## from ipatranscriber import uyghur_latin_to_ipa
159 | ## the main() function won't run)
160 | if __name__ == "__main__":
161 |     main()
162 | 
--------------------------------------------------------------------------------
/UyghurTransliterator/uyghurtransliterator.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | ##########
5 | ## uyghurtransliterator.py Version 0.2 (2015-11-09)
6 | ##
7 | ## Original author: Matthew Menzenski (menzenski@ku.edu)
8 | ##
9 | ## License: MIT ( http://opensource.org/licenses/MIT )
10 | ##
11 | ##
12 | ### The MIT License (MIT)
13 | ###
14 | ### Copyright (c) 2015 Matt Menzenski
15 | ###
16 | ### Permission is hereby granted, free of charge, to any person obtaining a
17 | ### copy of this software and associated documentation files (the "Software"),
18 | ### to deal in the Software without restriction, including without limitation
19 | ### the rights to use, copy, modify, merge, publish, distribute, sublicense,
20 | ### and/or sell copies of the Software, and to permit persons to whom the
21 | ### Software is furnished to do so, subject to the following conditions:
22 | ###
23 | ### The above copyright notice and this permission notice shall be included in
24 | ### all copies or substantial portions of the Software.
25 | ###
26 | ### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
27 | ### OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
28 | ### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL 29 | ### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 30 | ### LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 31 | ### FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 32 | ### DEALINGS IN THE SOFTWARE. 33 | ## 34 | ########## 35 | 36 | """Convert a Uyghur text between different orthographies.""" 37 | 38 | from __future__ import print_function 39 | 40 | import codecs 41 | import sys 42 | 43 | def to_unicode_or_bust(obj, encoding='utf-8'): 44 | """Ensure that an object is unicode.""" 45 | # function by Kuman McMillan ( http://farmdev.com/talks/unicode ) 46 | if isinstance(obj, basestring): 47 | if not isinstance(obj, unicode): 48 | obj = unicode(obj, encoding) 49 | return obj 50 | 51 | class UyghurString(object): 52 | """String object containing text in the Uyghur language.""" 53 | 54 | def __init__(self, input_text, input_orth): 55 | """Initialize text object. 56 | 57 | Parameters 58 | --------- 59 | input_text (str): string containing Uyghur text 60 | input_orth (str): input orthography --- must be one of these: 61 | 'IPA', 'UyArabic', 'UyLatin', 'UyCyrillic', 'ChineseLatin', 62 | 'MengesLatin', 'JarringLatin', 'JarringArabic', 'MalovLatin' 63 | """ 64 | self.input_text = input_text 65 | self.input_orth = input_orth 66 | self.orth_key = { 67 | 'IPA': 0, 68 | 'UyArabic': 1, 69 | 'UyLatin': 2, 70 | 'UyCyrillic': 3, 71 | 'ChineseLatin': 4, 72 | 'MengesLatin': 5, 73 | 'JarringLatin': 6, 74 | 'JarringArabic': 7, 75 | 'MalovLatin': 8 76 | } 77 | self.uyghur_orthographies = ( 78 | (u'a', u'\u0627', u'a', u'а', u'a', u'a', u'a', 7, u'а'), 79 | (u'ɑ', u'\u0627', u'a', u'а', u'a', u'á', u'a', 7, 8), 80 | (u'aː', u'\u0627', u'a', u'а', u'a', u'ā', u'aː', u'\u0627', 8), 81 | (u'ɛ', u'\u06D5', u'e', u'е', u'e', u'ä', u'ɛ', u'\u06D5', u'ӓ'), 82 | (u'æ', u'\u06D5', u'e', u'е', u'e', u'ä', u'æ', u'\u06D5', 8), 83 | (u'b', u'\u0628', u'b', u'б', u'b', u'b', u'b', u'\u0628', u'б'), 84 | (u'd', u'\u062F', u'd', u'д', u'd', u'd', u'd', u'\u062F', u'д'), 85 | (u'e', u'\u06D0', u'ë', u'е', u'e', u'e', u'e', 7, u'е'), 86 | (u'f', u'\u0641', u'f', u'ф', u'f', u'f', u'f', u'\u0641', 8), 87 | (u'ɡ', u'\u06AF', u'g', u'г', u'g', u'g', u'g', u'\u06AF', u'г'), 88 | (u'ɣ', u'\u063A', u'gh', u'ғ', u'ƣ', u'ɣ', u'ɣ', u'\u063A', u'ҕ'), 89 | (u'h', u'\u0647', u'h', u'һ', u'ħ', u'h', u'h', u'\u0647', 8), 90 | (u'χ', u'\u062E', u'x', u'х', u'h', u'x', u'χ', u'\u062E', u'х'), 91 | (u'i', u'\u0649', u'i', u'и', u'i', u'i', u'i', u'\u0649', u'i'), 92 | (u'ɨ', u'\u0649', u'i', u'и', u'i', u'i', u'ï', u'\u0649', u'ы'), 93 | (u'dʒ', u'\u062C', u'j', u'ж', u'j', u'dž', u'dʒ', u'\u062C', u'з'), 94 | (u'kʰ', u'\u0643', u'k', u'k', u'k', u'k', u'k', u'\u0643', u'k'), 95 | (u'qʰ', u'\u0642', u'q', u'к', u'ḳ', u'q', u'q', u'\u0642', u'к'), 96 | (u'l', u'\u0644', u'l', u'л', u'l', u'l', u'l', u'\u0644', u'л'), 97 | (u'ł', u'\u0644', u'l', u'л', u'l', u'ł', u'l', u'\u0644', u'l'), 98 | (u'm', u'\u0645', u'm', u'м', u'm', u'm', u'm', u'\u0645', u'м'), 99 | (u'n', u'\u0646', u'n', u'н', u'n', u'n', u'n', u'\u0646', u'н'), 100 | (u'ŋ', u'\u06AD', u'ng', u'ң', u'ng', u'ñ', u'ŋ', u'\u06AD', u'ң'), 101 | (u'o', u'\u0648', u'o', u'о', u'o', u'o', u'o', u'\u0648', u'о'), 102 | (u'ø', u'\u06C6', u'ö', u'ө', u'ɵ', u'ö', u'ö', u'\u0648', u'ӧ'), 103 | (u'pʰ', u'\u067E', u'p', u'п', u'p', u'p', u'p', u'\u067E', u'п'), 104 | (u'r', u'\u0631', u'r', u'р', u'r', u'r', u'r', u'\u0631', u'р'), 105 | (u's', u'\u0633', u's', u'с', u's', u's', u's', u'\u0633', u'с'), 
106 | (u'ʃ', u'\u0634', u'sh', u'ш', u'x', u'š', u'š', u'\u0634', u'ш'), 107 | (u'tʰ', u'\u062A', u't', u'т', u't', u't', u't', u'\u062A', u'т'), 108 | (u'tʃʰ', u'\u0686', u'ch', u'ч', u'q', u'č', u'č', u'\u0686', u'ч'), 109 | (u'u', u'\u06C7', u'u', u'у', u'u', u'u', u'u', u'\u0648', u'у'), 110 | (u'ɯ', u'\u06C7', u'u', u'у', u'u', u'ŏ', u'ɯ', u'\u0648', 8), 111 | (u'ʏ', u'\u06C7', u'u', u'у', u'u', u'ů', u'ů', u'\u0648', 8), 112 | (u'y', u'\u06C8', u'ü', u'ү', u'ü', u'ü', u'ů', u'\u06C8', 8), 113 | (u'yː', u'\u06C8', u'ü', u'ү', u'ü', u'ṻ', u'ůː', u'\u06C8', u'ӱ'), 114 | (u'ŭ', u'\u06C7', u'u', u'у', u'u', u'u', u'ŭ', u'\u06C8', 8), 115 | (u'w', u'\u06CB', u'w', u'в', u'w', u'w', u'v', u'\u06CB', u'в'), 116 | (u'j', u'\u064A', u'y', u'й', u'y', u'j', u'j', 7, u'ĭ'), 117 | (u'z', u'\u0632', u'z', u'з', u'z', u'z', u'z', u'\u0632', u'z'), 118 | (u'ʒ', u'\u0698', u'zh', u'ж', u'zh', u'ž', 6, 7, u'з'), 119 | (u'ʔ', u'\u0621', u"'", 3, u"'", u"'", u"'", 7, 8), 120 | (0, u'\u0626', u'', 3, 4, 5, 6, 7, 8), 121 | (0, u'\u06BE', u'h', 3, 4, 5, 6, 7, 8), 122 | ) 123 | 124 | def as_string(self): 125 | """Read the input file's contents into a string.""" 126 | with codecs.open(self.input_text, 'r+', encoding='utf-8') as f: 127 | return to_unicode_or_bust(f.read().replace(u'\n', u'')) 128 | 129 | def transliterate(self, output_orth, input_string=None): 130 | """Transliterate text to specified output orthography. 131 | 132 | Parameters 133 | --------- 134 | output_orth (str): output orthography --- must be one of these: 135 | 'IPA', 'UyArabic', 'UyLatin', 'UyCyrillic', 'ChineseLatin', 136 | 'MengesLatin', 'JarringLatin', 'JarringArabic', 'MalovLatin' 137 | """ 138 | if input_string == None: 139 | input_string = self.as_string() 140 | 141 | idx_c = self.orth_key[self.input_orth] 142 | idx_d = self.orth_key[output_orth] 143 | 144 | ## TODO: fix this case handling. This should give correct output for 145 | ## the three orthographies that don't distinguish case, but it won't 146 | ## transliterate an upper-case letter to another upper-case letter. 147 | ## TODO: make an upper-case version of the orthography dict? 148 | ## e.g., something like: 149 | ## try: 150 | ## upper_case = tup[idx_c].upper() ## for every entry? 
151 |         caseless_orths = ['IPA', 'UyArabic', 'JarringArabic']
152 |         if output_orth in caseless_orths:
153 |             text_in = input_string.lower()
154 |             text_out = input_string.lower()
155 |         elif output_orth not in caseless_orths:
156 |             text_in = input_string
157 |             text_out = input_string
158 |         ## TODO: Make the above more elegant
159 | 
160 |         for tup in self.uyghur_orthographies:
161 |             input_char = tup[idx_c]
162 |             output_char = tup[idx_d]
163 | 
164 |             if isinstance(input_char, int) or isinstance(output_char, int):
165 |                 pass
166 |             else:
167 |                 text_out = text_out.replace(input_char, output_char)
168 | 
169 |         return text_out
170 | 
171 | def main(input_file, input_orth, output_orth, output_file=None):
172 |     """Convert file contents from one orthography to another."""
173 |     if output_file is None:
174 |         output_file = input_file
175 |     uy = UyghurString(input_file, input_orth)
176 |     with codecs.open(output_file, 'w+', encoding='utf-8') as stream:
177 |         stream.write(uy.transliterate(output_orth))
178 | 
179 | if __name__ == "__main__":
180 |     if len(sys.argv) == 5:
181 |         in_file = sys.argv[1] # input filename
182 |         in_orth = sys.argv[2] # input orthography
183 |         out_orth = sys.argv[3] # output orthography
184 |         out_file = sys.argv[4] # output filename
185 | 
186 |         main(in_file, in_orth, out_orth, out_file)
187 | 
188 |     elif len(sys.argv) == 4:
189 |         in_file = sys.argv[1] # input filename
190 |         in_orth = sys.argv[2] # input orthography
191 |         out_orth = sys.argv[3] # output orthography
192 | 
193 |         main(in_file, in_orth, out_orth)
194 | 
195 |     else:
196 |         print("\nUSAGE:\n\tpython uyghurtransliterator.py "
197 |               "inputfilename.txt inputOrthography outputOrthography "
198 |               "(outputfilename.txt)\n\n"
199 |               "inputOrthography and outputOrthography must be one of these:\n"
200 |               "\t'IPA', 'UyArabic', 'UyLatin', 'UyCyrillic', 'ChineseLatin'\n"
201 |               "\t'MengesLatin', 'JarringLatin', 'JarringArabic', 'MalovLatin'\n")
202 | 
--------------------------------------------------------------------------------
/WiktionaryScraper/wiktionaryscraper.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | ##########
5 | ## wiktionaryscraper.py Version 1.0 (2015-07-20)
6 | ##
7 | ## Original author: Matthew Menzenski (menzenski@ku.edu)
8 | ##
9 | ## License: MIT ( http://opensource.org/licenses/MIT )
10 | ##
11 | ##
12 | ### The MIT License (MIT)
13 | ###
14 | ### Copyright (c) 2015 Matt Menzenski
15 | ###
16 | ### Permission is hereby granted, free of charge, to any person obtaining a
17 | ### copy of this software and associated documentation files (the "Software"),
18 | ### to deal in the Software without restriction, including without limitation
19 | ### the rights to use, copy, modify, merge, publish, distribute, sublicense,
20 | ### and/or sell copies of the Software, and to permit persons to whom the
21 | ### Software is furnished to do so, subject to the following conditions:
22 | ###
23 | ### The above copyright notice and this permission notice shall be included in
24 | ### all copies or substantial portions of the Software.
25 | ###
26 | ### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
27 | ### OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
28 | ### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL 29 | ### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 30 | ### LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 31 | ### FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 32 | ### DEALINGS IN THE SOFTWARE. 33 | ## 34 | ########## 35 | 36 | """ 37 | Get English translation from Wiktionary of each Mandarin word in a list. 38 | 39 | Take as input a text file in which each line contains a numeral and one 40 | or more Mandarin terms, separated by a comma, and return a text file in 41 | which each line contains a numeral (in the same order as in the input 42 | file) followed by a translation of each Mandarin term in that line (if 43 | a Wiktionary page exists for that term). 44 | 45 | E.g., the line 46 | 47 | 48,生气,发怒 48 | 49 | in the input yields 50 | 51 | 48;angry (生气), (literary) to become angry (发怒), 52 | 53 | in the output. 54 | """ 55 | 56 | #from __future__ import unicode_literals 57 | from bs4 import BeautifulSoup as Soup 58 | from urllib import FancyURLopener 59 | import urllib2 60 | import codecs 61 | import time 62 | import random 63 | 64 | input_file = "uyghurchineseitemswithindex.txt" 65 | 66 | results_file = "wiktionaryoutput.txt" 67 | 68 | 69 | class MyOpener(FancyURLopener): 70 | """FancyURLopener object with custom User-Agent field.""" 71 | 72 | ## regular Mac Safari browser: 73 | #version = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) " 74 | # "AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 " 75 | # "Safari/600.5.17") 76 | 77 | ## identify this web scraper as such 78 | ## and link to a page with a description of its purpose: 79 | version = ("Translation scraper created by Matt Menzenski. " 80 | "See www.menzenski.com/scraper for more information.") 81 | 82 | class RowInTheLexicon(object): 83 | """One line in the input file. 84 | 85 | 86 | Expects a file composed of lines with a numeral followed by 87 | either nothing or one or more Mandarin terms. 
The following are 88 | all valid lines: 89 | 90 | 104,提醒;警告 91 | 105,收集 92 | 106, 93 | 107, 94 | """ 95 | 96 | def __init__(self, mandarin_cell): 97 | """Initialize a new row object.""" 98 | self.mandarin_cell = mandarin_cell 99 | self.index = 0 100 | self.whole = '' 101 | self.searchable = [] 102 | self.all_english = [] 103 | self.english = '' 104 | self.best_translation = '' 105 | 106 | 107 | def get_lexicon_info(self, mandarin_cell): 108 | # TODO: Change name (reserve 'get' prefix for actual getters) 109 | """Split row into index (int) and Mandarin glosses (list).""" 110 | try: 111 | mysplit = mandarin_cell.split(",", 1) 112 | self.index = int(mysplit[0]) 113 | self.whole = mysplit[1].replace("\n", "") 114 | except ValueError: 115 | pass 116 | else: 117 | pass 118 | 119 | #tokens = re.split('[;,();, ]', self.whole) 120 | punct = [";", ",", "(", ")", ";", ",", "(", ")", '"', "'"] 121 | stripped = self.whole 122 | for i in punct: 123 | stripped = stripped.replace(i, " ") 124 | 125 | tokens = stripped.split(" ") 126 | 127 | self.searchable.append(self.whole) 128 | self.searchable.append(stripped) 129 | for token in tokens: 130 | if token != '': 131 | self.searchable.append(token) 132 | 133 | #while True: 134 | for _ in range(1,3): 135 | if len(self.searchable) >= 2: 136 | if self.searchable[0] == self.searchable[1]: 137 | self.searchable = self.searchable[1:] 138 | 139 | class WiktionaryEntry(object): 140 | """Entry on en.wiktionary.org for a Mandarin term.""" 141 | 142 | def __init__(self, mandarin_term): 143 | """Initialize an object for a Mandarin term.""" 144 | self.mandarin_term = mandarin_term 145 | self.english = [] 146 | self.english_str = ', '.join(self.english) 147 | self.english_short = '' 148 | self.address = "http://en.wiktionary.org/wiki/" + mandarin_term 149 | 150 | def check_page(self): 151 | """Load the actual wiktionary page for a term if it exists.""" 152 | try: 153 | #html = urllib.urlopen(self.address).read() 154 | myopener = MyOpener() 155 | html = myopener.open(self.address).read() 156 | soup = Soup(html) 157 | 158 | try: 159 | box = soup.find( 160 | "table", 161 | style=("border:1px solid #797979; margin-left: 1px; " 162 | "text-align:left; width:76%")) 163 | links = box.find_all("a") 164 | 165 | if len(links) > 0: 166 | self.address = "http://en.wiktionary.org" + links[0].get( 167 | "href") 168 | if self.address.endswith("#Chinese"): 169 | self.address = self.address[:-8] 170 | else: 171 | pass 172 | 173 | except AttributeError: 174 | pass 175 | else: 176 | pass 177 | 178 | except urllib2.HTTPError, e: 179 | print e.code 180 | 181 | except urllib2.URLError, e: 182 | print e.code 183 | 184 | else: 185 | pass 186 | 187 | def get_translation(self): 188 | """Find the translation of a Mandarin term from Wiktionary.""" 189 | try: 190 | #html = urllib.urlopen(self.address).read() 191 | myopener = MyOpener() 192 | html = myopener.open(self.address).read() 193 | soup = Soup(html) 194 | 195 | try: 196 | heading = soup.find( 197 | "span", {"class": "mw-headline", 198 | "id": ["Chinese", "Mandarin"]}) 199 | 200 | definition = heading.find_next("ol") 201 | 202 | new_def = definition.li 203 | 204 | self.english.append(new_def.text.split("\n")[0]) 205 | self.english_short = new_def.text.split( 206 | "\n")[0].replace(";", ",") 207 | 208 | if new_def.next_sibling.next_sibling: 209 | while True: 210 | newer_def = new_def.next_sibling.next_sibling 211 | self.english.append(newer_def.text.split("\n")[0]) 212 | 213 | new_def = newer_def 214 | 215 | except AttributeError: 216 | pass 217 
| else:
218 |                 pass
219 | 
220 |         except urllib2.HTTPError, e:
221 |             print e.code
222 | 
223 |         except urllib2.URLError, e:
224 |             print e.code
225 | 
226 |         else:
227 |             pass
228 | 
229 | 
230 | def main():
231 |     global pages_crawled
232 |     with codecs.open(results_file, "a", encoding="utf-8") as stream:
233 |         with codecs.open(input_file, mode="r", encoding="utf-8") as myitems:
234 |             items = myitems.readlines()
235 |             for item in items:
236 |                 if item.startswith("Index"):
237 |                     pass
238 |                 else:
239 |                     myrow = RowInTheLexicon(item.encode('utf-8'))
240 |                     myrow.get_lexicon_info(item.encode('utf-8'))
241 | 
242 |                     stream.write("\n%s;" % str(myrow.index))
243 | 
244 |                     for term in myrow.searchable:
245 | 
246 |                         wiki = WiktionaryEntry(term)
247 |                         wiki.check_page()
248 |                         wiki.get_translation()
249 |                         pages_crawled += 1
250 |                         print pages_crawled
251 | 
252 |                         ## Delete some common Wiktionary entry prefixes:
253 | 
254 |                         if wiki.english_short.startswith(
255 |                                 "(Advanced Mandarin) "):
256 |                             wiki.english_short = wiki.english_short[20:]
257 | 
258 |                         if wiki.english_short.startswith(
259 |                                 "(Elementary Mandarin) "):
260 |                             wiki.english_short = wiki.english_short[22:]
261 | 
262 |                         if wiki.english_short.startswith(
263 |                                 "(Beginning Mandarin) "):
264 |                             wiki.english_short = wiki.english_short[21:]
265 | 
266 |                         if wiki.english_short.startswith(u"† "):
267 |                             wiki.english_short = wiki.english_short[2:] + \
268 |                                 " [obsolete]"
269 | 
270 |                         if wiki.english_short != '':
271 |                             if not wiki.english_short.startswith(
272 |                                     "This entry needs a definition. " \
273 |                                     "Please add one, then remove"):
274 |                                 try:
275 |                                     stream.write("%s (%s), " % (
276 |                                         wiki.english_short.decode('utf-8'),
277 |                                         term.decode('utf-8')))
278 |                                 except UnicodeDecodeError:
279 |                                     stream.write(
280 |                                         "UnicodeDecodeError (%s)" % term.decode(
281 |                                             'utf-8'))
282 |                                 except UnicodeEncodeError:
283 |                                     stream.write(
284 |                                         "UnicodeEncodeError (%s)" % term.decode(
285 |                                             'utf-8'))
286 |                             else:
287 |                                 pass
288 |                         else:
289 |                             pass
290 | 
291 |                         ## wait a few seconds between searches--we don't
292 |                         ## want to overload the server
293 |                         delay = random.randint(0,4)
294 |                         time.sleep(delay)
295 | 
296 |                         ## longer wait after every 100 pages
297 |                         if pages_crawled % 100 == 0:
298 |                             long_delay = random.randint(11,29)
299 |                             time.sleep(long_delay)
300 | 
301 | if __name__ == "__main__":
302 |     ## counter to track the number of pages crawled
303 |     pages_crawled = 0
304 |     main()
305 | 
--------------------------------------------------------------------------------
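The scraper takes no command-line arguments: the input and output file names are hard-coded near the top of the script (`uyghurchineseitemswithindex.txt` and `wiktionaryoutput.txt`). A minimal invocation sketch, assuming a Python 2 interpreter and the BeautifulSoup package (`bs4`) that the script imports:

```
pip install beautifulsoup4
python wiktionaryscraper.py
```

Because the output file is opened in append mode, renaming or removing any previous `wiktionaryoutput.txt` before a fresh run avoids mixing old and new results.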