├── IPATranscriber
│   ├── README.md
│   └── ipatranscriber.py
├── README.md
├── .gitignore
├── LICENSE
├── UyghurTransliterator
│   ├── README.md
│   └── uyghurtransliterator.py
└── WiktionaryScraper
    ├── README.md
    └── wiktionaryscraper.py
/IPATranscriber/README.md:
--------------------------------------------------------------------------------
1 | # Uyghur IPA Transcriber
2 | 
3 | This script takes as input a list of Uyghur words in Latin orthography (one
4 | word per line) and returns a list of those words and their broad phonemic
5 | transcriptions (one word and its transcription per line).
6 | 
7 | Thus the input:
8 | 
9 | ```
10 | yéziliq
11 | zeple-
12 | yashliq
13 | a'ile
14 | ```
15 | 
16 | Returns the output:
17 | 
18 | ```
19 | yéziliq;jeziliqʰ
20 | zeple-;zɛplɛ-
21 | yashliq;jaʃliqʰ
22 | a'ile;aˀilɛ
23 | ```
24 | 
25 | The IPA output is largely a one-to-one substitution of graphs or digraphs for
26 | their phonemic values, with the exception that aspiration is blocked before consonants.
27 | Future versions of this script will take other orthographies (i.e., Uyghur
28 | Perso-Arabic script and Uyghur Cyrillic) as input.
29 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Uyghur-resources
2 | 
3 | This repository contains a collection of resources for Uyghur linguistics,
4 | mostly constructed to assist in the initial data-gathering and data-organizing
5 | stages of a larger project, the goal of which is the construction of an
6 | automated speech-recognition system for the Uyghur language.
7 | 
8 | * **IPATranscriber** contains a Python script which outputs broad phonemic
9 | transcriptions (in the International Phonetic Alphabet) of Uyghur words in
10 | Latin orthography.
11 | 
12 | * **UyghurTransliterator** contains a Python script which transliterates an
13 | input file (in Uyghur) from one writing system to another. Nine writing
14 | systems are currently supported.
15 | 
16 | * **WiktionaryScraper** contains a Python script which fetches English
17 | translations of Mandarin words from [Wiktionary](https://en.wiktionary.org).
18 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # miscellaneous input and output files
2 | *.txt
3 | *.xml
4 | *.ods
5 | *.xls
6 | 
7 | # Byte-compiled / optimized / DLL files
8 | __pycache__/
9 | *.py[cod]
10 | 
11 | # C extensions
12 | *.so
13 | 
14 | # Distribution / packaging
15 | .Python
16 | env/
17 | build/
18 | develop-eggs/
19 | dist/
20 | downloads/
21 | eggs/
22 | .eggs/
23 | lib/
24 | lib64/
25 | parts/
26 | sdist/
27 | var/
28 | *.egg-info/
29 | .installed.cfg
30 | *.egg
31 | 
32 | # PyInstaller
33 | # Usually these files are written by a python script from a template
34 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
35 | *.manifest
36 | *.spec
37 | 
38 | # Installer logs
39 | pip-log.txt
40 | pip-delete-this-directory.txt
41 | 
42 | # Unit test / coverage reports
43 | htmlcov/
44 | .tox/
45 | .coverage
46 | .coverage.*
47 | .cache
48 | nosetests.xml
49 | coverage.xml
50 | *.cover
51 | 
52 | # Translations
53 | *.mo
54 | *.pot
55 | 
56 | # Django stuff:
57 | *.log
58 | 
59 | # Sphinx documentation
60 | docs/_build/
61 | 
62 | # PyBuilder
63 | target/
64 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) 2015 Matt Menzenski
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
23 | 
--------------------------------------------------------------------------------
/UyghurTransliterator/README.md:
--------------------------------------------------------------------------------
1 | # Uyghur Transliterator
2 | 
3 | This script transliterates an input file (in Uyghur) from one writing system
4 | to another; nine writing systems are currently supported (they are listed
5 | under "Supported orthographies" below).
6 | 
7 | ## Usage
8 | 
9 | The script may be imported as a module, but was written to be called from the
10 | command line. The template is `python uyghurtransliterator.py inputfilename.txt inputOrthography outputOrthography (outputfilename.txt)`, where `inputfilename.txt` is the name of the input file, `inputOrthography` and
11 | `outputOrthography` are, respectively, the name of the orthography used in the
12 | input file and the desired output orthography, and `outputfilename.txt` is the
13 | name of the output file. The name of the output file is optional: if one is not
14 | supplied, the input file will be overwritten with the results of the
15 | transliteration.
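For example, to convert a word list from Uyghur Latin orthography to IPA while keeping the original file intact, a call might look like this (the file names here are placeholders, not files shipped with the repository):

```
python uyghurtransliterator.py wordlist.txt UyLatin IPA wordlist_ipa.txt
```

`UyLatin` and `IPA` are two of the orthography names listed in the next section; omitting the final argument would overwrite `wordlist.txt` in place.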
16 | 17 | ## Supported orthographies 18 | 19 | * `IPA` -- the International Phonetic Alphabet 20 | * `UyArabic` -- Uyghur Arabic 21 | * `UyLatin` -- Uyghur Latin 22 | * `UyCyrillic` -- Uyghur Cyrillic 23 | * `ChineseLatin` 24 | * `MengesLatin` 25 | * `JarringLatin` 26 | * `JarringArabic` 27 | * `MalovLatin` 28 | -------------------------------------------------------------------------------- /WiktionaryScraper/README.md: -------------------------------------------------------------------------------- 1 | # Wiktionary Scraper 2 | 3 | I have access to a Uyghur lexicon with 33,810 headwords, but only about a fifth 4 | of those entries have a translation provided in English. Most entries are 5 | glossed in Mandarin instead. I don't know Mandarin, and needed a way to get 6 | high-quality English translations of those Mandarin glosses. 7 | 8 | To that end, I threw together the script `wiktionaryscraper.py`, which does two 9 | things: 10 | 11 | 1. Fetches English translations (from [Wiktionary](https://en.wiktionary.org)) 12 | of those Mandarin glosses 13 | 2. Preserves the original order of the input (both the order of multiple 14 | Mandarin glosses in one line as well as the order of lines are respected) 15 | 16 | Each line of the input file takes the form `[numeral],[Mandarin gloss(es)]`, 17 | with the numeral and comma mandatory and the Mandarin gloss(es) optional. The 18 | following lines are all valid input: 19 | 20 | ``` 21 | 4745,小心;轻轻地 22 | 4746,响应 23 | 4747, 24 | 4748, 25 | ``` 26 | 27 | The output is formatted in much the same way, only with a semicolon rather than 28 | a comma separating the numeral and the gloss (for easier copy-pasting into a 29 | spreadsheet). The output corresponding to the above input example looks like 30 | this: 31 | 32 | ``` 33 | 4745;careful (小心), 34 | 4746;to respond, to answer (响应), 35 | 4747; 36 | 4748; 37 | ``` 38 | 39 | Note that only the first of the two Mandarin glosses in line 4745 has an entry 40 | on Wiktionary, so only that gloss returns an English translation in the output. 41 | The Mandarin gloss follows its English translation in parentheses. 42 | -------------------------------------------------------------------------------- /IPATranscriber/ipatranscriber.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | ########## 5 | ## ipatranscriber.py Version 1.0 (2015-07-21) 6 | ## 7 | ## Original author: Matthew Menzenski (menzenski@ku.edu) 8 | ## 9 | ## License: MIT ( http://opensource.org/licenses/MIT ) 10 | ## 11 | ## 12 | ### The MIT License (MIT) 13 | ### 14 | ### Copyright (c) 2015 Matt Menzenski 15 | ### 16 | ### Permission is hereby granted, free of charge, to any person obtaining a 17 | ### copy of this software and associated documentation files (the "Software"), 18 | ### to deal in the Software without restriction, including without limitation 19 | ### the rights to use, copy, modify, merge, publish, distribute, sublicense, 20 | ### and/or sell copies of the Software, and to permit persons to whom the 21 | ### Software is furnished to do so, subject to the following conditions: 22 | ### 23 | ### The above copyright notice and this permission notice shall be included in 24 | ### all copies or substantial portions of the Software. 
25 | ###
26 | ### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
27 | ### OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
28 | ### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
29 | ### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
30 | ### LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
31 | ### FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
32 | ### DEALINGS IN THE SOFTWARE.
33 | ##
34 | ##########
35 | 
36 | """
37 | Take a list of Uyghur words and add broad IPA transcription.
38 | 
39 | The input:
40 | 
41 | yéziliq
42 | zeple-
43 | yashliq
44 | a'ile
45 | 
46 | Returns the output:
47 | 
48 | yéziliq;jeziliqʰ
49 | zeple-;zɛplɛ-
50 | yashliq;jaʃliqʰ
51 | a'ile;aˀilɛ
52 | 
53 | The IPA output is largely a one-to-one substitution of graphs or digraphs for
54 | their phonemic values, with the exception that aspiration is blocked before consonants.
55 | """
56 | 
57 | from __future__ import unicode_literals
58 | import codecs
59 | 
60 | ## our Uyghur word list
61 | input_file = "uyghuritems.txt"
62 | 
63 | ## orthography/IPA pairs in which one or both members are digraphs
64 | uyghur_multiples = {
65 |     "ch": "ʧʰ",
66 |     "gh": "ɣ",
67 |     "ng": "ŋ",
68 |     "sh": "ʃ",
69 |     "zh": "ʒ",
70 |     "p": "pʰ",
71 |     "t": "tʰ",
72 |     "q": "qʰ",
73 |     "k": "kʰ",
74 |     ",": " | ",
75 |     ".": " | ",
76 |     ":": " | ",
77 |     ";": " | ",
78 |     "?": " | ",
79 |     "!": " | "
80 | }
81 | 
82 | ## orthography/IPA pairs in which both members are a single character
83 | ## (pairs in which IPA and orthography are equal don't need to be replaced.)
84 | uyghur_singles = {
85 |     "e": "ɛ",
86 |     "é": "e",
87 |     "x": "χ",
88 |     "j": "ʤ",
89 |     "ö": "ø",
90 |     "‘": "ˀ",
91 |     "'": "ˀ"
92 | }
93 | 
94 | ## y and ü get their own replacement steps, applied after the others, since the
95 | ## output of each is the input of another replacement (otherwise, we'd get ʤ as output when we want j)
96 | uyghur_y = {
97 |     "y": "j"
98 | }
99 | 
100 | uyghur_u = {
101 |     "ü": "y"
102 | }
103 | 
104 | def uyghur_latin_to_ipa(word):
105 |     """Return broad IPA transcription of a Uyghur word in Latin orthography."""
106 | 
107 |     ## new_word will be the output; start by setting it equal to the input
108 |     new_word = word
109 | 
110 |     ## first replace the digraphs
111 |     for char in uyghur_multiples.keys():
112 |         new_word = new_word.replace(char, uyghur_multiples[char])
113 | 
114 |     ## then the single characters
115 |     for char in uyghur_singles.keys():
116 |         new_word = new_word.replace(char, uyghur_singles[char])
117 | 
118 |     ## then y
119 |     for char in uyghur_y.keys():
120 |         new_word = new_word.replace(char, uyghur_y[char])
121 | 
122 |     ## then ü
123 |     for char in uyghur_u.keys():
124 |         new_word = new_word.replace(char, uyghur_u[char])
125 | 
126 |     ## list of Uyghur consonants (from "uigCLpixzd2ipa.xsl")
127 |     consonants = [
128 |         "b", "d", "g", "ɣ", "h", "χ", "ʤ", "k", "q", "l", "ɫ", "m", "n",
129 |         "ŋ", "p", "r", "s", "ʃ", "t", "ʧ", "w", "j", "z", "ʒ"
130 |     ]
131 | 
132 |     ## create a list of aspiration ("ʰ") + consonant sequences
133 |     sequences = ["ʰ" + consonant for consonant in consonants]
134 | 
135 |     ## replace sequences of ʰ + consonant with the plain consonant
136 |     for sequence in sequences:
137 |         new_word = new_word.replace(sequence, sequence[1:])
138 | 
139 |     ## output the word in IPA transcription
140 |     return new_word
141 | 
142 | def main():
143 |     ## open input file in read-only mode with utf-8 encoding
144 |     with codecs.open(input_file, mode='r', encoding='utf-8') as words:
145 |         for line in words:
146 |             ## replace spaces with an underscore---we'll undo this later
147 |             ## (lines get split on whitespace, so this keeps entries together)
148 |             line = line.replace(" ", "_")
149 |             for word in line.split():
150 |                 myipa = uyghur_latin_to_ipa(word)
151 |                 ## output with a semicolon between for straightforward
152 |                 ## pasting into a spreadsheet
153 |                 print "{};{}".format(
154 |                     word.replace("_"," "), myipa.replace("_", " "))
155 | 
156 | ## run the main() function if this script is called as a standalone script
157 | ## (but if imported, e.g., using the call
158 | ## from ipatranscriber import uyghur_latin_to_ipa
159 | ## the main() function won't run)
160 | if __name__ == "__main__":
161 |     main()
162 | 
--------------------------------------------------------------------------------
/UyghurTransliterator/uyghurtransliterator.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | ##########
5 | ## uyghurtransliterator.py Version 0.2 (2015-11-09)
6 | ##
7 | ## Original author: Matthew Menzenski (menzenski@ku.edu)
8 | ##
9 | ## License: MIT ( http://opensource.org/licenses/MIT )
10 | ##
11 | ##
12 | ### The MIT License (MIT)
13 | ###
14 | ### Copyright (c) 2015 Matt Menzenski
15 | ###
16 | ### Permission is hereby granted, free of charge, to any person obtaining a
17 | ### copy of this software and associated documentation files (the "Software"),
18 | ### to deal in the Software without restriction, including without limitation
19 | ### the rights to use, copy, modify, merge, publish, distribute, sublicense,
20 | ### and/or sell copies of the Software, and to permit persons to whom the
21 | ### Software is furnished to do so, subject to the following conditions:
22 | ###
23 | ### The above copyright notice and this permission notice shall be included in
24 | ### all copies or substantial portions of the Software.
25 | ###
26 | ### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
27 | ### OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
28 | ### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL 29 | ### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 30 | ### LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 31 | ### FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 32 | ### DEALINGS IN THE SOFTWARE. 33 | ## 34 | ########## 35 | 36 | """Convert a Uyghur text between different orthographies.""" 37 | 38 | from __future__ import print_function 39 | 40 | import codecs 41 | import sys 42 | 43 | def to_unicode_or_bust(obj, encoding='utf-8'): 44 | """Ensure that an object is unicode.""" 45 | # function by Kuman McMillan ( http://farmdev.com/talks/unicode ) 46 | if isinstance(obj, basestring): 47 | if not isinstance(obj, unicode): 48 | obj = unicode(obj, encoding) 49 | return obj 50 | 51 | class UyghurString(object): 52 | """String object containing text in the Uyghur language.""" 53 | 54 | def __init__(self, input_text, input_orth): 55 | """Initialize text object. 56 | 57 | Parameters 58 | --------- 59 | input_text (str): string containing Uyghur text 60 | input_orth (str): input orthography --- must be one of these: 61 | 'IPA', 'UyArabic', 'UyLatin', 'UyCyrillic', 'ChineseLatin', 62 | 'MengesLatin', 'JarringLatin', 'JarringArabic', 'MalovLatin' 63 | """ 64 | self.input_text = input_text 65 | self.input_orth = input_orth 66 | self.orth_key = { 67 | 'IPA': 0, 68 | 'UyArabic': 1, 69 | 'UyLatin': 2, 70 | 'UyCyrillic': 3, 71 | 'ChineseLatin': 4, 72 | 'MengesLatin': 5, 73 | 'JarringLatin': 6, 74 | 'JarringArabic': 7, 75 | 'MalovLatin': 8 76 | } 77 | self.uyghur_orthographies = ( 78 | (u'a', u'\u0627', u'a', u'а', u'a', u'a', u'a', 7, u'а'), 79 | (u'ɑ', u'\u0627', u'a', u'а', u'a', u'á', u'a', 7, 8), 80 | (u'aː', u'\u0627', u'a', u'а', u'a', u'ā', u'aː', u'\u0627', 8), 81 | (u'ɛ', u'\u06D5', u'e', u'е', u'e', u'ä', u'ɛ', u'\u06D5', u'ӓ'), 82 | (u'æ', u'\u06D5', u'e', u'е', u'e', u'ä', u'æ', u'\u06D5', 8), 83 | (u'b', u'\u0628', u'b', u'б', u'b', u'b', u'b', u'\u0628', u'б'), 84 | (u'd', u'\u062F', u'd', u'д', u'd', u'd', u'd', u'\u062F', u'д'), 85 | (u'e', u'\u06D0', u'ë', u'е', u'e', u'e', u'e', 7, u'е'), 86 | (u'f', u'\u0641', u'f', u'ф', u'f', u'f', u'f', u'\u0641', 8), 87 | (u'ɡ', u'\u06AF', u'g', u'г', u'g', u'g', u'g', u'\u06AF', u'г'), 88 | (u'ɣ', u'\u063A', u'gh', u'ғ', u'ƣ', u'ɣ', u'ɣ', u'\u063A', u'ҕ'), 89 | (u'h', u'\u0647', u'h', u'һ', u'ħ', u'h', u'h', u'\u0647', 8), 90 | (u'χ', u'\u062E', u'x', u'х', u'h', u'x', u'χ', u'\u062E', u'х'), 91 | (u'i', u'\u0649', u'i', u'и', u'i', u'i', u'i', u'\u0649', u'i'), 92 | (u'ɨ', u'\u0649', u'i', u'и', u'i', u'i', u'ï', u'\u0649', u'ы'), 93 | (u'dʒ', u'\u062C', u'j', u'ж', u'j', u'dž', u'dʒ', u'\u062C', u'з'), 94 | (u'kʰ', u'\u0643', u'k', u'k', u'k', u'k', u'k', u'\u0643', u'k'), 95 | (u'qʰ', u'\u0642', u'q', u'к', u'ḳ', u'q', u'q', u'\u0642', u'к'), 96 | (u'l', u'\u0644', u'l', u'л', u'l', u'l', u'l', u'\u0644', u'л'), 97 | (u'ł', u'\u0644', u'l', u'л', u'l', u'ł', u'l', u'\u0644', u'l'), 98 | (u'm', u'\u0645', u'm', u'м', u'm', u'm', u'm', u'\u0645', u'м'), 99 | (u'n', u'\u0646', u'n', u'н', u'n', u'n', u'n', u'\u0646', u'н'), 100 | (u'ŋ', u'\u06AD', u'ng', u'ң', u'ng', u'ñ', u'ŋ', u'\u06AD', u'ң'), 101 | (u'o', u'\u0648', u'o', u'о', u'o', u'o', u'o', u'\u0648', u'о'), 102 | (u'ø', u'\u06C6', u'ö', u'ө', u'ɵ', u'ö', u'ö', u'\u0648', u'ӧ'), 103 | (u'pʰ', u'\u067E', u'p', u'п', u'p', u'p', u'p', u'\u067E', u'п'), 104 | (u'r', u'\u0631', u'r', u'р', u'r', u'r', u'r', u'\u0631', u'р'), 105 | (u's', u'\u0633', u's', u'с', u's', u's', u's', u'\u0633', u'с'), 
106 | (u'ʃ', u'\u0634', u'sh', u'ш', u'x', u'š', u'š', u'\u0634', u'ш'), 107 | (u'tʰ', u'\u062A', u't', u'т', u't', u't', u't', u'\u062A', u'т'), 108 | (u'tʃʰ', u'\u0686', u'ch', u'ч', u'q', u'č', u'č', u'\u0686', u'ч'), 109 | (u'u', u'\u06C7', u'u', u'у', u'u', u'u', u'u', u'\u0648', u'у'), 110 | (u'ɯ', u'\u06C7', u'u', u'у', u'u', u'ŏ', u'ɯ', u'\u0648', 8), 111 | (u'ʏ', u'\u06C7', u'u', u'у', u'u', u'ů', u'ů', u'\u0648', 8), 112 | (u'y', u'\u06C8', u'ü', u'ү', u'ü', u'ü', u'ů', u'\u06C8', 8), 113 | (u'yː', u'\u06C8', u'ü', u'ү', u'ü', u'ṻ', u'ůː', u'\u06C8', u'ӱ'), 114 | (u'ŭ', u'\u06C7', u'u', u'у', u'u', u'u', u'ŭ', u'\u06C8', 8), 115 | (u'w', u'\u06CB', u'w', u'в', u'w', u'w', u'v', u'\u06CB', u'в'), 116 | (u'j', u'\u064A', u'y', u'й', u'y', u'j', u'j', 7, u'ĭ'), 117 | (u'z', u'\u0632', u'z', u'з', u'z', u'z', u'z', u'\u0632', u'z'), 118 | (u'ʒ', u'\u0698', u'zh', u'ж', u'zh', u'ž', 6, 7, u'з'), 119 | (u'ʔ', u'\u0621', u"'", 3, u"'", u"'", u"'", 7, 8), 120 | (0, u'\u0626', u'', 3, 4, 5, 6, 7, 8), 121 | (0, u'\u06BE', u'h', 3, 4, 5, 6, 7, 8), 122 | ) 123 | 124 | def as_string(self): 125 | """Read the input file's contents into a string.""" 126 | with codecs.open(self.input_text, 'r+', encoding='utf-8') as f: 127 | return to_unicode_or_bust(f.read().replace(u'\n', u'')) 128 | 129 | def transliterate(self, output_orth, input_string=None): 130 | """Transliterate text to specified output orthography. 131 | 132 | Parameters 133 | --------- 134 | output_orth (str): output orthography --- must be one of these: 135 | 'IPA', 'UyArabic', 'UyLatin', 'UyCyrillic', 'ChineseLatin', 136 | 'MengesLatin', 'JarringLatin', 'JarringArabic', 'MalovLatin' 137 | """ 138 | if input_string == None: 139 | input_string = self.as_string() 140 | 141 | idx_c = self.orth_key[self.input_orth] 142 | idx_d = self.orth_key[output_orth] 143 | 144 | ## TODO: fix this case handling. This should give correct output for 145 | ## the three orthographies that don't distinguish case, but it won't 146 | ## transliterate an upper-case letter to another upper-case letter. 147 | ## TODO: make an upper-case version of the orthography dict? 148 | ## e.g., something like: 149 | ## try: 150 | ## upper_case = tup[idx_c].upper() ## for every entry? 
151 |         caseless_orths = ['IPA', 'UyArabic', 'JarringArabic']
152 |         if output_orth in caseless_orths:
153 |             text_in = input_string.lower()
154 |             text_out = input_string.lower()
155 |         elif output_orth not in caseless_orths:
156 |             text_in = input_string
157 |             text_out = input_string
158 |         ## TODO: Make the above more elegant
159 | 
160 |         for tup in self.uyghur_orthographies:
161 |             input_char = tup[idx_c]
162 |             output_char = tup[idx_d]
163 | 
164 |             if isinstance(input_char, int) or isinstance(output_char, int):
165 |                 pass
166 |             else:
167 |                 text_out = text_out.replace(input_char, output_char)
168 | 
169 |         return text_out
170 | 
171 | def main(input_file, input_orth, output_orth, output_file=None):
172 |     """Convert file contents from one orthography to another."""
173 |     if output_file is None:
174 |         output_file = input_file
175 |     uy = UyghurString(input_file, input_orth)
176 |     with codecs.open(output_file, 'w+', encoding='utf-8') as stream:
177 |         stream.write(uy.transliterate(output_orth))
178 | 
179 | if __name__ == "__main__":
180 |     if len(sys.argv) == 5:
181 |         in_file = sys.argv[1] # input filename
182 |         in_orth = sys.argv[2] # input orthography
183 |         out_orth = sys.argv[3] # output orthography
184 |         out_file = sys.argv[4] # output filename
185 | 
186 |         main(in_file, in_orth, out_orth, out_file)
187 | 
188 |     elif len(sys.argv) == 4:
189 |         in_file = sys.argv[1] # input filename
190 |         in_orth = sys.argv[2] # input orthography
191 |         out_orth = sys.argv[3] # output orthography
192 | 
193 |         main(in_file, in_orth, out_orth)
194 | 
195 |     else:
196 |         print("\nUSAGE:\n\tpython uyghurtransliterator.py "
197 |               "inputfilename.txt inputOrthography outputOrthography "
198 |               "(outputfilename.txt)\n\n"
199 |               "inputOrthography and outputOrthography must be one of these:\n"
200 |               "\t'IPA', 'UyArabic', 'UyLatin', 'UyCyrillic', 'ChineseLatin'\n"
201 |               "\t'MengesLatin', 'JarringLatin', 'JarringArabic', 'MalovLatin'\n")
202 | 
--------------------------------------------------------------------------------
/WiktionaryScraper/wiktionaryscraper.py:
--------------------------------------------------------------------------------
1 | #! /usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | ##########
5 | ## wiktionaryscraper.py Version 1.0 (2015-07-20)
6 | ##
7 | ## Original author: Matthew Menzenski (menzenski@ku.edu)
8 | ##
9 | ## License: MIT ( http://opensource.org/licenses/MIT )
10 | ##
11 | ##
12 | ### The MIT License (MIT)
13 | ###
14 | ### Copyright (c) 2015 Matt Menzenski
15 | ###
16 | ### Permission is hereby granted, free of charge, to any person obtaining a
17 | ### copy of this software and associated documentation files (the "Software"),
18 | ### to deal in the Software without restriction, including without limitation
19 | ### the rights to use, copy, modify, merge, publish, distribute, sublicense,
20 | ### and/or sell copies of the Software, and to permit persons to whom the
21 | ### Software is furnished to do so, subject to the following conditions:
22 | ###
23 | ### The above copyright notice and this permission notice shall be included in
24 | ### all copies or substantial portions of the Software.
25 | ###
26 | ### THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
27 | ### OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
28 | ### FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL 29 | ### THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 30 | ### LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 31 | ### FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 32 | ### DEALINGS IN THE SOFTWARE. 33 | ## 34 | ########## 35 | 36 | """ 37 | Get English translation from Wiktionary of each Mandarin word in a list. 38 | 39 | Take as input a text file in which each line contains a numeral and one 40 | or more Mandarin terms, separated by a comma, and return a text file in 41 | which each line contains a numeral (in the same order as in the input 42 | file) followed by a translation of each Mandarin term in that line (if 43 | a Wiktionary page exists for that term). 44 | 45 | E.g., the line 46 | 47 | 48,生气,发怒 48 | 49 | in the input yields 50 | 51 | 48;angry (生气), (literary) to become angry (发怒), 52 | 53 | in the output. 54 | """ 55 | 56 | #from __future__ import unicode_literals 57 | from bs4 import BeautifulSoup as Soup 58 | from urllib import FancyURLopener 59 | import urllib2 60 | import codecs 61 | import time 62 | import random 63 | 64 | input_file = "uyghurchineseitemswithindex.txt" 65 | 66 | results_file = "wiktionaryoutput.txt" 67 | 68 | 69 | class MyOpener(FancyURLopener): 70 | """FancyURLopener object with custom User-Agent field.""" 71 | 72 | ## regular Mac Safari browser: 73 | #version = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) " 74 | # "AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 " 75 | # "Safari/600.5.17") 76 | 77 | ## identify this web scraper as such 78 | ## and link to a page with a description of its purpose: 79 | version = ("Translation scraper created by Matt Menzenski. " 80 | "See www.menzenski.com/scraper for more information.") 81 | 82 | class RowInTheLexicon(object): 83 | """One line in the input file. 84 | 85 | 86 | Expects a file composed of lines with a numeral followed by 87 | either nothing or one or more Mandarin terms. 
The following are 88 | all valid lines: 89 | 90 | 104,提醒;警告 91 | 105,收集 92 | 106, 93 | 107, 94 | """ 95 | 96 | def __init__(self, mandarin_cell): 97 | """Initialize a new row object.""" 98 | self.mandarin_cell = mandarin_cell 99 | self.index = 0 100 | self.whole = '' 101 | self.searchable = [] 102 | self.all_english = [] 103 | self.english = '' 104 | self.best_translation = '' 105 | 106 | 107 | def get_lexicon_info(self, mandarin_cell): 108 | # TODO: Change name (reserve 'get' prefix for actual getters) 109 | """Split row into index (int) and Mandarin glosses (list).""" 110 | try: 111 | mysplit = mandarin_cell.split(",", 1) 112 | self.index = int(mysplit[0]) 113 | self.whole = mysplit[1].replace("\n", "") 114 | except ValueError: 115 | pass 116 | else: 117 | pass 118 | 119 | #tokens = re.split('[;,();, ]', self.whole) 120 | punct = [";", ",", "(", ")", ";", ",", "(", ")", '"', "'"] 121 | stripped = self.whole 122 | for i in punct: 123 | stripped = stripped.replace(i, " ") 124 | 125 | tokens = stripped.split(" ") 126 | 127 | self.searchable.append(self.whole) 128 | self.searchable.append(stripped) 129 | for token in tokens: 130 | if token != '': 131 | self.searchable.append(token) 132 | 133 | #while True: 134 | for _ in range(1,3): 135 | if len(self.searchable) >= 2: 136 | if self.searchable[0] == self.searchable[1]: 137 | self.searchable = self.searchable[1:] 138 | 139 | class WiktionaryEntry(object): 140 | """Entry on en.wiktionary.org for a Mandarin term.""" 141 | 142 | def __init__(self, mandarin_term): 143 | """Initialize an object for a Mandarin term.""" 144 | self.mandarin_term = mandarin_term 145 | self.english = [] 146 | self.english_str = ', '.join(self.english) 147 | self.english_short = '' 148 | self.address = "http://en.wiktionary.org/wiki/" + mandarin_term 149 | 150 | def check_page(self): 151 | """Load the actual wiktionary page for a term if it exists.""" 152 | try: 153 | #html = urllib.urlopen(self.address).read() 154 | myopener = MyOpener() 155 | html = myopener.open(self.address).read() 156 | soup = Soup(html) 157 | 158 | try: 159 | box = soup.find( 160 | "table", 161 | style=("border:1px solid #797979; margin-left: 1px; " 162 | "text-align:left; width:76%")) 163 | links = box.find_all("a") 164 | 165 | if len(links) > 0: 166 | self.address = "http://en.wiktionary.org" + links[0].get( 167 | "href") 168 | if self.address.endswith("#Chinese"): 169 | self.address = self.address[:-8] 170 | else: 171 | pass 172 | 173 | except AttributeError: 174 | pass 175 | else: 176 | pass 177 | 178 | except urllib2.HTTPError, e: 179 | print e.code 180 | 181 | except urllib2.URLError, e: 182 | print e.code 183 | 184 | else: 185 | pass 186 | 187 | def get_translation(self): 188 | """Find the translation of a Mandarin term from Wiktionary.""" 189 | try: 190 | #html = urllib.urlopen(self.address).read() 191 | myopener = MyOpener() 192 | html = myopener.open(self.address).read() 193 | soup = Soup(html) 194 | 195 | try: 196 | heading = soup.find( 197 | "span", {"class": "mw-headline", 198 | "id": ["Chinese", "Mandarin"]}) 199 | 200 | definition = heading.find_next("ol") 201 | 202 | new_def = definition.li 203 | 204 | self.english.append(new_def.text.split("\n")[0]) 205 | self.english_short = new_def.text.split( 206 | "\n")[0].replace(";", ",") 207 | 208 | if new_def.next_sibling.next_sibling: 209 | while True: 210 | newer_def = new_def.next_sibling.next_sibling 211 | self.english.append(newer_def.text.split("\n")[0]) 212 | 213 | new_def = newer_def 214 | 215 | except AttributeError: 216 | pass 217 
| else:
218 |                 pass
219 | 
220 |         except urllib2.HTTPError, e:
221 |             print e.code
222 | 
223 |         except urllib2.URLError, e:
224 |             print e.code
225 | 
226 |         else:
227 |             pass
228 | 
229 | 
230 | def main():
231 |     global pages_crawled
232 |     with codecs.open(results_file, "a", encoding="utf-8") as stream:
233 |         with codecs.open(input_file, mode="r", encoding="utf-8") as myitems:
234 |             items = myitems.readlines()
235 |             for item in items:
236 |                 if item.startswith("Index"):
237 |                     pass
238 |                 else:
239 |                     myrow = RowInTheLexicon(item.encode('utf-8'))
240 |                     myrow.get_lexicon_info(item.encode('utf-8'))
241 | 
242 |                     stream.write("\n%s;" % str(myrow.index))
243 | 
244 |                     for term in myrow.searchable:
245 | 
246 |                         wiki = WiktionaryEntry(term)
247 |                         wiki.check_page()
248 |                         wiki.get_translation()
249 |                         pages_crawled += 1
250 |                         print pages_crawled
251 | 
252 |                         ## Delete some common Wiktionary entry prefixes:
253 | 
254 |                         if wiki.english_short.startswith(
255 |                                 "(Advanced Mandarin) "):
256 |                             wiki.english_short = wiki.english_short[20:]
257 | 
258 |                         if wiki.english_short.startswith(
259 |                                 "(Elementary Mandarin) "):
260 |                             wiki.english_short = wiki.english_short[22:]
261 | 
262 |                         if wiki.english_short.startswith(
263 |                                 "(Beginning Mandarin) "):
264 |                             wiki.english_short = wiki.english_short[21:]
265 | 
266 |                         if wiki.english_short.startswith(u"† "):
267 |                             wiki.english_short = wiki.english_short[2:] + \
268 |                                 " [obsolete]"
269 | 
270 |                         if wiki.english_short != '':
271 |                             if not wiki.english_short.startswith(
272 |                                     "This entry needs a definition. " \
273 |                                     "Please add one, then remove"):
274 |                                 try:
275 |                                     stream.write("%s (%s), " % (
276 |                                         wiki.english_short.decode('utf-8'),
277 |                                         term.decode('utf-8')))
278 |                                 except UnicodeDecodeError:
279 |                                     stream.write(
280 |                                         "UnicodeDecodeError (%s)" % term.decode(
281 |                                             'utf-8'))
282 |                                 except UnicodeEncodeError:
283 |                                     stream.write(
284 |                                         "UnicodeEncodeError (%s)" % term.decode(
285 |                                             'utf-8'))
286 |                             else:
287 |                                 pass
288 |                         else:
289 |                             pass
290 | 
291 |                         ## wait a few seconds between searches--we don't
292 |                         ## want to overload the server
293 |                         delay = random.randint(0,4)
294 |                         time.sleep(delay)
295 | 
296 |                         ## longer wait after every 100 pages
297 |                         if pages_crawled % 100 == 0:
298 |                             long_delay = random.randint(11,29)
299 |                             time.sleep(long_delay)
300 | 
301 | if __name__ == "__main__":
302 |     ## counter to track the number of pages crawled
303 |     pages_crawled = 0
304 |     main()
305 | 
--------------------------------------------------------------------------------
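The scraper takes no command-line arguments: the input and output file names are hard-coded near the top of the script (`uyghurchineseitemswithindex.txt` and `wiktionaryoutput.txt`). A minimal invocation sketch, assuming a Python 2 interpreter and the BeautifulSoup package (`bs4`) that the script imports:

```
pip install beautifulsoup4
python wiktionaryscraper.py
```

Because the output file is opened in append mode, renaming or removing any previous `wiktionaryoutput.txt` before a fresh run avoids mixing old and new results.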