├── README.md ├── .gitignore ├── doc ├── sorting-ambiguous-syllables.md └── sorting-standard-tibetan.md ├── rules.txt ├── test.py └── LICENSE.md /README.md: -------------------------------------------------------------------------------- 1 | # Testing and improving Tibetan collation 2 | 3 | This repository provides tests and improvements of Tibetan sorting. 4 | 5 | It currently uses one backend: 6 | 7 | - the [Unicode Collation Algorithm](http://unicode.org/reports/tr10/) (UCA) 8 | 9 | 10 | To use it, you must install Python 3 and [PyICU](http://pyicu.osafoundation.org/). 11 | 12 | To run the tests, simply run `./test.py`. 13 | 14 | See [ICU doc](http://userguide.icu-project.org/collation/customization) and [Unicode doc](http://www.unicode.org/reports/tr35/tr35-collation.html#Orderings) for rule file format. 15 | 16 | ## History 17 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *,cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | -------------------------------------------------------------------------------- /doc/sorting-ambiguous-syllables.md: -------------------------------------------------------------------------------- 1 | # Sorting ambiguous syllables 2 | 3 | Tibetan has 9 ambiguous syllables where it is not possible to know what the main stack is. This is documented [here](https://github.com/eroux/tibetan-spellchecker/blob/master/doc/finding-main-stack.md). A collation algorithm should treat these syllables as their most common form, documented in the above link. 4 | 5 | A collation algorithm should give the following order: 6 | 7 | ག་ དགས་ འགས་ ང་ ད་ དངས་ གདས་ བདས་ འདས་ ན་ བ་ བགས་ དབས་ འབས་ མ་ མགས་ མངས་ དམས་ ཙ་ 8 | 9 | ## Unusual disambiguation 10 | 11 | There might be (extremely rare) cases where a user wants to treat one of these syllables differently from its usual disambiguation. There are two methods to do so: 12 | 13 | ### Specify first consonant as main 14 | 15 | Let's take the case of དམས་: in normal cases, it should be treated as having མ as its main consonant. But if you want to treat it as having ད as its main consonant, you must use Unicode character U+034F COMBINING GRAPHEME JOINER (hereafter referred to as *CGJ*) after ད. The use of this character is documented [here](http://unicode.org/reports/tr10/#Combining_Grapheme_Joiner). Any collation algorithm should handle it properly.
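As a minimal sketch of the mechanism described above (not part of this repository), the following Python snippet inserts a CGJ after the first consonant of a syllable. The helper name `mark_first_consonant_as_main` is hypothetical, and it assumes the first code point of the input is the consonant to promote. The snippet only builds the string; whether a given collator then sorts it as intended depends on its rules.

```python
# U+034F COMBINING GRAPHEME JOINER, inserted after the first consonant
# to request that this consonant be treated as the main one.
CGJ = "\u034f"


def mark_first_consonant_as_main(syllable: str) -> str:
    """Return `syllable` with a CGJ inserted after its first code point.

    Hypothetical helper for illustration only; it assumes the first
    code point of `syllable` is the consonant to promote.
    """
    if not syllable:
        return syllable
    return syllable[0] + CGJ + syllable[1:]


# The syllable from the text, written with escapes so that the invisible
# CGJ is easy to spot: U+0F51 (DA), U+0F58 (MA), U+0F66 (SA), U+0F0B (tsheg).
plain = "\u0f51\u0f58\u0f66\u0f0b"
marked = mark_first_consonant_as_main(plain)

# List the code points to make the inserted CGJ visible.
print([f"U+{ord(c):04X}" for c in marked])
```

Because the CGJ is invisible when rendered, printing the code points as above is a convenient way to check that it ended up in the right position.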
16 | 17 | A collation algorithm should respect the following order: 18 | 19 | ད་ དམ་ ད͏མས་ གདའ་ མ་ དམར་ དམས་ དམི་ 20 | 21 | (the first དམས་ contains a CGJ: དCGJམས་). 22 | 23 | ### Specify second consonnant as main 24 | 25 | The same mechanism cannot be used to specify the second consonnant as the main one, there seems to be no clear canonical mechanism to do so. What we propose here is to use Unicode character U+2060 WORD JOINER (hereafter referenced as *WJ*) after the second stack. Unlike the other solution, this is not canonical, and all collation algorithm might not supported (the rules provided in this repository should). 26 | 27 | A collation algorithm implementing this method should respect the following order: 28 | 29 | ང་ མངར་ མང⁠ས་ མངི་ མ་ མང་ མངས་ མད་ 30 | 31 | (the first མངས་ contains a WJ: མངWJས་) 32 | -------------------------------------------------------------------------------- /rules.txt: -------------------------------------------------------------------------------- 1 | # Rules for Sanskrit ordering 2 | # From Bod rgya tshig mdzod chen mo pages 9, 11, 347, 1153, 1615, 1619, 1711, 1827, 1833, 2055, 2061, 2332, 2840, 2920, 2922, 2934, 3136 and 3137 3 | # Example: ཀར་ལུགས། < ཀརྐ་ཊ། 4 | &ཀར=ཀར 5 | &ཀལ=ཀལ 6 | &ཀས=ཀས 7 | &གཉྫ=གཉྫ 8 | &ཐར=ཐར 9 | &པུས=པུས 10 | &ཕལ=ཕལ 11 | &བིལ=བིལ 12 | &མཉྫ=མཉྫ 13 | &མར=མར 14 | &ཤས=ཤས 15 | &སར=སར 16 | &ཨར=ཨར 17 | &ཨས=ཨས 18 | &ངྒྷ=ངྒྷ 19 | &ང༌ག=ངྒ 20 | &ད༌ད=དྡ 21 | &ན༌དེ=ནྡེ 22 | &མ༌བ=མྦ 23 | &ར༌པ=རྤ 24 | # Marks (seconadry different, with low equal primary weight after Lao) 25 | &ໆ<།<<༎<<༏<<༐<<༑<<༔<<༴<་=༌ 26 | &ཀ<<ྈྐ<དཀ<བཀ<རྐ<ལྐ<སྐ<བརྐ<བསྐ 27 | &ཁ<<ྈྑ<མཁ<འཁ 28 | &ག<དགག<དགང<དགད<དགན<དགབ<དགཝ<དགའ<དགར<དགལ<དགས<དགི<དགུ<དགེ<དགོ<དགྭ<དགྱ<དགྲ<བགག<བགང<བགད<བགབ<བགམ<<<བགཾ<བགཝ<བགའ 29 | <བགར<བགལ<བག⁠ས<བགི<བགུ<བགེ<བགོ<བགྭ<བགྱ<བགྲ<བགླ<མགག<མགང<མགད<མགབ<མགའ<མགར<མགལ<མག⁠ས<མགི<མགུ<མགེ<མགོ<མགྭ<མགྱ<མགྲ<འགག<འགང<འགད<འགན<འགབ<འགམ<<<འགཾ 30 | <འགའ<འགར<འགལ<འགས<འགི<འགུ<འགེ<འགོ<འགྭ<འགྱ<འགྲ<རྒ<ལྒ<སྒ<བརྒ<བསྒ 31 | 
&ང<<<ྂ<<<ྃ<དངག<དངང<དངད<དངན<དངབ<དངའ<དངར<དངལ<དང⁠ས<དངི<དངུ<དངེ<དངོ<དངྭ<མངག<མངང<མངད<མངན<མངབ<མངའ<མངར<མངལ<མང⁠ས<མངི<མངུ<མངེ<མངོ<མངྭ<རྔ<ལྔ<སྔ<བརྔ<བསྔ 32 | &ཅ<གཅ<བཅ<ལྕ<བལྕ 33 | &ཆ<མཆ<འཆ 34 | &ཇ<མཇ<འཇ<རྗ<ལྗ<བརྗ 35 | &ཉ<<ྋྙ<གཉ<མཉ<རྙ<<<ཪྙ<སྙ<བརྙ<<<བཪྙ<བསྙ 36 | &ཏ<<<ཊ<གཏ<བཏ<རྟ<ལྟ<སྟ<བརྟ<བལྟ<བསྟ 37 | &ཐ<<<ཋ<མཐ<འཐ 38 | &ད<<<ཌ<གདག<གདང<གདད<གདན<གདབ<གདམ<<<གདཾ<གདའ<གདར<གདལ<གདས<གདི<གདུ<གདེ<གདོ<གདྭ<བདག<བདང<བདད<བདབ<བདམ<<<བདཾ<བདའ 39 | <བདར<བདལ<བདས<བདི<བདུ<བདེ<བདོ<བདྭ<མདག<མདང<མདད<མདན<མདབ<མདའ<མདར<མདལ<མདས<མདི<མདུ<མདེ<མདོ<མདྭ<འདག<འདང<འདད<འདན<འདབ<འདམ<<<འདཾ 40 | <འདའ<འདར<འདལ<འདས<འདི<འདུ<འདེ<འདོ<འདྭ<འདྲ<རྡ<ལྡ<སྡ<བརྡ<བལྡ<བསྡ 41 | &ན<<<ཎ<གནག<གནང<གནད<གནན<གནབ<གནམ<<<གནཾ<གནའ<གནར<གནལ<གནས<གནི<གནུ<གནེ<གནོ<གནྭ<མནག<མནང<མནད<མནན<མནབ<མནམ<<<མནཾ<མནའ 42 | <མནར<མནལ<མནས<མནི<མནུ<མནེ<མནོ<མནྭ<རྣ<སྣ<བརྣ<བསྣ 43 | &པ<<ྉྤ<དཔག<དཔང<དཔད<དཔབ<དཔའ<དཔར<དཔལ<དཔས<དཔི<དཔུ<དཔེ<དཔོ<དཔྱ<དཔྲ<ལྤ<སྤ 44 | &ཕ<<ྉྥ<འཕ 45 | &བ<དབག<དབང<དབད<དབན<དབབ<དབའ<དབར<དབལ<དབས<དབི<དབུ<དབེ<དབོ<དབྭ<དབྱ<དབྲ<འབག<འབང<འབད<འབན<འབབ<འབམ 46 | <<<འབཾ<འབའ<འབར<འབལ<འབས<འབི<འབུ<འབེ<འབོ<འབྭ<འབྱ<འབྲ<རྦ<ལྦ<སྦ 47 | &མ<<<ཾ<དམག<དམང<དམད<དམན<དམབ<དམཝ<དམའ<དམར<དམལ<དམས<དམི<དམུ<དམེ<དམོ<དམྭ<དམྱ<དམྲ<རྨ<སྨ 48 | &ཙ<གཙ<བཙ<རྩ<སྩ<བརྩ<བསྩ 49 | &ཚ<མཚ<འཚ 50 | &ཛ<མཛ<འཛ<རྫ<བརྫ 51 | # &ཝ 52 | &ཞ<གཞ<བཞ 53 | &ཟ<གཟ<བཟ 54 | # &འ 55 | &ཡ<གཡ 56 | &ར<<<ཪ<བརླ<<<བཪླ 57 | # &ལ 58 | &ཤ<<<ཥ<གཤ<བཤ 59 | &ས<གསག<གསང<གསད<གསན<གསབ<གསའ<གསར<གསལ<གསས<གསི<གསུ<གསེ<གསོ<གསྭ<བསག<བསང<བསད<བསབ<བསམ<<<བསཾ<བསའ<བསར 60 | <བསལ<བསས<བསི<བསུ<བསེ<བསོ<བསྭ<བསྲ<བསླ 61 | &ཧ<ལྷ 62 | # &ཨ 63 | # Explicit vowels 64 | <ཱ<ི<ཱི<ྀ<ཱྀ<ུ<ཱུ<ེ<ཻ=ེེ<ོ<ཽ=ོོ 65 | # Post-radicals 66 | <ྐ<ྑ<ྒ<ྔ<ྕ<ྖ<ྗ<ྙ<ྟ<<<ྚ<ྠ<<<ྛ<ྡ<<<ྜ<ྣ<<<ྞ<ྤ<ྥ<ྦ<ྨ<ྩ<ྪ<ྫ<ྭ<<<ྺ<ྮ<ྯ<ྰ<ྱ<<<ྻ<ྲ<<<ྼ<ླ<ྴ 67 | <<<ྵ<ྶ<ྷ<ྸ 68 | # Combining marks and signs (secondary weight) 69 | &༹<<྄<<ཿ<<྅<<ྈ<<ྉ<<ྊ<<ྋ<<ྌ<<ྍ<<ྎ<<ྏ 70 | # Treatༀ, ཷand ,ཹ as decomposed 71 | &ཨོཾ=ༀ 72 | &ྲཱྀ=ཷ 73 | &ླཱྀ=ཹ 74 | # Shorthands for ག,ས 75 | &དགགས<<<དགཊ<<<དགཌ 76 | &བགགས<<<བགཊ<<<བགཌ 77 | &འགགས<<<འགཊ<<<འགཌ 78 | &དངགས<<<དངཊ<<<དངཌ 79 | &མངགས<<<མངཊ<<<མངཌ 80 | &གདགས<<<གདཊ<<<གདཌ 81 | &བདགས<<<བདཊ<<<བདཌ 82 | &མདགས<<<མདཊ<<<མདཌ 83 | 
&འདགས<<<འདཊ<<<འདཌ 84 | &གནགས<<<གནཊ<<<གནཌ 85 | &མནགས<<<མནཊ<<<མནཌ 86 | &དཔགས<<<དཔཊ<<<དཔཌ 87 | &དབགས<<<དབཊ<<<དབཌ 88 | &འབགས<<<འབཊ<<<འབཌ 89 | &དམགས<<<དམཊ<<<དམཌ 90 | &གསགས<<<གསཊ<<<གསཌ 91 | &བསགས<<<བསཊ<<<བསཌ 92 | -------------------------------------------------------------------------------- /doc/sorting-standard-tibetan.md: -------------------------------------------------------------------------------- 1 | # Standard Tibetan sorting 2 | 3 | Tibetan uses a specific sorting algorithm, described here in a human-readable form. The presented algorithm is only correct for syllables following the [standard Tibetan syllable structure](https://github.com/eroux/tibetan-spellchecker/blob/master/doc/standard-syllable-structure.md). 4 | 5 | The algorithm sorts according to 6 weights in the following order: 6 | 7 | - main consonant 8 | - superscript 9 | - prefix 10 | - subscript 11 | - vowel 12 | - suffix 13 | 14 | See below for the sorting order inside each weight. 15 | 16 | ## Examples 17 | 18 | Let's take a few examples: 19 | 20 | Comparing མགད and མག: in the first one, the main consonant is ག, in the second one, the main consonant is མ, they differ on the first weight, so we have མགད < མག due to the order of the main consonants. 21 | 22 | Comparing བསྒ and མགོ: the main consonants are the same (ག), so we compare the second weight. The first one has superscript ས, the second has no superscript, so according to the order of superscripts, བསྒ > མགོ. 23 | 24 | Comparing འགྲ and འགྱི: the main consonants are the same (ག), the superscripts are the same (no superscript), the prefixes are the same (འ), so we compare the fourth weight. The first has subscript ར, the second has subscript ཡ, so according to the order of subscripts, we have འགྱི < འགྲ. 25 | 26 | Comparing མགུ and མག: the main consonants are the same (ག), the superscripts are the same (no superscript), the prefixes are the same (མ), the subscripts are the same (no subscript), so we compare the fifth weight.
The first has vowel u, the second has no vowel, so according to the order of vowels, we have མགུ > མག. 27 | 28 | Comparing དགར and དགལ: the main consonants are the same (ག), the superscripts are the same (no superscript), the prefixes are the same (ད), the subscripts are the same (no subscript), the vowels are the same (no vowel), so we compare the sixth weight. The first has suffix ར, the second has suffix ལ, so according to the order of suffixes, we have དགར < དགལ. 29 | 30 | ## Implementation 31 | 32 | This algorithm could be easily implemented using the same mechanism as [UCA](http://unicode.org/reports/tr10/). 33 | 34 | ## Order Listing 35 | 36 | ### Order of main consonants 37 | 38 | The main consonants sort in the following order: 39 | 40 | - ཀ 41 | - ཁ 42 | - ག 43 | - ང 44 | - ཅ 45 | - ཆ 46 | - ཇ 47 | - ཉ 48 | - ཏ 49 | - ཐ 50 | - ད 51 | - ན 52 | - པ 53 | - ཕ 54 | - བ 55 | - མ 56 | - ཙ 57 | - ཚ 58 | - ཛ 59 | - ཝ 60 | - ཞ 61 | - ཟ 62 | - འ 63 | - ཡ 64 | - ར 65 | - ལ 66 | - ཤ 67 | - ས 68 | - ཧ 69 | - ཨ 70 | 71 | ### Order of superscripts 72 | 73 | The superscript letters sort in the following order (presented above ཀ): 74 | 75 | - ཀ (no superscript) 76 | - རྐ 77 | - ལྐ 78 | - སྐ 79 | 80 | ### Order of prefixes 81 | 82 | The prefix letters sort in the following order: 83 | 84 | - no prefix 85 | - ག 86 | - ད 87 | - བ 88 | - མ 89 | - འ 90 | 91 | ### Order of subscripts 92 | 93 | The subscript letters sort in the following order (presented on ཀ): 94 | 95 | - ཀ (no subscript) 96 | - ཀྭ 97 | - ཀྱ 98 | - ཀྱྭ 99 | - ཀྲ 100 | - ཀྲྭ 101 | - ཀླ 102 | 103 | ### Order of vowels 104 | 105 | The vowels sort in the following order (presented on ཀ): 106 | 107 | - ཀ (no vowel) 108 | - ཀི 109 | - ཀུ 110 | - ཀེ 111 | - ཀོ 112 | 113 | ### Order of suffixes 114 | 115 | Suffixes are sorted in the following order: 116 | 117 | - (no suffix) 118 | - ག 119 | - གས 120 | - ང 121 | - ངས 122 | - ད 123 | - ན 124 | - བ 125 | - བས 126 | - མ 127 | - མས 128 | - འ 129 | - འང 130 | - འམ 131 | - འི 132 | - 
འིའང 133 | - འིའམ 134 | - འིའི 135 | - འིའིས 136 | - འིའོ 137 | - འིར 138 | - འིས 139 | - འུ 140 | - འུའང 141 | - འུའམ 142 | - འུའི 143 | - འུའིས 144 | - འུའོ 145 | - འུར 146 | - འུས 147 | - འོ 148 | - འོའང 149 | - འོའམ 150 | - འོའི 151 | - འོའིས 152 | - འོའོ 153 | - འོར 154 | - འོས 155 | - ར 156 | - ལ 157 | - ས 158 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | from icu import RuleBasedCollator 4 | from sys import exit 5 | 6 | RULES = '' 7 | with open ("rules.txt", "r") as rulesfile: 8 | RULES=rulesfile.read() 9 | 10 | COLLATOR = RuleBasedCollator('[normalization on]\n'+RULES) 11 | 12 | EXIT_CODE = 0 13 | 14 | # Very simple test function: we 15 | def testOrder(argList, testName): 16 | global EXIT_CODE 17 | argList = argList.split(' ') 18 | newList = sorted(argList, key=COLLATOR.getSortKey) 19 | if argList != newList: 20 | EXIT_CODE = 1 21 | print(testName+' ... FAIL!') 22 | print("expected ["+(", ".join(argList))+"]") 23 | print("got ["+(", ".join(newList))+"]") 24 | else: 25 | print(testName+' ... 
OK') 26 | 27 | # Tests corresponding to all the prefix+superscript+main+suffix+second suffix possibilities, 28 | # see https://github.com/eroux/tibetan-spellchecker/blob/master/doc/standard-syllable-structure.md 29 | testOrder("ཀ ཀྭ ཀྱ ཀྲ ཀླ དཀ དཀྭ དཀྱ དཀྲ དཀླ བཀ བཀྭ བཀྱ བཀྲ བཀླ རྐ རྐྱ ལྐ སྐ སྐྱ སྐྲ བརྐ བརྐྱ བསྐ བསྐྱ བསྐྲ", "letter ཀ") 30 | testOrder("ཁ ཁྭ ཁྱ ཁྲ མཁ མཁྭ མཁྱ མཁྲ འཁ འཁྭ འཁྱ འཁྲ", "letter ཁ") 31 | testOrder("ག གྭ གྱ གྲ གྲྭ གླ དགྭ དགྱ དགྲ དགྲྭ བགྭ བགྱ བགྲ བགྲྭ བགླ མགྭ མགྱ མགྲ མགྲྭ འགྭ འགྱ འགྲ འགྲྭ རྒ རྒྱ ལྒ སྒ སྒྱ སྒྲ བརྒ བརྒྱ བསྒ བསྒྱ བསྒྲ", "letter ག") 32 | testOrder("ང རྔ ལྔ སྔ བརྔ བསྔ", "letter ང") 33 | testOrder("ཅ ཅྭ གཅ གཅྭ བཅ བཅྭ", "letter ཅ") 34 | testOrder("ཇ རྗ ལྗ བརྗ", "letter ཇ") 35 | testOrder("ཉ ཉྭ གཉྭ མཉྭ རྙ སྙ བརྙ བསྙ", "letter ཉ") 36 | testOrder("ཏ ཏྭ ཏྲ གཏྭ གཏྲ བཏྭ བཏྲ རྟ ལྟ སྟ བརྟ བལྟ བསྟ", "letter ཏ") 37 | testOrder("ཐ ཐྲ", "letter ཐ") 38 | testOrder("ད དྭ དྲ དྲྭ གདྭ བདྭ མདྭ འདྭ འདྲ འདྲྭ རྡ ལྡ སྡ བརྡ བལྡ བསྡ", "letter ད") 39 | testOrder("ན རྣ སྣ སྣྲ བརྣ བསྣ", "letter ན") 40 | testOrder("པ པྱ པྲ དཔྱ དཔྲ ལྤ སྤ སྤྱ སྤྲ", "letter པ") 41 | testOrder("ཕ ཕྱ ཕྱྭ ཕྲ འཕྱ འཕྱྭ འཕྲ", "letter ཕ") 42 | testOrder("བ བྱ བྲ བླ དབྱ དབྲ འབྱ འབྲ རྦ ལྦ སྦ སྦྱ སྦྲ", "letter བ") 43 | testOrder("མ མྱ མྲ དམྱ དམྲ རྨ རྨྱ སྨ སྨྱ སྨྲ", "letter མ") 44 | testOrder("ཙ ཙྭ གཙྭ བཙྭ རྩ རྩྭ སྩ བརྩ བརྩྭ བསྩ", "letter ཙ") 45 | testOrder("ཚ ཚྭ མཚྭ འཚྭ", "letter ཚ") 46 | testOrder("ཛ རྫ བརྫ", "letter ཛ") 47 | testOrder("ཞ ཞྭ གཞྭ བཞྭ", "letter ཞ") 48 | testOrder("ཟ ཟྭ ཟླ བཟྭ བཟླ", "letter ཟ") 49 | testOrder("ར རྭ རླ བརླ", "letter ར") 50 | testOrder("ཤ ཤྭ གཤྭ བཤྭ", "letter ཤ") 51 | testOrder("ས སྭ སྲ སླ གསྭ བསྭ བསྲ བསླ", "letter ས") 52 | testOrder("ཧ ཧྭ ཧྲ ལྷ", "letter ཧ") 53 | testOrder("ཀི ཀུ ཀེ ཀོ", "standard vowels") 54 | testOrder("ཀ ཀཱ ཀི ཀཱི ཀྀ ཀཱྀ ཀུ ཀཱུ ཀེ ཀཻ ཀེེ ཀོ ཀོོ ཀཽ", "all vowels (+ee, oo)") 55 | testOrder("ཀག ཀང ཀད ཀན ཀབ ཀམ ཀའ ཀའུ ཀར ཀལ ཀས", "standard suffixes") 56 | testOrder("ཀག ཀགས ཀང ཀངས ཀད ཀན ཀབ ཀབས ཀམ ཀམས ཀའ ཀའུ ཀར ཀལ ཀས", "standard and second 
suffixes") 57 | testOrder("ཀག ཀགས ཀང ཀངས ཀད ཀན ཀབ ཀབས ཀམ ཀམས ཀའ ཀའང ཀའམ ཀའི ཀའིའོ ཀའུ ཀའུའང ཀའུའམ ཀའུའི ཀའུའིའོ ཀའུའོ ཀའུར ཀའུས ཀའོ ཀར ཀལ ཀས", "standard, second and grammatical suffixes") 58 | testOrder("ཀིག ཀིགས ཀིང ཀིངས ཀིད ཀིན ཀིབ ཀིབས ཀིམ ཀིམས ཀིའ ཀིའང ཀིའམ ཀིའི ཀིའིའོ ཀིའུ ཀིའུའང ཀིའུའམ ཀིའུའི ཀིའུའིའོ ཀིའུའོ ཀིའོ ཀིར ཀིལ ཀིས", "standard, second and grammatical suffixes with i") 59 | testOrder("ཀག ཀགས ཀང ཀྃ ཀངས ཀད ཀམ ཀཾ ཀམས ཀའ", "standard, second and contracted suffixes") 60 | testOrder("ཀིག ཀིགས ཀིང ཀིྃ ཀིངས ཀིད ཀིམ ཀིཾ ཀིམས ཀིའ", "contracted suffixes with i") 61 | testOrder("ཀཀ ཀཁ ཀག ཀགས ཀང ཀངས ཀཉ ཀཏ ཀཊ ཀཐ ཀཋ ཀད ཀཌ ཀན ཀཎ ཀནད ཀཔ ཀཕ ཀབ ཀབས ཀམ ཀཾ ཀམས ཀཙ ཀཚ ཀཛ ཀཝ ཀའ ཀའང ཀའམ་ཀའན ཀའས ཀའི ཀའིམ ཀའུ ཀའུའི ཀའུར ཀའུས ཀའེ ཀའོ ཀཡ ཀར ཀརད ཀལ ཀལད ཀཤ ཀཥ ཀས ཀཧ", "suffixes (Di Jiang) (fixed)") 62 | testOrder("ཀྙ ཀྥ ཀྭ ཀྱ ཀྱྭ ཀྱྲ ཀྲ ཀྲྭ ཀྲྱ ཀླ ཀྵ ཀྷ ཀྷྭ ཀྷྲ", "subscripts (Di Jiang)") 63 | testOrder("ཨོམ ཨོཾ ༀ ཨོར", "decomposed oM") 64 | testOrder("ཀར་ལུགས། ཀརྐ་ཊ།", "TDC p. 9") 65 | testOrder("ཀརྨ་ ཀརྵ་ ཀལྤ་ ཀསྨིར་", "TDC p. 11") 66 | testOrder("གངས་ལྷགས། གཉྫིར། གད།", "TDC p. 347") 67 | testOrder("ཐར་ས། ཐརྐ།", "TDC p. 1153") 68 | testOrder("པུས་ལྷང་ པུསྟིཀཿ་", "TDC p. 1619") 69 | testOrder("ཕལ་མོ་ ཕལྒུ་ཎ་ ཕས་", "TDC p. 1711") 70 | testOrder("བིལ་བ་ བིལྦ་", "TDC p. 1827") 71 | testOrder("མང་ མཉྫ མད་", "TDC p. 2055") 72 | testOrder("མར མརྒཏ་ མལ་", "TDC p. 2061") 73 | testOrder("ཝརྟུ ཝར་ཏི་", "TDC p. 2367") 74 | testOrder("ཤས་ ཤསྟཾ་ ཤི་", "TDC p. 2840") 75 | testOrder("སར་ སརྒཿ་ སལ་", "TDC p. 2920") 76 | testOrder("ཨར་ ཨརྒྷཾ་ ཨརྱ་", "TDC p. 3136") 77 | testOrder("ཨལ་ ཨསྨ་ ཨཱརྱ་", "TDC p. 3137") 78 | testOrder("བུད་དྷ། བུདྡྷཿ། བུདྡྷ་པཱ་ལ། བུདྡྷ་པཱ་ལི་ཏ། བུདྡྷ་ཤྲཱིཤནྟི། བུད་པ།", "TDC p. 1833") 79 | testOrder("སིངས་པོ། སིངྒྷལ། སིད།", "TDC p. 2922") 80 | testOrder("སེང་གེ་ཁ་འབབ་གངས་རི། སེངྒེ་ཁྱིམ། སེང་གེ་རྒྱན་གཞི།", "TDC p. 2934") 81 | testOrder("ཛམ་ཐང་གཙང་པ་དགོན། ཛམྦུ་ཀ། ཛམྦུ་གླིང༌། ཛམྦུ་ཆུ་བོ། ཛམྦུ་ཆུ་གསེར། ཛམྦུ་པྲྀཀྵ། ཛམ་བྷ་ལ།", "TDC p. 
2332") 82 | testOrder("པར་པ་ཏ། པརྤ་ཏ། པར་པར།", "TDC p. 1615") 83 | testOrder("བཻ་དཀར་གཡའ་སེལ། བཻཌཱུརྻ། བཻཌཱུརྻ་དཀར་པོ། བཻཌཱུརྻ་སྔོན་པོ། བཻཌཱུརྻ་སེར་པོ། བཻཌཱུརྻའི་མདོག་ཅན། བཻ་རོ་ཙ་ན། བཻཤྲ་བཎཿ། བེག་གེ།", "TDC p. 1839,1840") 84 | testOrder("ག དགག དགང དག༵ང དག༷ང དགད དགས དགི དགི༵ དགི༷ དགི༵ དགི༷ དགུ བགྱ བགྲ བགྲ༵ བགྲ༷ བགྲུ བགྲུ༵ བགྲུ༷ བགླ", "ignored marks (mark-vowel and vowel-mark)") 85 | # Test page 55 of Manuel de Tibétain Standard by Nicolas Tournadre 86 | testOrder("ག་རེ་ གངས་ གི་ གིས་ གུར་ གེ་སར་ གོ་ གྭ་ གྱང་ གྱུར་ གྲང་མོ་ གྲངས་ གླ་ གླང་ དགའ་ དགུ་ དགེ་བ་ དགོས་ དགྲ་ བགམས་ བགེགས་ མགུར་ མགྱོགས་ རྒན་ རྒོད་པོ་ རྒྱ་ རྒྱ་མ་ ལྒང་བུ་ སྒ་ སྒུག་ སྒོར་མོ་ སྒྱུར་ སྒྲ་ བརྒལ་ བརྒྱ་ བསྒོམས་ བསྒྱུར་ བསྒྲགས་ བསྒྲིགས་", "NT") 87 | testOrder("ཀ་ཀ་ ཀ་ཀ་ནཱི་ལ ཀ་ཀ་ཎི་ལ་ ཀ་ཀ་མུ་ཁ་ ཀ་གཉིས་པ་ ཀ་ཊོ་ར་ ཀ་ཏ་པུར་", "vowel-retroflex priority (Illuminator)") 88 | # Tests from https://github.com/suizokukan/dchars/tree/master/tests/languages/bod 89 | testOrder("ཀ་རྐ་ཏ་ ཀ་སྐྱོར་ ཀ་ཁ་ ཀ་ཁ་པ་ ཀ་ཁའི་རིམ་པ་ ཀ་ཁོལ་མ་ ཀ་འགོ་ ཀ་རྒྱན་ ཀ་རྒྱུག་ ཀ་སྒྲོགས་ ཀ་ཅ་ ཀ་ཅི་ ཀ་ཅོག་ཞང་གསུམ་ ཀ་ཆ་ ཀ་ཆུག་ ཀ་ཆེན་བཅུ་ ཀ་ཆེན་བཞི་ ཀ་གཉིས་པ་ ཀ་ཏ་པུར་ ཀ་ཏ་པུར་འཛག་ ཀ་ཏ་བུ་ར་ ཀ་ཏ་ཡ་ན་ ཀ་ཏ་རུ་ ཀ་ཏན་ ཀ་ཏའི་བུ་ནོག་ཅན་ ཀ་ཏའི་བུ་མོ་ ཀ་ཏི་ ཀ་ཏི་ཤེལ་གྱི་སྦུ་གུ་ཅན་ ཀ་ཏི་ཤེལ་གྱི་རྩ་ ཀ་ཏི་གསེར་གྱི་རྩ་ ཀ་ཏི་གསེར་གྱི་རྩ་ཆེན་ ཀ་ཏུ་ ཀ་ཏོ་ར་ ཀ་ཏྱ་བུ་མོ ཀ་ཏྱ་ཡ་ན ཀ་ཏྱཱ་ཡ་ན ཀ་ཏྱཱ་ཡ་ན་ཆེན་པོ ཀ་ཏྱཱ་ཡ་ན་ནོག་ཅན ཀ་ཏྱཱའི་བུ ཀ་ཏྱཱའི་བུ་ཆེན་པོ ཀ་ཏྱཱའི་བུ་ནོག་ཅན ཀ་ཏྱཱའི་བུ་མོ ཀ་རྟི་ཀ་ ཀ་སྟེགས་ དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ ཁ མཁ འཁ ག གད གན གས རྒ ལྒ སྒ བརྒ བསྒ ང རྔ ལྔ སྔ བརྔ བསྔ ཅ གཅ བཅ ལྕ བལྕ ཆ མཆ འཆ ཇ མཇ འཇ རྗ ལྗ བརྗ ཉ གཉ མཉ རྙ སྙ བརྙ བསྙ ཏ གཏ བཏ རྟ ལྟ སྟ བརྟ བལྟ བསྟ ཐ མཐ འཐ ད དག དང དབ དམ རྡ ལྡ སྡ བརྡ བལྡ བསྡ ན རྣ སྣ བརྣ བསྣ པ ལྤ སྤ ཕ འཕ བ བག བད བར བས རྦ ལྦ སྦ མ མག མང མད མན རྨ སྨ ཙ གཙ བཙ རྩ སྩ བརྩ བསྩ ཚ མཚ འཚ ཛ མཛ འཛ རྫ བརྫ ཝ ཞ གཞ བཞ ཟ གཟ བཟ འ འག འད འབ ཡ གཡ ར བརླ ལ ཤ གཤ བཤ ས ཧ ལྷ ཨ", "dchars (Illuminator)") 90 | testOrder("ཁ གད་ གན་ གར་ གལ་ གས་ ང ད དག་ དང་ དབ་ དམ་ དར་ དལ་ དས་ ན ཕ བག བད་ བར་ བལ་ བས་ མ མག་ མང མད་ མན མབ་ མར་ མལ་ མས་ ཙ འག འད འབ", "2 letters") 91 | 
testOrder("ད དངས ན ཕ བགས མ མགས མངས ཙ", "3 letters ambiguous with first letter as main letter") 92 | testOrder("ཁ དགས འགས ང ད གདས བདས འདས ན ཕ དབས འབས མ དམས ཙ", "3 letters ambiguous with second letter as main letter") 93 | testOrder("ག དགག དགང དགད དགན དགབ དགཝ དགའ དགར དགལ དགས བགག བགང བགད བགབ བགམ བགཾ བགཝ བགའ བགར བགལ མགག མགང མགད མགབ མགའ མགར མགལ འགག འགང འགད འགན འགབ འགམ འགཾ འགའ འགར འགལ འགས དངག དངང དངད དངན དངབ དངའ དངར དངལ མངག མངང མངད མངན མངབ མངའ མངར མངལ གདག གདང གདད གདན གདབ གདམ གདཾ གདའ གདར གདལ གདས བདང བདད བདབ བདམ བདཾ བདའ བདར བདལ བདས མདག མདང མདད མདན མདབ མདའ མདར མདལ མདས འདག འདང འདད འདན འདབ འདམ འདཾ འདའ འདར འདལ འདས གནག གནང གནད གནན གནབ གནམ གནཾ གནའ གནར གནལ གནས མནག མནང མནད མནན མནབ མནམ མནཾ མནའ མནར མནལ མནས དཔག དཔང དཔད དཔབ དཔའ དཔར དཔལ དཔས དབག དབང དབད དབན དབབ དབའ དབར དབལ དབས འབག འབང འབད འབན འབབ འབམ འབཾ འབའ འབར འབལ འབས དམག དམང དམད དམན དམབ དམཝ དམའ དམར དམལ དམས གསག གསང གསད གསན གསབ གསའ གསར གསལ གསས བསག བསང བསད བསབ བསམ བསཾ བསའ བསར བསལ བསས", "3 letters") 94 | testOrder("ཁ དགི དགུ དགེ དགོ བགི བགུ བགེ བགོ མགི མགུ མགེ མགོ འགི འགུ འགེ འགོ ང དངི དངུ དངེ དངོ མངི མངུ མངེ མངོ ཅ ད གདི གདུ གདེ གདོ བདི བདུ བདེ བདོ མདི མདུ མདེ མདོ འདི འདུ འདེ འདོ ན གནི གནུ གནེ གནོ མནི མནུ མནེ མནོ དཔི དཔུ དཔེ དཔོ ཕ དབི དབུ དབེ དབོ འབི འབུ འབེ འབོ མ དམི དམུ དམེ དམོ ས གསི གསུ གསེ གསོ བསི བསུ བསེ བསོ ཧ", "2 letters ambiguous with vowels") 95 | testOrder("ཁ དགྭ བགྭ མགྭ འགྭ ང དངྭ མངྭ ཅ ད གདྭ བདྭ མདྭ འདྭ ན ཕ དབྭ འབྭ མ དམྭ ས གསྭ བསྭ ཧ", "2 letters ambiguous with wasur") 96 | testOrder("ད་ དམ་ ད͏མས་ གདའ་ མ་ དམར་ དམས་ དམི་", "non-standard disambiguation with CGJ") 97 | testOrder("ང་ མངར་ མང⁠ས་ མངི་ མ་ མང་ མངས་ མད་", "non-standard disambiguation with WJ") 98 | 99 | exit(EXIT_CODE) 100 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | # License 2 | 3 | All the files except rules.txt are Copyright Elie Roux, 2015, and are under CC0 1.0 Universal License (see below). 
4 | 5 | See below for license of rules.txt. 6 | 7 | ## License for rules.txt 8 | 9 | Copyright © 1991-2015 Unicode, Inc. All rights reserved. 10 | Distributed under the Terms of Use in 11 | http://www.unicode.org/copyright.html. 12 | 13 | Permission is hereby granted, free of charge, to any person obtaining 14 | a copy of the Unicode data files and any associated documentation 15 | (the "Data Files") or Unicode software and any associated documentation 16 | (the "Software") to deal in the Data Files or Software 17 | without restriction, including without limitation the rights to use, 18 | copy, modify, merge, publish, distribute, and/or sell copies of 19 | the Data Files or Software, and to permit persons to whom the Data Files 20 | or Software are furnished to do so, provided that 21 | (a) this copyright and permission notice appear with all copies 22 | of the Data Files or Software, 23 | (b) this copyright and permission notice appear in associated 24 | documentation, and 25 | (c) there is clear notice in each modified Data File or in the Software 26 | as well as in the documentation associated with the Data File(s) or 27 | Software that the data or software has been modified. 28 | 29 | THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF 30 | ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE 31 | WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 32 | NONINFRINGEMENT OF THIRD PARTY RIGHTS. 33 | IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS 34 | NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL 35 | DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, 36 | DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER 37 | TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR 38 | PERFORMANCE OF THE DATA FILES OR SOFTWARE. 
39 | 40 | Except as contained in this notice, the name of a copyright holder 41 | shall not be used in advertising or otherwise to promote the sale, 42 | use or other dealings in these Data Files or Software without prior 43 | written authorization of the copyright holder. 44 | 45 | ## CC0 1.0 Universal 46 | 47 | Statement of Purpose 48 | 49 | The laws of most jurisdictions throughout the world automatically confer 50 | exclusive Copyright and Related Rights (defined below) upon the creator and 51 | subsequent owner(s) (each and all, an "owner") of an original work of 52 | authorship and/or a database (each, a "Work"). 53 | 54 | Certain owners wish to permanently relinquish those rights to a Work for the 55 | purpose of contributing to a commons of creative, cultural and scientific 56 | works ("Commons") that the public can reliably and without fear of later 57 | claims of infringement build upon, modify, incorporate in other works, reuse 58 | and redistribute as freely as possible in any form whatsoever and for any 59 | purposes, including without limitation commercial purposes. These owners may 60 | contribute to the Commons to promote the ideal of a free culture and the 61 | further production of creative, cultural and scientific works, or to gain 62 | reputation or greater distribution for their Work in part through the use and 63 | efforts of others. 64 | 65 | For these and/or other purposes and motivations, and without any expectation 66 | of additional consideration or compensation, the person associating CC0 with a 67 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 68 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work 69 | and publicly distribute the Work under its terms, with knowledge of his or her 70 | Copyright and Related Rights in the Work and the meaning and intended legal 71 | effect of CC0 on those rights. 72 | 73 | 1. Copyright and Related Rights. 
A Work made available under CC0 may be 74 | protected by copyright and related or neighboring rights ("Copyright and 75 | Related Rights"). Copyright and Related Rights include, but are not limited 76 | to, the following: 77 | 78 | i. the right to reproduce, adapt, distribute, perform, display, communicate, 79 | and translate a Work; 80 | 81 | ii. moral rights retained by the original author(s) and/or performer(s); 82 | 83 | iii. publicity and privacy rights pertaining to a person's image or likeness 84 | depicted in a Work; 85 | 86 | iv. rights protecting against unfair competition in regards to a Work, 87 | subject to the limitations in paragraph 4(a), below; 88 | 89 | v. rights protecting the extraction, dissemination, use and reuse of data in 90 | a Work; 91 | 92 | vi. database rights (such as those arising under Directive 96/9/EC of the 93 | European Parliament and of the Council of 11 March 1996 on the legal 94 | protection of databases, and under any national implementation thereof, 95 | including any amended or successor version of such directive); and 96 | 97 | vii. other similar, equivalent or corresponding rights throughout the world 98 | based on applicable law or treaty, and any national implementations thereof. 99 | 100 | 2. Waiver. 
To the greatest extent permitted by, but not in contravention of, 101 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and 102 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright 103 | and Related Rights and associated claims and causes of action, whether now 104 | known or unknown (including existing as well as future claims and causes of 105 | action), in the Work (i) in all territories worldwide, (ii) for the maximum 106 | duration provided by applicable law or treaty (including future time 107 | extensions), (iii) in any current or future medium and for any number of 108 | copies, and (iv) for any purpose whatsoever, including without limitation 109 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 110 | the Waiver for the benefit of each member of the public at large and to the 111 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 112 | shall not be subject to revocation, rescission, cancellation, termination, or 113 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 114 | by the public as contemplated by Affirmer's express Statement of Purpose. 115 | 116 | 3. Public License Fallback. Should any part of the Waiver for any reason be 117 | judged legally invalid or ineffective under applicable law, then the Waiver 118 | shall be preserved to the maximum extent permitted taking into account 119 | Affirmer's express Statement of Purpose. 
In addition, to the extent the Waiver 120 | is so judged Affirmer hereby grants to each affected person a royalty-free, 121 | non transferable, non sublicensable, non exclusive, irrevocable and 122 | unconditional license to exercise Affirmer's Copyright and Related Rights in 123 | the Work (i) in all territories worldwide, (ii) for the maximum duration 124 | provided by applicable law or treaty (including future time extensions), (iii) 125 | in any current or future medium and for any number of copies, and (iv) for any 126 | purpose whatsoever, including without limitation commercial, advertising or 127 | promotional purposes (the "License"). The License shall be deemed effective as 128 | of the date CC0 was applied by Affirmer to the Work. Should any part of the 129 | License for any reason be judged legally invalid or ineffective under 130 | applicable law, such partial invalidity or ineffectiveness shall not 131 | invalidate the remainder of the License, and in such case Affirmer hereby 132 | affirms that he or she will not (i) exercise any of his or her remaining 133 | Copyright and Related Rights in the Work or (ii) assert any associated claims 134 | and causes of action with respect to the Work, in either case contrary to 135 | Affirmer's express Statement of Purpose. 136 | 137 | 4. Limitations and Disclaimers. 138 | 139 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 140 | surrendered, licensed or otherwise affected by this document. 141 | 142 | b. Affirmer offers the Work as-is and makes no representations or warranties 143 | of any kind concerning the Work, express, implied, statutory or otherwise, 144 | including without limitation warranties of title, merchantability, fitness 145 | for a particular purpose, non infringement, or the absence of latent or 146 | other defects, accuracy, or the present or absence of errors, whether or not 147 | discoverable, all to the greatest extent permissible under applicable law. 
148 | 149 | c. Affirmer disclaims responsibility for clearing rights of other persons 150 | that may apply to the Work or any use thereof, including without limitation 151 | any person's Copyright and Related Rights in the Work. Further, Affirmer 152 | disclaims responsibility for obtaining any necessary consents, permissions 153 | or other rights required for any use of the Work. 154 | 155 | d. Affirmer understands and acknowledges that Creative Commons is not a 156 | party to this document and has no duty or obligation with respect to this 157 | CC0 or use of the Work. 158 | 159 | For more information, please see 160 | 161 | 162 | --------------------------------------------------------------------------------