├── .gitignore ├── Dharma_Text_Sources.md ├── LICENSE ├── README.md ├── bokepy ├── LICENSE ├── __init__.py ├── boke.py ├── grapheme │ ├── README.me │ ├── __init__.py │ ├── api.py │ ├── data │ │ └── grapheme_break_property.json │ ├── finder.py │ └── grapheme_property_group.py └── utils.py ├── extras └── rt_index.txt ├── requirements.txt ├── setup.py └── tests └── basic_tests.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.DS_Store 2 | *.ipynb_checkpoints 3 | *.pyc 4 | __pycache__ 5 | -------------------------------------------------------------------------------- /Dharma_Text_Sources.md: -------------------------------------------------------------------------------- 1 | ## Text Databases 2 | 3 | #### The entire Rinchen Terdzo 4 | http://rtz.tsadra.org/index.php/Main_Page 5 | 6 | #### A Collection of Text from More than 100 Authors 7 | http://www.lotsawahouse.org/bo/free-translations-tibetan-buddhist-texts 8 | 9 | #### Degé edition of the Kangyur and Tengyur 10 | http://www.thlib.org/encyclopedias/literary/canons/ 11 | 12 | #### Kangyur Translation Project 13 | http://84000.co/ 14 | 15 | #### Buddhist Digital Resource Center 16 | https://www.tbrc.org/ 17 | 18 | #### Old Tibetan Documents Online 19 | http://otdo.aa-ken.jp/ 20 | 21 | #### Timeless Treasuries 22 | http://dharmacloud.tsadra.org/library/ 23 | 24 | #### Dharma Text Repository 25 | http://rywikitexts.tsadra.org/ 26 | 27 | #### Buddhism Library Project 28 | http://www.buddism.ru/___DHARMA___/ 29 | 30 | 31 | ## Other Resources 32 | 33 | #### The Buddhist Canon Research Database 34 | http://databases.aibs.columbia.edu/?sub=about 35 | 36 | #### Treasury of Lives 37 | https://treasuryoflives.org 38 | 39 | #### Rigpa Wiki 40 | http://rigpawiki.org/ 41 | 42 | #### Ranjung Yeshe Wiki 43 | http://rywiki.tsadra.org/ 44 | 45 | #### Himalayan Art 46 | https://www.himalayanart.org/ 47 | 48 | #### Collection of English Dharma books 49 | http://promienie.net/ 50 | 51 
| #### Collection of Dictionaries 52 | http://www.buddism.ru///___DICTIONARIES/ 53 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Mikko Kotila 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # bokepy 2 | 3 | bö (tib. བོད་) means Tibet, and ke (tib. སྐད) means language, so together böke means Tibetan Language. bokepy is a shorthand for boke and python, and is a Tibetan Language Processing Library built for handling the most common language processing tasks in a straightforward way. 
Bokepy is built from the ground up to support a wide range of research challenges, including those far beyond the scope of typical scholarly interest. This includes rapid testing of ideas and prototyping of completely new technology solutions. 4 | 5 | ## Table of Contents 6 | 7 | #### 1. About This Document 8 | #### 2. Requirements 9 | #### 3. About the Design Paradigm 10 | #### 4. Roadmap 11 | #### 5. References 12 | 13 | ## 1. About This Document 14 | 15 | At the moment there is no user documentation, so this document serves as a guideline for development: what should be built first, and how it should be built. For now, see the docstrings of individual functions. 16 | 17 | ## 2. Requirements 18 | 19 | The initial spec is focused on making the most common analytical processes as easy to perform as possible. At the same time, the goal is to establish a clean codebase for further development with as modular a design paradigm as possible. 20 | 21 | - **being able to ingest text** 22 | - as raw text 23 | - as a list of entities 24 | - as a frequency list with counts 25 | - in a variety of file formats: 26 | - .txt 27 | - .csv 28 | - .doc 29 | - .xls 30 | - .pdf (later) 31 | - .sql (later) 32 | - other formats (which?) 33 | - handling unicode properly is a must! 34 | 35 | - **being able to clean text** 36 | - remove non-Tibetan entities 37 | - remove unnecessary spaces 38 | - remove other garbage (needs to be defined) 39 | 40 | - **being able to preprocess text** 41 | - split text into: 42 | - sentences 43 | - phrases 44 | - words 45 | - syllables 46 | - characters 47 | - filter out stopwords 48 | - punctuation 49 | - particles 50 | - other words (what are these?) 51 | 52 | - **being able to establish grams** 53 | - character-level 54 | - syllable-level 55 | - word-level 56 | - phrase-level 57 | - sentence-level 58 | 59 | - **being able to create frequency tables** 60 | - at any entity level (e.g.
syllable frequency) 61 | - with simple summaries for % share of buckets (e.g. the 100 most common account for 12% of all occurrences in a set) 62 | 63 | - **being able to draw out plots** 64 | - word clouds 65 | - bar charts 66 | - heatmaps 67 | 68 | - **being able to output to files** 69 | - html 70 | - json 71 | - excel 72 | - latex 73 | - msgpack 74 | - sql 75 | - clipboard 76 | 77 | ## 3. About the Design Paradigm 78 | 79 | To maximize code readability, at least initially there is no object orientation, i.e. classes are not used. 80 | 81 | Functions should meet several functional criteria: 82 | 83 | - Can ingest a list, Series, or DataFrame without prompting the user 84 | - Can be called separately from any other functionality 85 | - Can be used within any other function 86 | 87 | Functions should also meet several structural criteria: 88 | 89 | - No more than 50 lines of code in total 90 | - Strict adherence to PEP 8 guidelines 91 | - Include a comprehensive docstring 92 | - Use code simple enough to avoid the need for comments 93 | 94 | ## 4. Roadmap 95 | 96 | Roughly speaking, the roadmap is broken into 4 stages (at least for now): 97 | 98 | 1) Common Foundational Features 99 | - Data in and out 100 | - Data cleaning 101 | - Tokenization / segmentation 102 | - Frequency tables 103 | - Access to common language resources (corpora etc.) 104 | 2) Common Statistical Functionality 105 | - (n)gram creation 106 | - word frequency table creation 107 | - foundational visualization capabilities 108 | 3) Special Foundational Features 109 | - embeddings / one-hot encoding 110 | - vectorization (sense2vec, word2vec, etc. [1]) 111 | - several options for word tokenization 112 | - POS tagging 113 | - Entity recognition 114 | - Accuracy / quality assessment 115 | 4) Extraordinary Statistical Capabilities 116 | - Integrate with Keras (LSTM etc.) 117 | - Let's see what more...it takes a long journey to know the horse's strength. 118 | 119 | ## 5.
References 120 | 121 | [1] https://github.com/MaxwellRebo/awesome-2vec 122 | -------------------------------------------------------------------------------- /bokepy/LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Mikko Kotila 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /bokepy/__init__.py: -------------------------------------------------------------------------------- 1 | # MIT License 2 | # Copyright (c) 2018 Mikko Kotila 3 | 4 | __version__ = "0.1" 5 | -------------------------------------------------------------------------------- /bokepy/boke.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import pandas as pd 4 | import re 5 | from pytib import Segment 6 | from grapheme import graphemes 7 | from grapheme import length 8 | 9 | 10 | ''' 11 | NLP tasks take time to run, so have patience. 12 | 13 | ''' 14 | 15 | sent_punct_re = "༄|༅|༆|༇|༈|།|༎|༏|༐|༑|༔|;|:" 16 | particles_re = "གི|ཀྱི|གྱི|ཡི|གིས|ཀྱིས|གྱིས|ཡིས|སུ|ཏུ|དུ|རུ|སྟེ|ཏེ|དེ|ཀྱང|ཡང|འང|གམ|ངམ|དམ|ནམ|བམ|མམ|འམ|རམ|ལམ|སམ|ཏམ|གོ|ངོ|དོ|ནོ|མོ|འོ|རོ|ལོ|སོ|ཏོ|ཅིང|ཅེས|ཅེའོ|ཅེ་ན|ཅིག|ཞིང|ཞེས|ཞེའོ|ཞེ་ན|ཞིག|ཤིང|ཤེའོ|ཤེ་ན|ཤིག|ལ|ན|ནས|ལས|ནི|དང|གང|ཅི|ཇི|གིན|གྱིན|ཀྱིན|ཡིན|པ|བ|པོ|བོ" 17 | 18 | stopwords = ["འི་", "གི་", "ཀྱི་", "གྱི་", "ཡི་", "གིས་", "ཀྱིས་", "གྱིས་", 19 | "ཡིས་", "སུ་", "ཏུ་", "དུ་", "རུ་", "སྟེ་", "ཏེ་", "དེ་", 20 | "ཀྱང་", "ཡང་", "འང་", "གམ་", "ངམ་", "དམ་", "ནམ་", "བམ་", 21 | "མམ་", "འམ་", "རམ་", "ལམ་", "སམ་", "ཏམ་", "གོ་", "ངོ་", 22 | "དོ་", "ནོ་", "མོ་", "འོ་", "རོ་", "ལོ་", "སོ་", "ཏོ་", "ཅིང་", 23 | "ཅེས་", "ཅེའོ་", "ཅེ་ན་", "ཅིག་", "ཞིང་", "ཞེས་", "ཞེའོ་", 24 | "ཞེ་ན་", "ཞིག་", "ཤིང་", "ཤེའོ་", "ཤེ་ན་", "ཤིག་", "ལ་", "ན་", 25 | "ནས་", "ལས་", "ནི་", "དང་", "གང་", "ཅི་", "ཇི་", "གིན་", 26 | "གྱིན་", "ཀྱིན་", "ཡིན་", "པ་", "བ་", "པོ་", "བ་ོ", "ར་", "ས་", 27 | "མ་", "་_", "ལ", "ན"] 28 | 29 | latins_re = r'[A-Z]|[a-z]|[0-9]|[\\$%&\'()*+,./:;<=>?@^_`[\]{}~]' 30 | 31 | 32 | def ingest_text(filename, mode='blob'): 33 | 34 | if mode != "blob": 35 | out = open(filename).readlines() 36 | else: 37 | out = open(filename).read() 38 | 39 | return out 40 | 41 | 42 | def export(data, name='default', export_format='to_csv'): 43 |
44 | ''' 45 | WHAT 46 | ---- 47 | Takes data and exports it to a file on the local drive. Note that 48 | the file is saved automatically in the present working directory 49 | and any file with the same name is overwritten. 50 | 51 | PARAMS 52 | ------ 53 | data: data in a list or dataframe or series 54 | export_format: to_html, to_json, to_csv, to_excel, to_latex, 55 | to_msgpack, to_sql, to_clipboard 56 | ''' 57 | temp = data 58 | 59 | if name == 'default': 60 | file_type = export_format.split('_')[1] 61 | name = 'export_from_boke.' + file_type 62 | 63 | method_to_call = find_method(temp, export_format) 64 | method_to_call(name) 65 | 66 | 67 | def type_convert(data): 68 | 69 | temp = data 70 | 71 | if not isinstance(temp, pd.DataFrame): 72 | temp = pd.DataFrame(temp) 73 | temp = temp.set_index(temp.columns[0]) 74 | 75 | return temp 76 | 77 | 78 | def remove_latins(data): 79 | 80 | temp = type_convert(data) 81 | 82 | out = temp[~temp.index.str.contains(latins_re)] 83 | 84 | return out 85 | 86 | 87 | def find_method(data, function): 88 | 89 | temp = type_convert(data) 90 | 91 | method_to_call = getattr(temp, function) 92 | 93 | return method_to_call 94 | 95 | 96 | def show_latins(data): 97 | 98 | temp = type_convert(data) 99 | 100 | return temp[temp.index.str.contains(latins_re)] 101 | 102 | 103 | def create_meta(data): 104 | 105 | ''' 106 | Accepts as input a frequency table (text as index, one count column) and outputs a dataframe with meta-data.
107 | ''' 108 | 109 | # new[~new.index.isin(particles)] 110 | temp = data 111 | temp = temp.reset_index() 112 | temp.columns = ['text', 'count'] 113 | temp['chars'] = temp.text.apply(length) 114 | temp['bytes'] = temp.text.str.len()  # note: str.len() counts code points, not bytes 115 | temp['sentence_ending'] = ~temp.text.str.contains('་') 116 | temp['sentence_ending'] = temp['sentence_ending'].astype(int) 117 | temp['stopword'] = temp.text.isin(stopwords) 118 | 119 | return temp 120 | 121 | 122 | def text_to_chars(text): 123 | 124 | ''' 125 | Takes as input Tibetan text, and creates a list of individual characters. 126 | 127 | ''' 128 | 129 | temp = graphemes(text) 130 | out = list(temp) 131 | 132 | return out 133 | 134 | 135 | def text_to_syllables(text): 136 | 137 | ''' 138 | Takes as input Tibetan text, and creates a list of individual syllables. 139 | 140 | ''' 141 | 142 | out = text.split('་') 143 | 144 | return out 145 | 146 | 147 | def text_to_words(text, mode='list'): 148 | 149 | ''' 150 | OPTIONS 151 | ------- 152 | mode: either 'list' or 'whitespaces' 153 | ''' 154 | 155 | temp = re.sub(sent_punct_re, "", text) 156 | 157 | seg = Segment() 158 | temp = seg.segment(temp, uknown=0) 159 | if mode == 'list': 160 | temp = temp.split() 161 | 162 | return temp 163 | 164 | 165 | def text_to_sentence(text): 166 | 167 | ''' 168 | Takes as input Tibetan text, and creates a list of individual sentences. 169 | 170 | ''' 171 | 172 | out = re.split(sent_punct_re, text) 173 | 174 | return out 175 | 176 | 177 | def syllable_grams(data, grams=2, save_to_file=None): 178 | 179 | ''' 180 | Takes in a list of syllables and creates syllable pairs. 181 | Note that there is no intelligence involved, so the syllable pairs 182 | might not result in actual words (even though they often do). 183 | 184 | OUTPUT: a list with the syllable pairs (and optionally saved file) 185 | 186 | OPTIONS: if 'save_to_file' is not None, it needs to be a filename.
187 | 188 | ''' 189 | 190 | entities = pd.DataFrame(data) 191 | entities.columns = ['text'] 192 | 193 | l = [] 194 | a = -1  # start before index 0 so the first pair is (0, 1) 195 | for i in entities['text']: 196 | a += 1 197 | try: 198 | l.append(entities['text'][a] + " " + entities['text'][a + 1]) 199 | except KeyError: 200 | l.append("blank") 201 | 202 | if save_to_file is not None: 203 | 204 | out = pd.Series(l) 205 | out.to_csv(save_to_file) 206 | 207 | return l 208 | 209 | 210 | def syllable_counts(syllable_list): 211 | 212 | ''' 213 | Takes as input a list or series with syllables or other syllable entities 214 | such as syllable pairs. 215 | 216 | ''' 217 | 218 | out = pd.Series(syllable_list).value_counts() 219 | out = pd.DataFrame(out) 220 | out.columns = ['counts'] 221 | 222 | return out 223 | 224 | 225 | def share_by_order(data): 226 | 227 | ''' 228 | Takes as input a frequency dataframe; a column named 'counts' is expected. 229 | 230 | ''' 231 | 232 | total = data['counts'].sum() 233 | print("Total syllable pairs : %d \n" % total) 234 | 235 | orders = [10, 100, 1000, 10000, 100000] 236 | 237 | for i in orders: 238 | share = data['counts'][:i].sum() / float(total) * 100 239 | print(("TOP %d : %.2f%%") % (i, share)) 240 | -------------------------------------------------------------------------------- /bokepy/grapheme/README.me: -------------------------------------------------------------------------------- 1 | NOTE: Only two functions (graphemes and length) are used from this library, so it's probably better to remove all the redundant code. 2 | -------------------------------------------------------------------------------- /bokepy/grapheme/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Main module for the grapheme package.
3 | """ 4 | 5 | from .api import graphemes, length, grapheme_lengths, slice, contains, safe_split_index, startswith, endswith 6 | 7 | __all__ = ['graphemes', 'length', 'grapheme_lengths', 'slice', 'contains', 'safe_split_index', 'startswith', 'endswith'] 8 | -------------------------------------------------------------------------------- /bokepy/grapheme/api.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from grapheme.finder import GraphemeIterator, get_last_certain_break_index 3 | 4 | def graphemes(string): 5 | """ 6 | Returns an iterator of all graphemes of given string. 7 | 8 | >>> rainbow_flag = "🏳️‍🌈" 9 | >>> [codepoint for codepoint in rainbow_flag] 10 | ['🏳', '️', '\u200d', '🌈'] 11 | >>> list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag)) 12 | ['m', 'u', 'l', 't', 'i', ' ', 'c', 'o', 'd', 'e', 'p', 'o', 'i', 'n', 't', ' ', 'g', 'r', 'a', 'p', 'h', 'e', 'm', 'e', ':', ' ', '🏳️‍🌈'] 13 | """ 14 | return iter(GraphemeIterator(string)) 15 | 16 | 17 | def length(string, until=None): 18 | """ 19 | Returns the number of graphemes in the string. 20 | 21 | Note that this functions needs to traverse the full string to calculate the length, 22 | unlike `len(string)` and it's time consumption is linear to the length of the string 23 | (up to the `until` value). 24 | 25 | Only counts up to the `until` argument, if given. This is useful when testing 26 | the length of a string against some limit and the excess length is not interesting. 
27 | 28 | >>> rainbow_flag = "🏳️‍🌈" 29 | >>> len(rainbow_flag) 30 | 4 31 | >>> grapheme.length(rainbow_flag) 32 | 1 33 | >>> grapheme.length("".join(str(i) for i in range(100)), 30) 34 | 30 35 | """ 36 | if until is None: 37 | return sum(1 for _ in GraphemeIterator(string)) 38 | 39 | iterator = graphemes(string) 40 | count = 0 41 | while True: 42 | try: 43 | if count >= until: 44 | break 45 | next(iterator) 46 | except StopIteration: 47 | break 48 | else: 49 | count += 1 50 | 51 | return count 52 | 53 | 54 | # todo: should probably use an optimized iterator that only deals with code point counts (optimization) 55 | def grapheme_lengths(string): 56 | """ 57 | Returns an iterator of the number of code points in each grapheme of the string. 58 | """ 59 | return iter(len(g) for g in graphemes(string)) 60 | 61 | 62 | def slice(string, start=None, end=None): 63 | """ 64 | Returns a substring of the given string, counting graphemes instead of codepoints. 65 | 66 | Negative indices are currently not supported. 67 | >>> string = "tamil நி (ni)" 68 | 69 | >>> string[:7] 70 | 'tamil ந' 71 | >>> grapheme.slice(string, end=7) 72 | 'tamil நி' 73 | >>> string[7:] 74 | 'ி (ni)' 75 | >>> grapheme.slice(string, 7) 76 | ' (ni)' 77 | """ 78 | 79 | if start is None: 80 | start = 0 81 | if end is not None and start >= end: 82 | return "" 83 | 84 | if start < 0: 85 | raise NotImplementedError("Negative indexing is currently not supported.") 86 | 87 | sum_ = 0 88 | start_index = None 89 | for grapheme_index, grapheme_length in enumerate(grapheme_lengths(string)): 90 | if grapheme_index == start: 91 | start_index = sum_ 92 | elif grapheme_index == end: 93 | return string[start_index:sum_] 94 | sum_ += grapheme_length 95 | 96 | if start_index is not None: 97 | return string[start_index:] 98 | 99 | return "" 100 | 101 | def contains(string, substring): 102 | """ 103 | Returns True if the sequence of graphemes in substring is also present in string.
104 | 105 | This differs from the normal python `in` operator, since the python operator will return 106 | True if the sequence of codepoints is within the other string without considering 107 | grapheme boundaries. 108 | 109 | Performance notes: Very fast if `substring not in string`, since that also means that 110 | the same graphemes can not be in the two strings. Otherwise this function has linear time 111 | complexity in relation to the string length. It will traverse the sequence of graphemes until 112 | a match is found, so it will generally perform better for grapheme sequences that match early. 113 | 114 | >>> "🇸🇪" in "🇪🇸🇪🇪" 115 | True 116 | >>> grapheme.contains("🇪🇸🇪🇪", "🇸🇪") 117 | False 118 | """ 119 | if substring not in string: 120 | return False 121 | 122 | substr_graphemes = list(graphemes(substring)) 123 | 124 | if len(substr_graphemes) == 0: 125 | return True 126 | elif len(substr_graphemes) == 1: 127 | return substr_graphemes[0] in graphemes(string) 128 | else: 129 | str_iter = graphemes(string) 130 | str_sub_part = [] 131 | for _ in range(len(substr_graphemes)): 132 | try: 133 | str_sub_part.append(next(str_iter)) 134 | except StopIteration: 135 | return False 136 | 137 | for g in str_iter: 138 | if str_sub_part == substr_graphemes: 139 | return True 140 | 141 | str_sub_part.append(g) 142 | str_sub_part.pop(0) 143 | return str_sub_part == substr_graphemes 144 | 145 | 146 | def startswith(string, prefix): 147 | """ 148 | Like str.startswith, but also checks that the string starts with the given prefix's sequence of graphemes. 149 | 150 | str.startswith may return True for a prefix that is not visually represented as a prefix if a grapheme cluster 151 | is continued after the prefix ends.
152 | 153 | >>> grapheme.startswith("✊🏾", "✊") 154 | False 155 | >>> "✊🏾".startswith("✊") 156 | True 157 | """ 158 | return string.startswith(prefix) and safe_split_index(string, len(prefix)) == len(prefix) 159 | 160 | 161 | def endswith(string, suffix): 162 | """ 163 | Like str.endswith, but also checks that the string ends with the given suffix's sequence of graphemes. 164 | 165 | str.endswith may return True for a suffix that is not visually represented as a suffix if a grapheme cluster 166 | is initiated before the suffix starts. 167 | 168 | >>> grapheme.endswith("🏳️‍🌈", "🌈") 169 | False 170 | >>> "🏳️‍🌈".endswith("🌈") 171 | True 172 | """ 173 | expected_index = len(string) - len(suffix) 174 | return string.endswith(suffix) and safe_split_index(string, expected_index) == expected_index 175 | 176 | 177 | def safe_split_index(string, max_len): 178 | """ 179 | Returns the highest index up to `max_len` at which the given string can be sliced, without breaking a grapheme. 180 | 181 | This is useful when you want to split or take a substring from a string, and don't really care about 182 | the exact grapheme length, but don't want to risk breaking existing graphemes. 183 | 184 | This function normally does not traverse the full grapheme sequence up to the given length, so it can be used 185 | for arbitrarily long strings and high `max_len`s. However, some grapheme boundaries depend on the previous state, 186 | so the worst case performance is O(n). In practice, it's only very long non-broken sequences of country flags 187 | (represented as Regional Indicators) that will perform badly. 188 | 189 | The return value will always be between `0` and `len(string)`.
190 | 191 | >>> string = "tamil நி (ni)" 192 | >>> i = grapheme.safe_split_index(string, 7) 193 | >>> i 194 | 6 195 | >>> string[:i] 196 | 'tamil ' 197 | >>> string[i:] 198 | 'நி (ni)' 199 | """ 200 | last_index = get_last_certain_break_index(string, max_len) 201 | for l in grapheme_lengths(string[last_index:]): 202 | if last_index + l > max_len: 203 | break 204 | last_index += l 205 | return last_index 206 | -------------------------------------------------------------------------------- /bokepy/grapheme/data/grapheme_break_property.json: -------------------------------------------------------------------------------- 1 | { 2 | "LF": { 3 | "ranges": [], 4 | "single_chars": [ 5 | "000A" 6 | ] 7 | }, 8 | "Control": { 9 | "ranges": [ 10 | [ 11 | "0000", 12 | "0009" 13 | ], 14 | [ 15 | "000B", 16 | "000C" 17 | ], 18 | [ 19 | "000E", 20 | "001F" 21 | ], 22 | [ 23 | "007F", 24 | "009F" 25 | ], 26 | [ 27 | "200E", 28 | "200F" 29 | ], 30 | [ 31 | "202A", 32 | "202E" 33 | ], 34 | [ 35 | "2060", 36 | "2064" 37 | ], 38 | [ 39 | "2066", 40 | "206F" 41 | ], 42 | [ 43 | "D800", 44 | "DFFF" 45 | ], 46 | [ 47 | "FFF0", 48 | "FFF8" 49 | ], 50 | [ 51 | "FFF9", 52 | "FFFB" 53 | ], 54 | [ 55 | "1BCA0", 56 | "1BCA3" 57 | ], 58 | [ 59 | "1D173", 60 | "1D17A" 61 | ], 62 | [ 63 | "E0002", 64 | "E001F" 65 | ], 66 | [ 67 | "E0080", 68 | "E00FF" 69 | ], 70 | [ 71 | "E01F0", 72 | "E0FFF" 73 | ] 74 | ], 75 | "single_chars": [ 76 | "00AD", 77 | "061C", 78 | "180E", 79 | "200B", 80 | "2028", 81 | "2029", 82 | "2065", 83 | "FEFF", 84 | "E0000", 85 | "E0001" 86 | ] 87 | }, 88 | "ZWJ": { 89 | "ranges": [], 90 | "single_chars": [ 91 | "200D" 92 | ] 93 | }, 94 | "Extend": { 95 | "ranges": [ 96 | [ 97 | "0300", 98 | "036F" 99 | ], 100 | [ 101 | "0483", 102 | "0487" 103 | ], 104 | [ 105 | "0488", 106 | "0489" 107 | ], 108 | [ 109 | "0591", 110 | "05BD" 111 | ], 112 | [ 113 | "05C1", 114 | "05C2" 115 | ], 116 | [ 117 | "05C4", 118 | "05C5" 119 | ], 120 | [ 121 | "0610", 122 | "061A" 123 | ], 124 | [ 
125 | "064B", 126 | "065F" 127 | ], 128 | [ 129 | "06D6", 130 | "06DC" 131 | ], 132 | [ 133 | "06DF", 134 | "06E4" 135 | ], 136 | [ 137 | "06E7", 138 | "06E8" 139 | ], 140 | [ 141 | "06EA", 142 | "06ED" 143 | ], 144 | [ 145 | "0730", 146 | "074A" 147 | ], 148 | [ 149 | "07A6", 150 | "07B0" 151 | ], 152 | [ 153 | "07EB", 154 | "07F3" 155 | ], 156 | [ 157 | "0816", 158 | "0819" 159 | ], 160 | [ 161 | "081B", 162 | "0823" 163 | ], 164 | [ 165 | "0825", 166 | "0827" 167 | ], 168 | [ 169 | "0829", 170 | "082D" 171 | ], 172 | [ 173 | "0859", 174 | "085B" 175 | ], 176 | [ 177 | "08D4", 178 | "08E1" 179 | ], 180 | [ 181 | "08E3", 182 | "0902" 183 | ], 184 | [ 185 | "0941", 186 | "0948" 187 | ], 188 | [ 189 | "0951", 190 | "0957" 191 | ], 192 | [ 193 | "0962", 194 | "0963" 195 | ], 196 | [ 197 | "09C1", 198 | "09C4" 199 | ], 200 | [ 201 | "09E2", 202 | "09E3" 203 | ], 204 | [ 205 | "0A01", 206 | "0A02" 207 | ], 208 | [ 209 | "0A41", 210 | "0A42" 211 | ], 212 | [ 213 | "0A47", 214 | "0A48" 215 | ], 216 | [ 217 | "0A4B", 218 | "0A4D" 219 | ], 220 | [ 221 | "0A70", 222 | "0A71" 223 | ], 224 | [ 225 | "0A81", 226 | "0A82" 227 | ], 228 | [ 229 | "0AC1", 230 | "0AC5" 231 | ], 232 | [ 233 | "0AC7", 234 | "0AC8" 235 | ], 236 | [ 237 | "0AE2", 238 | "0AE3" 239 | ], 240 | [ 241 | "0AFA", 242 | "0AFF" 243 | ], 244 | [ 245 | "0B41", 246 | "0B44" 247 | ], 248 | [ 249 | "0B62", 250 | "0B63" 251 | ], 252 | [ 253 | "0C3E", 254 | "0C40" 255 | ], 256 | [ 257 | "0C46", 258 | "0C48" 259 | ], 260 | [ 261 | "0C4A", 262 | "0C4D" 263 | ], 264 | [ 265 | "0C55", 266 | "0C56" 267 | ], 268 | [ 269 | "0C62", 270 | "0C63" 271 | ], 272 | [ 273 | "0CCC", 274 | "0CCD" 275 | ], 276 | [ 277 | "0CD5", 278 | "0CD6" 279 | ], 280 | [ 281 | "0CE2", 282 | "0CE3" 283 | ], 284 | [ 285 | "0D00", 286 | "0D01" 287 | ], 288 | [ 289 | "0D3B", 290 | "0D3C" 291 | ], 292 | [ 293 | "0D41", 294 | "0D44" 295 | ], 296 | [ 297 | "0D62", 298 | "0D63" 299 | ], 300 | [ 301 | "0DD2", 302 | "0DD4" 303 | ], 304 | [ 305 | "0E34", 306 | 
"0E3A" 307 | ], 308 | [ 309 | "0E47", 310 | "0E4E" 311 | ], 312 | [ 313 | "0EB4", 314 | "0EB9" 315 | ], 316 | [ 317 | "0EBB", 318 | "0EBC" 319 | ], 320 | [ 321 | "0EC8", 322 | "0ECD" 323 | ], 324 | [ 325 | "0F18", 326 | "0F19" 327 | ], 328 | [ 329 | "0F71", 330 | "0F7E" 331 | ], 332 | [ 333 | "0F80", 334 | "0F84" 335 | ], 336 | [ 337 | "0F86", 338 | "0F87" 339 | ], 340 | [ 341 | "0F8D", 342 | "0F97" 343 | ], 344 | [ 345 | "0F99", 346 | "0FBC" 347 | ], 348 | [ 349 | "102D", 350 | "1030" 351 | ], 352 | [ 353 | "1032", 354 | "1037" 355 | ], 356 | [ 357 | "1039", 358 | "103A" 359 | ], 360 | [ 361 | "103D", 362 | "103E" 363 | ], 364 | [ 365 | "1058", 366 | "1059" 367 | ], 368 | [ 369 | "105E", 370 | "1060" 371 | ], 372 | [ 373 | "1071", 374 | "1074" 375 | ], 376 | [ 377 | "1085", 378 | "1086" 379 | ], 380 | [ 381 | "135D", 382 | "135F" 383 | ], 384 | [ 385 | "1712", 386 | "1714" 387 | ], 388 | [ 389 | "1732", 390 | "1734" 391 | ], 392 | [ 393 | "1752", 394 | "1753" 395 | ], 396 | [ 397 | "1772", 398 | "1773" 399 | ], 400 | [ 401 | "17B4", 402 | "17B5" 403 | ], 404 | [ 405 | "17B7", 406 | "17BD" 407 | ], 408 | [ 409 | "17C9", 410 | "17D3" 411 | ], 412 | [ 413 | "180B", 414 | "180D" 415 | ], 416 | [ 417 | "1885", 418 | "1886" 419 | ], 420 | [ 421 | "1920", 422 | "1922" 423 | ], 424 | [ 425 | "1927", 426 | "1928" 427 | ], 428 | [ 429 | "1939", 430 | "193B" 431 | ], 432 | [ 433 | "1A17", 434 | "1A18" 435 | ], 436 | [ 437 | "1A58", 438 | "1A5E" 439 | ], 440 | [ 441 | "1A65", 442 | "1A6C" 443 | ], 444 | [ 445 | "1A73", 446 | "1A7C" 447 | ], 448 | [ 449 | "1AB0", 450 | "1ABD" 451 | ], 452 | [ 453 | "1B00", 454 | "1B03" 455 | ], 456 | [ 457 | "1B36", 458 | "1B3A" 459 | ], 460 | [ 461 | "1B6B", 462 | "1B73" 463 | ], 464 | [ 465 | "1B80", 466 | "1B81" 467 | ], 468 | [ 469 | "1BA2", 470 | "1BA5" 471 | ], 472 | [ 473 | "1BA8", 474 | "1BA9" 475 | ], 476 | [ 477 | "1BAB", 478 | "1BAD" 479 | ], 480 | [ 481 | "1BE8", 482 | "1BE9" 483 | ], 484 | [ 485 | "1BEF", 486 | "1BF1" 487 | ], 488 
| [ 489 | "1C2C", 490 | "1C33" 491 | ], 492 | [ 493 | "1C36", 494 | "1C37" 495 | ], 496 | [ 497 | "1CD0", 498 | "1CD2" 499 | ], 500 | [ 501 | "1CD4", 502 | "1CE0" 503 | ], 504 | [ 505 | "1CE2", 506 | "1CE8" 507 | ], 508 | [ 509 | "1CF8", 510 | "1CF9" 511 | ], 512 | [ 513 | "1DC0", 514 | "1DF9" 515 | ], 516 | [ 517 | "1DFB", 518 | "1DFF" 519 | ], 520 | [ 521 | "20D0", 522 | "20DC" 523 | ], 524 | [ 525 | "20DD", 526 | "20E0" 527 | ], 528 | [ 529 | "20E2", 530 | "20E4" 531 | ], 532 | [ 533 | "20E5", 534 | "20F0" 535 | ], 536 | [ 537 | "2CEF", 538 | "2CF1" 539 | ], 540 | [ 541 | "2DE0", 542 | "2DFF" 543 | ], 544 | [ 545 | "302A", 546 | "302D" 547 | ], 548 | [ 549 | "302E", 550 | "302F" 551 | ], 552 | [ 553 | "3099", 554 | "309A" 555 | ], 556 | [ 557 | "A670", 558 | "A672" 559 | ], 560 | [ 561 | "A674", 562 | "A67D" 563 | ], 564 | [ 565 | "A69E", 566 | "A69F" 567 | ], 568 | [ 569 | "A6F0", 570 | "A6F1" 571 | ], 572 | [ 573 | "A825", 574 | "A826" 575 | ], 576 | [ 577 | "A8C4", 578 | "A8C5" 579 | ], 580 | [ 581 | "A8E0", 582 | "A8F1" 583 | ], 584 | [ 585 | "A926", 586 | "A92D" 587 | ], 588 | [ 589 | "A947", 590 | "A951" 591 | ], 592 | [ 593 | "A980", 594 | "A982" 595 | ], 596 | [ 597 | "A9B6", 598 | "A9B9" 599 | ], 600 | [ 601 | "AA29", 602 | "AA2E" 603 | ], 604 | [ 605 | "AA31", 606 | "AA32" 607 | ], 608 | [ 609 | "AA35", 610 | "AA36" 611 | ], 612 | [ 613 | "AAB2", 614 | "AAB4" 615 | ], 616 | [ 617 | "AAB7", 618 | "AAB8" 619 | ], 620 | [ 621 | "AABE", 622 | "AABF" 623 | ], 624 | [ 625 | "AAEC", 626 | "AAED" 627 | ], 628 | [ 629 | "FE00", 630 | "FE0F" 631 | ], 632 | [ 633 | "FE20", 634 | "FE2F" 635 | ], 636 | [ 637 | "FF9E", 638 | "FF9F" 639 | ], 640 | [ 641 | "10376", 642 | "1037A" 643 | ], 644 | [ 645 | "10A01", 646 | "10A03" 647 | ], 648 | [ 649 | "10A05", 650 | "10A06" 651 | ], 652 | [ 653 | "10A0C", 654 | "10A0F" 655 | ], 656 | [ 657 | "10A38", 658 | "10A3A" 659 | ], 660 | [ 661 | "10AE5", 662 | "10AE6" 663 | ], 664 | [ 665 | "11038", 666 | "11046" 667 | ], 668 | [ 
669 | "1107F", 670 | "11081" 671 | ], 672 | [ 673 | "110B3", 674 | "110B6" 675 | ], 676 | [ 677 | "110B9", 678 | "110BA" 679 | ], 680 | [ 681 | "11100", 682 | "11102" 683 | ], 684 | [ 685 | "11127", 686 | "1112B" 687 | ], 688 | [ 689 | "1112D", 690 | "11134" 691 | ], 692 | [ 693 | "11180", 694 | "11181" 695 | ], 696 | [ 697 | "111B6", 698 | "111BE" 699 | ], 700 | [ 701 | "111CA", 702 | "111CC" 703 | ], 704 | [ 705 | "1122F", 706 | "11231" 707 | ], 708 | [ 709 | "11236", 710 | "11237" 711 | ], 712 | [ 713 | "112E3", 714 | "112EA" 715 | ], 716 | [ 717 | "11300", 718 | "11301" 719 | ], 720 | [ 721 | "11366", 722 | "1136C" 723 | ], 724 | [ 725 | "11370", 726 | "11374" 727 | ], 728 | [ 729 | "11438", 730 | "1143F" 731 | ], 732 | [ 733 | "11442", 734 | "11444" 735 | ], 736 | [ 737 | "114B3", 738 | "114B8" 739 | ], 740 | [ 741 | "114BF", 742 | "114C0" 743 | ], 744 | [ 745 | "114C2", 746 | "114C3" 747 | ], 748 | [ 749 | "115B2", 750 | "115B5" 751 | ], 752 | [ 753 | "115BC", 754 | "115BD" 755 | ], 756 | [ 757 | "115BF", 758 | "115C0" 759 | ], 760 | [ 761 | "115DC", 762 | "115DD" 763 | ], 764 | [ 765 | "11633", 766 | "1163A" 767 | ], 768 | [ 769 | "1163F", 770 | "11640" 771 | ], 772 | [ 773 | "116B0", 774 | "116B5" 775 | ], 776 | [ 777 | "1171D", 778 | "1171F" 779 | ], 780 | [ 781 | "11722", 782 | "11725" 783 | ], 784 | [ 785 | "11727", 786 | "1172B" 787 | ], 788 | [ 789 | "11A01", 790 | "11A06" 791 | ], 792 | [ 793 | "11A09", 794 | "11A0A" 795 | ], 796 | [ 797 | "11A33", 798 | "11A38" 799 | ], 800 | [ 801 | "11A3B", 802 | "11A3E" 803 | ], 804 | [ 805 | "11A51", 806 | "11A56" 807 | ], 808 | [ 809 | "11A59", 810 | "11A5B" 811 | ], 812 | [ 813 | "11A8A", 814 | "11A96" 815 | ], 816 | [ 817 | "11A98", 818 | "11A99" 819 | ], 820 | [ 821 | "11C30", 822 | "11C36" 823 | ], 824 | [ 825 | "11C38", 826 | "11C3D" 827 | ], 828 | [ 829 | "11C92", 830 | "11CA7" 831 | ], 832 | [ 833 | "11CAA", 834 | "11CB0" 835 | ], 836 | [ 837 | "11CB2", 838 | "11CB3" 839 | ], 840 | [ 841 | "11CB5", 842 | 
"11CB6" 843 | ], 844 | [ 845 | "11D31", 846 | "11D36" 847 | ], 848 | [ 849 | "11D3C", 850 | "11D3D" 851 | ], 852 | [ 853 | "11D3F", 854 | "11D45" 855 | ], 856 | [ 857 | "16AF0", 858 | "16AF4" 859 | ], 860 | [ 861 | "16B30", 862 | "16B36" 863 | ], 864 | [ 865 | "16F8F", 866 | "16F92" 867 | ], 868 | [ 869 | "1BC9D", 870 | "1BC9E" 871 | ], 872 | [ 873 | "1D167", 874 | "1D169" 875 | ], 876 | [ 877 | "1D16E", 878 | "1D172" 879 | ], 880 | [ 881 | "1D17B", 882 | "1D182" 883 | ], 884 | [ 885 | "1D185", 886 | "1D18B" 887 | ], 888 | [ 889 | "1D1AA", 890 | "1D1AD" 891 | ], 892 | [ 893 | "1D242", 894 | "1D244" 895 | ], 896 | [ 897 | "1DA00", 898 | "1DA36" 899 | ], 900 | [ 901 | "1DA3B", 902 | "1DA6C" 903 | ], 904 | [ 905 | "1DA9B", 906 | "1DA9F" 907 | ], 908 | [ 909 | "1DAA1", 910 | "1DAAF" 911 | ], 912 | [ 913 | "1E000", 914 | "1E006" 915 | ], 916 | [ 917 | "1E008", 918 | "1E018" 919 | ], 920 | [ 921 | "1E01B", 922 | "1E021" 923 | ], 924 | [ 925 | "1E023", 926 | "1E024" 927 | ], 928 | [ 929 | "1E026", 930 | "1E02A" 931 | ], 932 | [ 933 | "1E8D0", 934 | "1E8D6" 935 | ], 936 | [ 937 | "1E944", 938 | "1E94A" 939 | ], 940 | [ 941 | "E0020", 942 | "E007F" 943 | ], 944 | [ 945 | "E0100", 946 | "E01EF" 947 | ] 948 | ], 949 | "single_chars": [ 950 | "05BF", 951 | "05C7", 952 | "0670", 953 | "0711", 954 | "093A", 955 | "093C", 956 | "094D", 957 | "0981", 958 | "09BC", 959 | "09BE", 960 | "09CD", 961 | "09D7", 962 | "0A3C", 963 | "0A51", 964 | "0A75", 965 | "0ABC", 966 | "0ACD", 967 | "0B01", 968 | "0B3C", 969 | "0B3E", 970 | "0B3F", 971 | "0B4D", 972 | "0B56", 973 | "0B57", 974 | "0B82", 975 | "0BBE", 976 | "0BC0", 977 | "0BCD", 978 | "0BD7", 979 | "0C00", 980 | "0C81", 981 | "0CBC", 982 | "0CBF", 983 | "0CC2", 984 | "0CC6", 985 | "0D3E", 986 | "0D4D", 987 | "0D57", 988 | "0DCA", 989 | "0DCF", 990 | "0DD6", 991 | "0DDF", 992 | "0E31", 993 | "0EB1", 994 | "0F35", 995 | "0F37", 996 | "0F39", 997 | "0FC6", 998 | "1082", 999 | "108D", 1000 | "109D", 1001 | "17C6", 1002 | "17DD", 1003 | 
"18A9", 1004 | "1932", 1005 | "1A1B", 1006 | "1A56", 1007 | "1A60", 1008 | "1A62", 1009 | "1A7F", 1010 | "1ABE", 1011 | "1B34", 1012 | "1B3C", 1013 | "1B42", 1014 | "1BE6", 1015 | "1BED", 1016 | "1CED", 1017 | "1CF4", 1018 | "200C", 1019 | "20E1", 1020 | "2D7F", 1021 | "A66F", 1022 | "A802", 1023 | "A806", 1024 | "A80B", 1025 | "A9B3", 1026 | "A9BC", 1027 | "A9E5", 1028 | "AA43", 1029 | "AA4C", 1030 | "AA7C", 1031 | "AAB0", 1032 | "AAC1", 1033 | "AAF6", 1034 | "ABE5", 1035 | "ABE8", 1036 | "ABED", 1037 | "FB1E", 1038 | "101FD", 1039 | "102E0", 1040 | "10A3F", 1041 | "11001", 1042 | "11173", 1043 | "11234", 1044 | "1123E", 1045 | "112DF", 1046 | "1133C", 1047 | "1133E", 1048 | "11340", 1049 | "11357", 1050 | "11446", 1051 | "114B0", 1052 | "114BA", 1053 | "114BD", 1054 | "115AF", 1055 | "1163D", 1056 | "116AB", 1057 | "116AD", 1058 | "116B7", 1059 | "11A47", 1060 | "11C3F", 1061 | "11D3A", 1062 | "11D47", 1063 | "1D165", 1064 | "1DA75", 1065 | "1DA84" 1066 | ] 1067 | }, 1068 | "LVT": { 1069 | "ranges": [ 1070 | [ 1071 | "AC01", 1072 | "AC1B" 1073 | ], 1074 | [ 1075 | "AC1D", 1076 | "AC37" 1077 | ], 1078 | [ 1079 | "AC39", 1080 | "AC53" 1081 | ], 1082 | [ 1083 | "AC55", 1084 | "AC6F" 1085 | ], 1086 | [ 1087 | "AC71", 1088 | "AC8B" 1089 | ], 1090 | [ 1091 | "AC8D", 1092 | "ACA7" 1093 | ], 1094 | [ 1095 | "ACA9", 1096 | "ACC3" 1097 | ], 1098 | [ 1099 | "ACC5", 1100 | "ACDF" 1101 | ], 1102 | [ 1103 | "ACE1", 1104 | "ACFB" 1105 | ], 1106 | [ 1107 | "ACFD", 1108 | "AD17" 1109 | ], 1110 | [ 1111 | "AD19", 1112 | "AD33" 1113 | ], 1114 | [ 1115 | "AD35", 1116 | "AD4F" 1117 | ], 1118 | [ 1119 | "AD51", 1120 | "AD6B" 1121 | ], 1122 | [ 1123 | "AD6D", 1124 | "AD87" 1125 | ], 1126 | [ 1127 | "AD89", 1128 | "ADA3" 1129 | ], 1130 | [ 1131 | "ADA5", 1132 | "ADBF" 1133 | ], 1134 | [ 1135 | "ADC1", 1136 | "ADDB" 1137 | ], 1138 | [ 1139 | "ADDD", 1140 | "ADF7" 1141 | ], 1142 | [ 1143 | "ADF9", 1144 | "AE13" 1145 | ], 1146 | [ 1147 | "AE15", 1148 | "AE2F" 1149 | ], 1150 | [ 1151 | 
"AE31", 1152 | "AE4B" 1153 | ], 1154 | [ 1155 | "AE4D", 1156 | "AE67" 1157 | ], 1158 | [ 1159 | "AE69", 1160 | "AE83" 1161 | ], 1162 | [ 1163 | "AE85", 1164 | "AE9F" 1165 | ], 1166 | [ 1167 | "AEA1", 1168 | "AEBB" 1169 | ], 1170 | [ 1171 | "AEBD", 1172 | "AED7" 1173 | ], 1174 | [ 1175 | "AED9", 1176 | "AEF3" 1177 | ], 1178 | [ 1179 | "AEF5", 1180 | "AF0F" 1181 | ], 1182 | [ 1183 | "AF11", 1184 | "AF2B" 1185 | ], 1186 | [ 1187 | "AF2D", 1188 | "AF47" 1189 | ], 1190 | [ 1191 | "AF49", 1192 | "AF63" 1193 | ], 1194 | [ 1195 | "AF65", 1196 | "AF7F" 1197 | ], 1198 | [ 1199 | "AF81", 1200 | "AF9B" 1201 | ], 1202 | [ 1203 | "AF9D", 1204 | "AFB7" 1205 | ], 1206 | [ 1207 | "AFB9", 1208 | "AFD3" 1209 | ], 1210 | [ 1211 | "AFD5", 1212 | "AFEF" 1213 | ], 1214 | [ 1215 | "AFF1", 1216 | "B00B" 1217 | ], 1218 | [ 1219 | "B00D", 1220 | "B027" 1221 | ], 1222 | [ 1223 | "B029", 1224 | "B043" 1225 | ], 1226 | [ 1227 | "B045", 1228 | "B05F" 1229 | ], 1230 | [ 1231 | "B061", 1232 | "B07B" 1233 | ], 1234 | [ 1235 | "B07D", 1236 | "B097" 1237 | ], 1238 | [ 1239 | "B099", 1240 | "B0B3" 1241 | ], 1242 | [ 1243 | "B0B5", 1244 | "B0CF" 1245 | ], 1246 | [ 1247 | "B0D1", 1248 | "B0EB" 1249 | ], 1250 | [ 1251 | "B0ED", 1252 | "B107" 1253 | ], 1254 | [ 1255 | "B109", 1256 | "B123" 1257 | ], 1258 | [ 1259 | "B125", 1260 | "B13F" 1261 | ], 1262 | [ 1263 | "B141", 1264 | "B15B" 1265 | ], 1266 | [ 1267 | "B15D", 1268 | "B177" 1269 | ], 1270 | [ 1271 | "B179", 1272 | "B193" 1273 | ], 1274 | [ 1275 | "B195", 1276 | "B1AF" 1277 | ], 1278 | [ 1279 | "B1B1", 1280 | "B1CB" 1281 | ], 1282 | [ 1283 | "B1CD", 1284 | "B1E7" 1285 | ], 1286 | [ 1287 | "B1E9", 1288 | "B203" 1289 | ], 1290 | [ 1291 | "B205", 1292 | "B21F" 1293 | ], 1294 | [ 1295 | "B221", 1296 | "B23B" 1297 | ], 1298 | [ 1299 | "B23D", 1300 | "B257" 1301 | ], 1302 | [ 1303 | "B259", 1304 | "B273" 1305 | ], 1306 | [ 1307 | "B275", 1308 | "B28F" 1309 | ], 1310 | [ 1311 | "B291", 1312 | "B2AB" 1313 | ], 1314 | [ 1315 | "B2AD", 1316 | "B2C7" 1317 | ], 
1318 | [ 1319 | "B2C9", 1320 | "B2E3" 1321 | ], 1322 | [ 1323 | "B2E5", 1324 | "B2FF" 1325 | ], 1326 | [ 1327 | "B301", 1328 | "B31B" 1329 | ], 1330 | [ 1331 | "B31D", 1332 | "B337" 1333 | ], 1334 | [ 1335 | "B339", 1336 | "B353" 1337 | ], 1338 | [ 1339 | "B355", 1340 | "B36F" 1341 | ], 1342 | [ 1343 | "B371", 1344 | "B38B" 1345 | ], 1346 | [ 1347 | "B38D", 1348 | "B3A7" 1349 | ], 1350 | [ 1351 | "B3A9", 1352 | "B3C3" 1353 | ], 1354 | [ 1355 | "B3C5", 1356 | "B3DF" 1357 | ], 1358 | [ 1359 | "B3E1", 1360 | "B3FB" 1361 | ], 1362 | [ 1363 | "B3FD", 1364 | "B417" 1365 | ], 1366 | [ 1367 | "B419", 1368 | "B433" 1369 | ], 1370 | [ 1371 | "B435", 1372 | "B44F" 1373 | ], 1374 | [ 1375 | "B451", 1376 | "B46B" 1377 | ], 1378 | [ 1379 | "B46D", 1380 | "B487" 1381 | ], 1382 | [ 1383 | "B489", 1384 | "B4A3" 1385 | ], 1386 | [ 1387 | "B4A5", 1388 | "B4BF" 1389 | ], 1390 | [ 1391 | "B4C1", 1392 | "B4DB" 1393 | ], 1394 | [ 1395 | "B4DD", 1396 | "B4F7" 1397 | ], 1398 | [ 1399 | "B4F9", 1400 | "B513" 1401 | ], 1402 | [ 1403 | "B515", 1404 | "B52F" 1405 | ], 1406 | [ 1407 | "B531", 1408 | "B54B" 1409 | ], 1410 | [ 1411 | "B54D", 1412 | "B567" 1413 | ], 1414 | [ 1415 | "B569", 1416 | "B583" 1417 | ], 1418 | [ 1419 | "B585", 1420 | "B59F" 1421 | ], 1422 | [ 1423 | "B5A1", 1424 | "B5BB" 1425 | ], 1426 | [ 1427 | "B5BD", 1428 | "B5D7" 1429 | ], 1430 | [ 1431 | "B5D9", 1432 | "B5F3" 1433 | ], 1434 | [ 1435 | "B5F5", 1436 | "B60F" 1437 | ], 1438 | [ 1439 | "B611", 1440 | "B62B" 1441 | ], 1442 | [ 1443 | "B62D", 1444 | "B647" 1445 | ], 1446 | [ 1447 | "B649", 1448 | "B663" 1449 | ], 1450 | [ 1451 | "B665", 1452 | "B67F" 1453 | ], 1454 | [ 1455 | "B681", 1456 | "B69B" 1457 | ], 1458 | [ 1459 | "B69D", 1460 | "B6B7" 1461 | ], 1462 | [ 1463 | "B6B9", 1464 | "B6D3" 1465 | ], 1466 | [ 1467 | "B6D5", 1468 | "B6EF" 1469 | ], 1470 | [ 1471 | "B6F1", 1472 | "B70B" 1473 | ], 1474 | [ 1475 | "B70D", 1476 | "B727" 1477 | ], 1478 | [ 1479 | "B729", 1480 | "B743" 1481 | ], 1482 | [ 1483 | "B745", 1484 | 
"B75F" 1485 | ], 1486 | [ 1487 | "B761", 1488 | "B77B" 1489 | ], 1490 | [ 1491 | "B77D", 1492 | "B797" 1493 | ], 1494 | [ 1495 | "B799", 1496 | "B7B3" 1497 | ], 1498 | [ 1499 | "B7B5", 1500 | "B7CF" 1501 | ], 1502 | [ 1503 | "B7D1", 1504 | "B7EB" 1505 | ], 1506 | [ 1507 | "B7ED", 1508 | "B807" 1509 | ], 1510 | [ 1511 | "B809", 1512 | "B823" 1513 | ], 1514 | [ 1515 | "B825", 1516 | "B83F" 1517 | ], 1518 | [ 1519 | "B841", 1520 | "B85B" 1521 | ], 1522 | [ 1523 | "B85D", 1524 | "B877" 1525 | ], 1526 | [ 1527 | "B879", 1528 | "B893" 1529 | ], 1530 | [ 1531 | "B895", 1532 | "B8AF" 1533 | ], 1534 | [ 1535 | "B8B1", 1536 | "B8CB" 1537 | ], 1538 | [ 1539 | "B8CD", 1540 | "B8E7" 1541 | ], 1542 | [ 1543 | "B8E9", 1544 | "B903" 1545 | ], 1546 | [ 1547 | "B905", 1548 | "B91F" 1549 | ], 1550 | [ 1551 | "B921", 1552 | "B93B" 1553 | ], 1554 | [ 1555 | "B93D", 1556 | "B957" 1557 | ], 1558 | [ 1559 | "B959", 1560 | "B973" 1561 | ], 1562 | [ 1563 | "B975", 1564 | "B98F" 1565 | ], 1566 | [ 1567 | "B991", 1568 | "B9AB" 1569 | ], 1570 | [ 1571 | "B9AD", 1572 | "B9C7" 1573 | ], 1574 | [ 1575 | "B9C9", 1576 | "B9E3" 1577 | ], 1578 | [ 1579 | "B9E5", 1580 | "B9FF" 1581 | ], 1582 | [ 1583 | "BA01", 1584 | "BA1B" 1585 | ], 1586 | [ 1587 | "BA1D", 1588 | "BA37" 1589 | ], 1590 | [ 1591 | "BA39", 1592 | "BA53" 1593 | ], 1594 | [ 1595 | "BA55", 1596 | "BA6F" 1597 | ], 1598 | [ 1599 | "BA71", 1600 | "BA8B" 1601 | ], 1602 | [ 1603 | "BA8D", 1604 | "BAA7" 1605 | ], 1606 | [ 1607 | "BAA9", 1608 | "BAC3" 1609 | ], 1610 | [ 1611 | "BAC5", 1612 | "BADF" 1613 | ], 1614 | [ 1615 | "BAE1", 1616 | "BAFB" 1617 | ], 1618 | [ 1619 | "BAFD", 1620 | "BB17" 1621 | ], 1622 | [ 1623 | "BB19", 1624 | "BB33" 1625 | ], 1626 | [ 1627 | "BB35", 1628 | "BB4F" 1629 | ], 1630 | [ 1631 | "BB51", 1632 | "BB6B" 1633 | ], 1634 | [ 1635 | "BB6D", 1636 | "BB87" 1637 | ], 1638 | [ 1639 | "BB89", 1640 | "BBA3" 1641 | ], 1642 | [ 1643 | "BBA5", 1644 | "BBBF" 1645 | ], 1646 | [ 1647 | "BBC1", 1648 | "BBDB" 1649 | ], 1650 | [ 1651 
| "BBDD", 1652 | "BBF7" 1653 | ], 1654 | [ 1655 | "BBF9", 1656 | "BC13" 1657 | ], 1658 | [ 1659 | "BC15", 1660 | "BC2F" 1661 | ], 1662 | [ 1663 | "BC31", 1664 | "BC4B" 1665 | ], 1666 | [ 1667 | "BC4D", 1668 | "BC67" 1669 | ], 1670 | [ 1671 | "BC69", 1672 | "BC83" 1673 | ], 1674 | [ 1675 | "BC85", 1676 | "BC9F" 1677 | ], 1678 | [ 1679 | "BCA1", 1680 | "BCBB" 1681 | ], 1682 | [ 1683 | "BCBD", 1684 | "BCD7" 1685 | ], 1686 | [ 1687 | "BCD9", 1688 | "BCF3" 1689 | ], 1690 | [ 1691 | "BCF5", 1692 | "BD0F" 1693 | ], 1694 | [ 1695 | "BD11", 1696 | "BD2B" 1697 | ], 1698 | [ 1699 | "BD2D", 1700 | "BD47" 1701 | ], 1702 | [ 1703 | "BD49", 1704 | "BD63" 1705 | ], 1706 | [ 1707 | "BD65", 1708 | "BD7F" 1709 | ], 1710 | [ 1711 | "BD81", 1712 | "BD9B" 1713 | ], 1714 | [ 1715 | "BD9D", 1716 | "BDB7" 1717 | ], 1718 | [ 1719 | "BDB9", 1720 | "BDD3" 1721 | ], 1722 | [ 1723 | "BDD5", 1724 | "BDEF" 1725 | ], 1726 | [ 1727 | "BDF1", 1728 | "BE0B" 1729 | ], 1730 | [ 1731 | "BE0D", 1732 | "BE27" 1733 | ], 1734 | [ 1735 | "BE29", 1736 | "BE43" 1737 | ], 1738 | [ 1739 | "BE45", 1740 | "BE5F" 1741 | ], 1742 | [ 1743 | "BE61", 1744 | "BE7B" 1745 | ], 1746 | [ 1747 | "BE7D", 1748 | "BE97" 1749 | ], 1750 | [ 1751 | "BE99", 1752 | "BEB3" 1753 | ], 1754 | [ 1755 | "BEB5", 1756 | "BECF" 1757 | ], 1758 | [ 1759 | "BED1", 1760 | "BEEB" 1761 | ], 1762 | [ 1763 | "BEED", 1764 | "BF07" 1765 | ], 1766 | [ 1767 | "BF09", 1768 | "BF23" 1769 | ], 1770 | [ 1771 | "BF25", 1772 | "BF3F" 1773 | ], 1774 | [ 1775 | "BF41", 1776 | "BF5B" 1777 | ], 1778 | [ 1779 | "BF5D", 1780 | "BF77" 1781 | ], 1782 | [ 1783 | "BF79", 1784 | "BF93" 1785 | ], 1786 | [ 1787 | "BF95", 1788 | "BFAF" 1789 | ], 1790 | [ 1791 | "BFB1", 1792 | "BFCB" 1793 | ], 1794 | [ 1795 | "BFCD", 1796 | "BFE7" 1797 | ], 1798 | [ 1799 | "BFE9", 1800 | "C003" 1801 | ], 1802 | [ 1803 | "C005", 1804 | "C01F" 1805 | ], 1806 | [ 1807 | "C021", 1808 | "C03B" 1809 | ], 1810 | [ 1811 | "C03D", 1812 | "C057" 1813 | ], 1814 | [ 1815 | "C059", 1816 | "C073" 1817 | 
], 1818 | [ 1819 | "C075", 1820 | "C08F" 1821 | ], 1822 | [ 1823 | "C091", 1824 | "C0AB" 1825 | ], 1826 | [ 1827 | "C0AD", 1828 | "C0C7" 1829 | ], 1830 | [ 1831 | "C0C9", 1832 | "C0E3" 1833 | ], 1834 | [ 1835 | "C0E5", 1836 | "C0FF" 1837 | ], 1838 | [ 1839 | "C101", 1840 | "C11B" 1841 | ], 1842 | [ 1843 | "C11D", 1844 | "C137" 1845 | ], 1846 | [ 1847 | "C139", 1848 | "C153" 1849 | ], 1850 | [ 1851 | "C155", 1852 | "C16F" 1853 | ], 1854 | [ 1855 | "C171", 1856 | "C18B" 1857 | ], 1858 | [ 1859 | "C18D", 1860 | "C1A7" 1861 | ], 1862 | [ 1863 | "C1A9", 1864 | "C1C3" 1865 | ], 1866 | [ 1867 | "C1C5", 1868 | "C1DF" 1869 | ], 1870 | [ 1871 | "C1E1", 1872 | "C1FB" 1873 | ], 1874 | [ 1875 | "C1FD", 1876 | "C217" 1877 | ], 1878 | [ 1879 | "C219", 1880 | "C233" 1881 | ], 1882 | [ 1883 | "C235", 1884 | "C24F" 1885 | ], 1886 | [ 1887 | "C251", 1888 | "C26B" 1889 | ], 1890 | [ 1891 | "C26D", 1892 | "C287" 1893 | ], 1894 | [ 1895 | "C289", 1896 | "C2A3" 1897 | ], 1898 | [ 1899 | "C2A5", 1900 | "C2BF" 1901 | ], 1902 | [ 1903 | "C2C1", 1904 | "C2DB" 1905 | ], 1906 | [ 1907 | "C2DD", 1908 | "C2F7" 1909 | ], 1910 | [ 1911 | "C2F9", 1912 | "C313" 1913 | ], 1914 | [ 1915 | "C315", 1916 | "C32F" 1917 | ], 1918 | [ 1919 | "C331", 1920 | "C34B" 1921 | ], 1922 | [ 1923 | "C34D", 1924 | "C367" 1925 | ], 1926 | [ 1927 | "C369", 1928 | "C383" 1929 | ], 1930 | [ 1931 | "C385", 1932 | "C39F" 1933 | ], 1934 | [ 1935 | "C3A1", 1936 | "C3BB" 1937 | ], 1938 | [ 1939 | "C3BD", 1940 | "C3D7" 1941 | ], 1942 | [ 1943 | "C3D9", 1944 | "C3F3" 1945 | ], 1946 | [ 1947 | "C3F5", 1948 | "C40F" 1949 | ], 1950 | [ 1951 | "C411", 1952 | "C42B" 1953 | ], 1954 | [ 1955 | "C42D", 1956 | "C447" 1957 | ], 1958 | [ 1959 | "C449", 1960 | "C463" 1961 | ], 1962 | [ 1963 | "C465", 1964 | "C47F" 1965 | ], 1966 | [ 1967 | "C481", 1968 | "C49B" 1969 | ], 1970 | [ 1971 | "C49D", 1972 | "C4B7" 1973 | ], 1974 | [ 1975 | "C4B9", 1976 | "C4D3" 1977 | ], 1978 | [ 1979 | "C4D5", 1980 | "C4EF" 1981 | ], 1982 | [ 1983 | "C4F1", 1984 
| "C50B" 1985 | ], 1986 | [ 1987 | "C50D", 1988 | "C527" 1989 | ], 1990 | [ 1991 | "C529", 1992 | "C543" 1993 | ], 1994 | [ 1995 | "C545", 1996 | "C55F" 1997 | ], 1998 | [ 1999 | "C561", 2000 | "C57B" 2001 | ], 2002 | [ 2003 | "C57D", 2004 | "C597" 2005 | ], 2006 | [ 2007 | "C599", 2008 | "C5B3" 2009 | ], 2010 | [ 2011 | "C5B5", 2012 | "C5CF" 2013 | ], 2014 | [ 2015 | "C5D1", 2016 | "C5EB" 2017 | ], 2018 | [ 2019 | "C5ED", 2020 | "C607" 2021 | ], 2022 | [ 2023 | "C609", 2024 | "C623" 2025 | ], 2026 | [ 2027 | "C625", 2028 | "C63F" 2029 | ], 2030 | [ 2031 | "C641", 2032 | "C65B" 2033 | ], 2034 | [ 2035 | "C65D", 2036 | "C677" 2037 | ], 2038 | [ 2039 | "C679", 2040 | "C693" 2041 | ], 2042 | [ 2043 | "C695", 2044 | "C6AF" 2045 | ], 2046 | [ 2047 | "C6B1", 2048 | "C6CB" 2049 | ], 2050 | [ 2051 | "C6CD", 2052 | "C6E7" 2053 | ], 2054 | [ 2055 | "C6E9", 2056 | "C703" 2057 | ], 2058 | [ 2059 | "C705", 2060 | "C71F" 2061 | ], 2062 | [ 2063 | "C721", 2064 | "C73B" 2065 | ], 2066 | [ 2067 | "C73D", 2068 | "C757" 2069 | ], 2070 | [ 2071 | "C759", 2072 | "C773" 2073 | ], 2074 | [ 2075 | "C775", 2076 | "C78F" 2077 | ], 2078 | [ 2079 | "C791", 2080 | "C7AB" 2081 | ], 2082 | [ 2083 | "C7AD", 2084 | "C7C7" 2085 | ], 2086 | [ 2087 | "C7C9", 2088 | "C7E3" 2089 | ], 2090 | [ 2091 | "C7E5", 2092 | "C7FF" 2093 | ], 2094 | [ 2095 | "C801", 2096 | "C81B" 2097 | ], 2098 | [ 2099 | "C81D", 2100 | "C837" 2101 | ], 2102 | [ 2103 | "C839", 2104 | "C853" 2105 | ], 2106 | [ 2107 | "C855", 2108 | "C86F" 2109 | ], 2110 | [ 2111 | "C871", 2112 | "C88B" 2113 | ], 2114 | [ 2115 | "C88D", 2116 | "C8A7" 2117 | ], 2118 | [ 2119 | "C8A9", 2120 | "C8C3" 2121 | ], 2122 | [ 2123 | "C8C5", 2124 | "C8DF" 2125 | ], 2126 | [ 2127 | "C8E1", 2128 | "C8FB" 2129 | ], 2130 | [ 2131 | "C8FD", 2132 | "C917" 2133 | ], 2134 | [ 2135 | "C919", 2136 | "C933" 2137 | ], 2138 | [ 2139 | "C935", 2140 | "C94F" 2141 | ], 2142 | [ 2143 | "C951", 2144 | "C96B" 2145 | ], 2146 | [ 2147 | "C96D", 2148 | "C987" 2149 | ], 2150 | [ 
2151 | "C989", 2152 | "C9A3" 2153 | ], 2154 | [ 2155 | "C9A5", 2156 | "C9BF" 2157 | ], 2158 | [ 2159 | "C9C1", 2160 | "C9DB" 2161 | ], 2162 | [ 2163 | "C9DD", 2164 | "C9F7" 2165 | ], 2166 | [ 2167 | "C9F9", 2168 | "CA13" 2169 | ], 2170 | [ 2171 | "CA15", 2172 | "CA2F" 2173 | ], 2174 | [ 2175 | "CA31", 2176 | "CA4B" 2177 | ], 2178 | [ 2179 | "CA4D", 2180 | "CA67" 2181 | ], 2182 | [ 2183 | "CA69", 2184 | "CA83" 2185 | ], 2186 | [ 2187 | "CA85", 2188 | "CA9F" 2189 | ], 2190 | [ 2191 | "CAA1", 2192 | "CABB" 2193 | ], 2194 | [ 2195 | "CABD", 2196 | "CAD7" 2197 | ], 2198 | [ 2199 | "CAD9", 2200 | "CAF3" 2201 | ], 2202 | [ 2203 | "CAF5", 2204 | "CB0F" 2205 | ], 2206 | [ 2207 | "CB11", 2208 | "CB2B" 2209 | ], 2210 | [ 2211 | "CB2D", 2212 | "CB47" 2213 | ], 2214 | [ 2215 | "CB49", 2216 | "CB63" 2217 | ], 2218 | [ 2219 | "CB65", 2220 | "CB7F" 2221 | ], 2222 | [ 2223 | "CB81", 2224 | "CB9B" 2225 | ], 2226 | [ 2227 | "CB9D", 2228 | "CBB7" 2229 | ], 2230 | [ 2231 | "CBB9", 2232 | "CBD3" 2233 | ], 2234 | [ 2235 | "CBD5", 2236 | "CBEF" 2237 | ], 2238 | [ 2239 | "CBF1", 2240 | "CC0B" 2241 | ], 2242 | [ 2243 | "CC0D", 2244 | "CC27" 2245 | ], 2246 | [ 2247 | "CC29", 2248 | "CC43" 2249 | ], 2250 | [ 2251 | "CC45", 2252 | "CC5F" 2253 | ], 2254 | [ 2255 | "CC61", 2256 | "CC7B" 2257 | ], 2258 | [ 2259 | "CC7D", 2260 | "CC97" 2261 | ], 2262 | [ 2263 | "CC99", 2264 | "CCB3" 2265 | ], 2266 | [ 2267 | "CCB5", 2268 | "CCCF" 2269 | ], 2270 | [ 2271 | "CCD1", 2272 | "CCEB" 2273 | ], 2274 | [ 2275 | "CCED", 2276 | "CD07" 2277 | ], 2278 | [ 2279 | "CD09", 2280 | "CD23" 2281 | ], 2282 | [ 2283 | "CD25", 2284 | "CD3F" 2285 | ], 2286 | [ 2287 | "CD41", 2288 | "CD5B" 2289 | ], 2290 | [ 2291 | "CD5D", 2292 | "CD77" 2293 | ], 2294 | [ 2295 | "CD79", 2296 | "CD93" 2297 | ], 2298 | [ 2299 | "CD95", 2300 | "CDAF" 2301 | ], 2302 | [ 2303 | "CDB1", 2304 | "CDCB" 2305 | ], 2306 | [ 2307 | "CDCD", 2308 | "CDE7" 2309 | ], 2310 | [ 2311 | "CDE9", 2312 | "CE03" 2313 | ], 2314 | [ 2315 | "CE05", 2316 | "CE1F" 
2317 | ], 2318 | [ 2319 | "CE21", 2320 | "CE3B" 2321 | ], 2322 | [ 2323 | "CE3D", 2324 | "CE57" 2325 | ], 2326 | [ 2327 | "CE59", 2328 | "CE73" 2329 | ], 2330 | [ 2331 | "CE75", 2332 | "CE8F" 2333 | ], 2334 | [ 2335 | "CE91", 2336 | "CEAB" 2337 | ], 2338 | [ 2339 | "CEAD", 2340 | "CEC7" 2341 | ], 2342 | [ 2343 | "CEC9", 2344 | "CEE3" 2345 | ], 2346 | [ 2347 | "CEE5", 2348 | "CEFF" 2349 | ], 2350 | [ 2351 | "CF01", 2352 | "CF1B" 2353 | ], 2354 | [ 2355 | "CF1D", 2356 | "CF37" 2357 | ], 2358 | [ 2359 | "CF39", 2360 | "CF53" 2361 | ], 2362 | [ 2363 | "CF55", 2364 | "CF6F" 2365 | ], 2366 | [ 2367 | "CF71", 2368 | "CF8B" 2369 | ], 2370 | [ 2371 | "CF8D", 2372 | "CFA7" 2373 | ], 2374 | [ 2375 | "CFA9", 2376 | "CFC3" 2377 | ], 2378 | [ 2379 | "CFC5", 2380 | "CFDF" 2381 | ], 2382 | [ 2383 | "CFE1", 2384 | "CFFB" 2385 | ], 2386 | [ 2387 | "CFFD", 2388 | "D017" 2389 | ], 2390 | [ 2391 | "D019", 2392 | "D033" 2393 | ], 2394 | [ 2395 | "D035", 2396 | "D04F" 2397 | ], 2398 | [ 2399 | "D051", 2400 | "D06B" 2401 | ], 2402 | [ 2403 | "D06D", 2404 | "D087" 2405 | ], 2406 | [ 2407 | "D089", 2408 | "D0A3" 2409 | ], 2410 | [ 2411 | "D0A5", 2412 | "D0BF" 2413 | ], 2414 | [ 2415 | "D0C1", 2416 | "D0DB" 2417 | ], 2418 | [ 2419 | "D0DD", 2420 | "D0F7" 2421 | ], 2422 | [ 2423 | "D0F9", 2424 | "D113" 2425 | ], 2426 | [ 2427 | "D115", 2428 | "D12F" 2429 | ], 2430 | [ 2431 | "D131", 2432 | "D14B" 2433 | ], 2434 | [ 2435 | "D14D", 2436 | "D167" 2437 | ], 2438 | [ 2439 | "D169", 2440 | "D183" 2441 | ], 2442 | [ 2443 | "D185", 2444 | "D19F" 2445 | ], 2446 | [ 2447 | "D1A1", 2448 | "D1BB" 2449 | ], 2450 | [ 2451 | "D1BD", 2452 | "D1D7" 2453 | ], 2454 | [ 2455 | "D1D9", 2456 | "D1F3" 2457 | ], 2458 | [ 2459 | "D1F5", 2460 | "D20F" 2461 | ], 2462 | [ 2463 | "D211", 2464 | "D22B" 2465 | ], 2466 | [ 2467 | "D22D", 2468 | "D247" 2469 | ], 2470 | [ 2471 | "D249", 2472 | "D263" 2473 | ], 2474 | [ 2475 | "D265", 2476 | "D27F" 2477 | ], 2478 | [ 2479 | "D281", 2480 | "D29B" 2481 | ], 2482 | [ 2483 | 
"D29D", 2484 | "D2B7" 2485 | ], 2486 | [ 2487 | "D2B9", 2488 | "D2D3" 2489 | ], 2490 | [ 2491 | "D2D5", 2492 | "D2EF" 2493 | ], 2494 | [ 2495 | "D2F1", 2496 | "D30B" 2497 | ], 2498 | [ 2499 | "D30D", 2500 | "D327" 2501 | ], 2502 | [ 2503 | "D329", 2504 | "D343" 2505 | ], 2506 | [ 2507 | "D345", 2508 | "D35F" 2509 | ], 2510 | [ 2511 | "D361", 2512 | "D37B" 2513 | ], 2514 | [ 2515 | "D37D", 2516 | "D397" 2517 | ], 2518 | [ 2519 | "D399", 2520 | "D3B3" 2521 | ], 2522 | [ 2523 | "D3B5", 2524 | "D3CF" 2525 | ], 2526 | [ 2527 | "D3D1", 2528 | "D3EB" 2529 | ], 2530 | [ 2531 | "D3ED", 2532 | "D407" 2533 | ], 2534 | [ 2535 | "D409", 2536 | "D423" 2537 | ], 2538 | [ 2539 | "D425", 2540 | "D43F" 2541 | ], 2542 | [ 2543 | "D441", 2544 | "D45B" 2545 | ], 2546 | [ 2547 | "D45D", 2548 | "D477" 2549 | ], 2550 | [ 2551 | "D479", 2552 | "D493" 2553 | ], 2554 | [ 2555 | "D495", 2556 | "D4AF" 2557 | ], 2558 | [ 2559 | "D4B1", 2560 | "D4CB" 2561 | ], 2562 | [ 2563 | "D4CD", 2564 | "D4E7" 2565 | ], 2566 | [ 2567 | "D4E9", 2568 | "D503" 2569 | ], 2570 | [ 2571 | "D505", 2572 | "D51F" 2573 | ], 2574 | [ 2575 | "D521", 2576 | "D53B" 2577 | ], 2578 | [ 2579 | "D53D", 2580 | "D557" 2581 | ], 2582 | [ 2583 | "D559", 2584 | "D573" 2585 | ], 2586 | [ 2587 | "D575", 2588 | "D58F" 2589 | ], 2590 | [ 2591 | "D591", 2592 | "D5AB" 2593 | ], 2594 | [ 2595 | "D5AD", 2596 | "D5C7" 2597 | ], 2598 | [ 2599 | "D5C9", 2600 | "D5E3" 2601 | ], 2602 | [ 2603 | "D5E5", 2604 | "D5FF" 2605 | ], 2606 | [ 2607 | "D601", 2608 | "D61B" 2609 | ], 2610 | [ 2611 | "D61D", 2612 | "D637" 2613 | ], 2614 | [ 2615 | "D639", 2616 | "D653" 2617 | ], 2618 | [ 2619 | "D655", 2620 | "D66F" 2621 | ], 2622 | [ 2623 | "D671", 2624 | "D68B" 2625 | ], 2626 | [ 2627 | "D68D", 2628 | "D6A7" 2629 | ], 2630 | [ 2631 | "D6A9", 2632 | "D6C3" 2633 | ], 2634 | [ 2635 | "D6C5", 2636 | "D6DF" 2637 | ], 2638 | [ 2639 | "D6E1", 2640 | "D6FB" 2641 | ], 2642 | [ 2643 | "D6FD", 2644 | "D717" 2645 | ], 2646 | [ 2647 | "D719", 2648 | "D733" 2649 | ], 
2650 | [ 2651 | "D735", 2652 | "D74F" 2653 | ], 2654 | [ 2655 | "D751", 2656 | "D76B" 2657 | ], 2658 | [ 2659 | "D76D", 2660 | "D787" 2661 | ], 2662 | [ 2663 | "D789", 2664 | "D7A3" 2665 | ] 2666 | ], 2667 | "single_chars": [] 2668 | }, 2669 | "E_Modifier": { 2670 | "ranges": [ 2671 | [ 2672 | "1F3FB", 2673 | "1F3FF" 2674 | ] 2675 | ], 2676 | "single_chars": [] 2677 | }, 2678 | "E_Base": { 2679 | "ranges": [ 2680 | [ 2681 | "270A", 2682 | "270D" 2683 | ], 2684 | [ 2685 | "1F3C2", 2686 | "1F3C4" 2687 | ], 2688 | [ 2689 | "1F3CA", 2690 | "1F3CC" 2691 | ], 2692 | [ 2693 | "1F442", 2694 | "1F443" 2695 | ], 2696 | [ 2697 | "1F446", 2698 | "1F450" 2699 | ], 2700 | [ 2701 | "1F470", 2702 | "1F478" 2703 | ], 2704 | [ 2705 | "1F481", 2706 | "1F483" 2707 | ], 2708 | [ 2709 | "1F485", 2710 | "1F487" 2711 | ], 2712 | [ 2713 | "1F574", 2714 | "1F575" 2715 | ], 2716 | [ 2717 | "1F595", 2718 | "1F596" 2719 | ], 2720 | [ 2721 | "1F645", 2722 | "1F647" 2723 | ], 2724 | [ 2725 | "1F64B", 2726 | "1F64F" 2727 | ], 2728 | [ 2729 | "1F6B4", 2730 | "1F6B6" 2731 | ], 2732 | [ 2733 | "1F918", 2734 | "1F91C" 2735 | ], 2736 | [ 2737 | "1F91E", 2738 | "1F91F" 2739 | ], 2740 | [ 2741 | "1F930", 2742 | "1F939" 2743 | ], 2744 | [ 2745 | "1F93D", 2746 | "1F93E" 2747 | ], 2748 | [ 2749 | "1F9D1", 2750 | "1F9DD" 2751 | ] 2752 | ], 2753 | "single_chars": [ 2754 | "261D", 2755 | "26F9", 2756 | "1F385", 2757 | "1F3C7", 2758 | "1F46E", 2759 | "1F47C", 2760 | "1F4AA", 2761 | "1F57A", 2762 | "1F590", 2763 | "1F6A3", 2764 | "1F6C0", 2765 | "1F6CC", 2766 | "1F926" 2767 | ] 2768 | }, 2769 | "V": { 2770 | "ranges": [ 2771 | [ 2772 | "1160", 2773 | "11A7" 2774 | ], 2775 | [ 2776 | "D7B0", 2777 | "D7C6" 2778 | ] 2779 | ], 2780 | "single_chars": [] 2781 | }, 2782 | "L": { 2783 | "ranges": [ 2784 | [ 2785 | "1100", 2786 | "115F" 2787 | ], 2788 | [ 2789 | "A960", 2790 | "A97C" 2791 | ] 2792 | ], 2793 | "single_chars": [] 2794 | }, 2795 | "Glue_After_Zwj": { 2796 | "ranges": [ 2797 | [ 2798 | "2695", 2799 | "2696" 
2800 | ], 2801 | [ 2802 | "1F4BB", 2803 | "1F4BC" 2804 | ] 2805 | ], 2806 | "single_chars": [ 2807 | "2640", 2808 | "2642", 2809 | "2708", 2810 | "2764", 2811 | "1F308", 2812 | "1F33E", 2813 | "1F373", 2814 | "1F393", 2815 | "1F3A4", 2816 | "1F3A8", 2817 | "1F3EB", 2818 | "1F3ED", 2819 | "1F48B", 2820 | "1F527", 2821 | "1F52C", 2822 | "1F5E8", 2823 | "1F680", 2824 | "1F692" 2825 | ] 2826 | }, 2827 | "Regional_Indicator": { 2828 | "ranges": [ 2829 | [ 2830 | "1F1E6", 2831 | "1F1FF" 2832 | ] 2833 | ], 2834 | "single_chars": [] 2835 | }, 2836 | "LV": { 2837 | "ranges": [], 2838 | "single_chars": [ 2839 | "AC00", 2840 | "AC1C", 2841 | "AC38", 2842 | "AC54", 2843 | "AC70", 2844 | "AC8C", 2845 | "ACA8", 2846 | "ACC4", 2847 | "ACE0", 2848 | "ACFC", 2849 | "AD18", 2850 | "AD34", 2851 | "AD50", 2852 | "AD6C", 2853 | "AD88", 2854 | "ADA4", 2855 | "ADC0", 2856 | "ADDC", 2857 | "ADF8", 2858 | "AE14", 2859 | "AE30", 2860 | "AE4C", 2861 | "AE68", 2862 | "AE84", 2863 | "AEA0", 2864 | "AEBC", 2865 | "AED8", 2866 | "AEF4", 2867 | "AF10", 2868 | "AF2C", 2869 | "AF48", 2870 | "AF64", 2871 | "AF80", 2872 | "AF9C", 2873 | "AFB8", 2874 | "AFD4", 2875 | "AFF0", 2876 | "B00C", 2877 | "B028", 2878 | "B044", 2879 | "B060", 2880 | "B07C", 2881 | "B098", 2882 | "B0B4", 2883 | "B0D0", 2884 | "B0EC", 2885 | "B108", 2886 | "B124", 2887 | "B140", 2888 | "B15C", 2889 | "B178", 2890 | "B194", 2891 | "B1B0", 2892 | "B1CC", 2893 | "B1E8", 2894 | "B204", 2895 | "B220", 2896 | "B23C", 2897 | "B258", 2898 | "B274", 2899 | "B290", 2900 | "B2AC", 2901 | "B2C8", 2902 | "B2E4", 2903 | "B300", 2904 | "B31C", 2905 | "B338", 2906 | "B354", 2907 | "B370", 2908 | "B38C", 2909 | "B3A8", 2910 | "B3C4", 2911 | "B3E0", 2912 | "B3FC", 2913 | "B418", 2914 | "B434", 2915 | "B450", 2916 | "B46C", 2917 | "B488", 2918 | "B4A4", 2919 | "B4C0", 2920 | "B4DC", 2921 | "B4F8", 2922 | "B514", 2923 | "B530", 2924 | "B54C", 2925 | "B568", 2926 | "B584", 2927 | "B5A0", 2928 | "B5BC", 2929 | "B5D8", 2930 | "B5F4", 2931 | "B610", 
2932 | "B62C", 2933 | "B648", 2934 | "B664", 2935 | "B680", 2936 | "B69C", 2937 | "B6B8", 2938 | "B6D4", 2939 | "B6F0", 2940 | "B70C", 2941 | "B728", 2942 | "B744", 2943 | "B760", 2944 | "B77C", 2945 | "B798", 2946 | "B7B4", 2947 | "B7D0", 2948 | "B7EC", 2949 | "B808", 2950 | "B824", 2951 | "B840", 2952 | "B85C", 2953 | "B878", 2954 | "B894", 2955 | "B8B0", 2956 | "B8CC", 2957 | "B8E8", 2958 | "B904", 2959 | "B920", 2960 | "B93C", 2961 | "B958", 2962 | "B974", 2963 | "B990", 2964 | "B9AC", 2965 | "B9C8", 2966 | "B9E4", 2967 | "BA00", 2968 | "BA1C", 2969 | "BA38", 2970 | "BA54", 2971 | "BA70", 2972 | "BA8C", 2973 | "BAA8", 2974 | "BAC4", 2975 | "BAE0", 2976 | "BAFC", 2977 | "BB18", 2978 | "BB34", 2979 | "BB50", 2980 | "BB6C", 2981 | "BB88", 2982 | "BBA4", 2983 | "BBC0", 2984 | "BBDC", 2985 | "BBF8", 2986 | "BC14", 2987 | "BC30", 2988 | "BC4C", 2989 | "BC68", 2990 | "BC84", 2991 | "BCA0", 2992 | "BCBC", 2993 | "BCD8", 2994 | "BCF4", 2995 | "BD10", 2996 | "BD2C", 2997 | "BD48", 2998 | "BD64", 2999 | "BD80", 3000 | "BD9C", 3001 | "BDB8", 3002 | "BDD4", 3003 | "BDF0", 3004 | "BE0C", 3005 | "BE28", 3006 | "BE44", 3007 | "BE60", 3008 | "BE7C", 3009 | "BE98", 3010 | "BEB4", 3011 | "BED0", 3012 | "BEEC", 3013 | "BF08", 3014 | "BF24", 3015 | "BF40", 3016 | "BF5C", 3017 | "BF78", 3018 | "BF94", 3019 | "BFB0", 3020 | "BFCC", 3021 | "BFE8", 3022 | "C004", 3023 | "C020", 3024 | "C03C", 3025 | "C058", 3026 | "C074", 3027 | "C090", 3028 | "C0AC", 3029 | "C0C8", 3030 | "C0E4", 3031 | "C100", 3032 | "C11C", 3033 | "C138", 3034 | "C154", 3035 | "C170", 3036 | "C18C", 3037 | "C1A8", 3038 | "C1C4", 3039 | "C1E0", 3040 | "C1FC", 3041 | "C218", 3042 | "C234", 3043 | "C250", 3044 | "C26C", 3045 | "C288", 3046 | "C2A4", 3047 | "C2C0", 3048 | "C2DC", 3049 | "C2F8", 3050 | "C314", 3051 | "C330", 3052 | "C34C", 3053 | "C368", 3054 | "C384", 3055 | "C3A0", 3056 | "C3BC", 3057 | "C3D8", 3058 | "C3F4", 3059 | "C410", 3060 | "C42C", 3061 | "C448", 3062 | "C464", 3063 | "C480", 3064 | "C49C", 3065 
| "C4B8", 3066 | "C4D4", 3067 | "C4F0", 3068 | "C50C", 3069 | "C528", 3070 | "C544", 3071 | "C560", 3072 | "C57C", 3073 | "C598", 3074 | "C5B4", 3075 | "C5D0", 3076 | "C5EC", 3077 | "C608", 3078 | "C624", 3079 | "C640", 3080 | "C65C", 3081 | "C678", 3082 | "C694", 3083 | "C6B0", 3084 | "C6CC", 3085 | "C6E8", 3086 | "C704", 3087 | "C720", 3088 | "C73C", 3089 | "C758", 3090 | "C774", 3091 | "C790", 3092 | "C7AC", 3093 | "C7C8", 3094 | "C7E4", 3095 | "C800", 3096 | "C81C", 3097 | "C838", 3098 | "C854", 3099 | "C870", 3100 | "C88C", 3101 | "C8A8", 3102 | "C8C4", 3103 | "C8E0", 3104 | "C8FC", 3105 | "C918", 3106 | "C934", 3107 | "C950", 3108 | "C96C", 3109 | "C988", 3110 | "C9A4", 3111 | "C9C0", 3112 | "C9DC", 3113 | "C9F8", 3114 | "CA14", 3115 | "CA30", 3116 | "CA4C", 3117 | "CA68", 3118 | "CA84", 3119 | "CAA0", 3120 | "CABC", 3121 | "CAD8", 3122 | "CAF4", 3123 | "CB10", 3124 | "CB2C", 3125 | "CB48", 3126 | "CB64", 3127 | "CB80", 3128 | "CB9C", 3129 | "CBB8", 3130 | "CBD4", 3131 | "CBF0", 3132 | "CC0C", 3133 | "CC28", 3134 | "CC44", 3135 | "CC60", 3136 | "CC7C", 3137 | "CC98", 3138 | "CCB4", 3139 | "CCD0", 3140 | "CCEC", 3141 | "CD08", 3142 | "CD24", 3143 | "CD40", 3144 | "CD5C", 3145 | "CD78", 3146 | "CD94", 3147 | "CDB0", 3148 | "CDCC", 3149 | "CDE8", 3150 | "CE04", 3151 | "CE20", 3152 | "CE3C", 3153 | "CE58", 3154 | "CE74", 3155 | "CE90", 3156 | "CEAC", 3157 | "CEC8", 3158 | "CEE4", 3159 | "CF00", 3160 | "CF1C", 3161 | "CF38", 3162 | "CF54", 3163 | "CF70", 3164 | "CF8C", 3165 | "CFA8", 3166 | "CFC4", 3167 | "CFE0", 3168 | "CFFC", 3169 | "D018", 3170 | "D034", 3171 | "D050", 3172 | "D06C", 3173 | "D088", 3174 | "D0A4", 3175 | "D0C0", 3176 | "D0DC", 3177 | "D0F8", 3178 | "D114", 3179 | "D130", 3180 | "D14C", 3181 | "D168", 3182 | "D184", 3183 | "D1A0", 3184 | "D1BC", 3185 | "D1D8", 3186 | "D1F4", 3187 | "D210", 3188 | "D22C", 3189 | "D248", 3190 | "D264", 3191 | "D280", 3192 | "D29C", 3193 | "D2B8", 3194 | "D2D4", 3195 | "D2F0", 3196 | "D30C", 3197 | "D328", 3198 | 
"D344", 3199 | "D360", 3200 | "D37C", 3201 | "D398", 3202 | "D3B4", 3203 | "D3D0", 3204 | "D3EC", 3205 | "D408", 3206 | "D424", 3207 | "D440", 3208 | "D45C", 3209 | "D478", 3210 | "D494", 3211 | "D4B0", 3212 | "D4CC", 3213 | "D4E8", 3214 | "D504", 3215 | "D520", 3216 | "D53C", 3217 | "D558", 3218 | "D574", 3219 | "D590", 3220 | "D5AC", 3221 | "D5C8", 3222 | "D5E4", 3223 | "D600", 3224 | "D61C", 3225 | "D638", 3226 | "D654", 3227 | "D670", 3228 | "D68C", 3229 | "D6A8", 3230 | "D6C4", 3231 | "D6E0", 3232 | "D6FC", 3233 | "D718", 3234 | "D734", 3235 | "D750", 3236 | "D76C", 3237 | "D788" 3238 | ] 3239 | }, 3240 | "T": { 3241 | "ranges": [ 3242 | [ 3243 | "11A8", 3244 | "11FF" 3245 | ], 3246 | [ 3247 | "D7CB", 3248 | "D7FB" 3249 | ] 3250 | ], 3251 | "single_chars": [] 3252 | }, 3253 | "SpacingMark": { 3254 | "ranges": [ 3255 | [ 3256 | "093E", 3257 | "0940" 3258 | ], 3259 | [ 3260 | "0949", 3261 | "094C" 3262 | ], 3263 | [ 3264 | "094E", 3265 | "094F" 3266 | ], 3267 | [ 3268 | "0982", 3269 | "0983" 3270 | ], 3271 | [ 3272 | "09BF", 3273 | "09C0" 3274 | ], 3275 | [ 3276 | "09C7", 3277 | "09C8" 3278 | ], 3279 | [ 3280 | "09CB", 3281 | "09CC" 3282 | ], 3283 | [ 3284 | "0A3E", 3285 | "0A40" 3286 | ], 3287 | [ 3288 | "0ABE", 3289 | "0AC0" 3290 | ], 3291 | [ 3292 | "0ACB", 3293 | "0ACC" 3294 | ], 3295 | [ 3296 | "0B02", 3297 | "0B03" 3298 | ], 3299 | [ 3300 | "0B47", 3301 | "0B48" 3302 | ], 3303 | [ 3304 | "0B4B", 3305 | "0B4C" 3306 | ], 3307 | [ 3308 | "0BC1", 3309 | "0BC2" 3310 | ], 3311 | [ 3312 | "0BC6", 3313 | "0BC8" 3314 | ], 3315 | [ 3316 | "0BCA", 3317 | "0BCC" 3318 | ], 3319 | [ 3320 | "0C01", 3321 | "0C03" 3322 | ], 3323 | [ 3324 | "0C41", 3325 | "0C44" 3326 | ], 3327 | [ 3328 | "0C82", 3329 | "0C83" 3330 | ], 3331 | [ 3332 | "0CC0", 3333 | "0CC1" 3334 | ], 3335 | [ 3336 | "0CC3", 3337 | "0CC4" 3338 | ], 3339 | [ 3340 | "0CC7", 3341 | "0CC8" 3342 | ], 3343 | [ 3344 | "0CCA", 3345 | "0CCB" 3346 | ], 3347 | [ 3348 | "0D02", 3349 | "0D03" 3350 | ], 3351 | [ 3352 | 
"0D3F", 3353 | "0D40" 3354 | ], 3355 | [ 3356 | "0D46", 3357 | "0D48" 3358 | ], 3359 | [ 3360 | "0D4A", 3361 | "0D4C" 3362 | ], 3363 | [ 3364 | "0D82", 3365 | "0D83" 3366 | ], 3367 | [ 3368 | "0DD0", 3369 | "0DD1" 3370 | ], 3371 | [ 3372 | "0DD8", 3373 | "0DDE" 3374 | ], 3375 | [ 3376 | "0DF2", 3377 | "0DF3" 3378 | ], 3379 | [ 3380 | "0F3E", 3381 | "0F3F" 3382 | ], 3383 | [ 3384 | "103B", 3385 | "103C" 3386 | ], 3387 | [ 3388 | "1056", 3389 | "1057" 3390 | ], 3391 | [ 3392 | "17BE", 3393 | "17C5" 3394 | ], 3395 | [ 3396 | "17C7", 3397 | "17C8" 3398 | ], 3399 | [ 3400 | "1923", 3401 | "1926" 3402 | ], 3403 | [ 3404 | "1929", 3405 | "192B" 3406 | ], 3407 | [ 3408 | "1930", 3409 | "1931" 3410 | ], 3411 | [ 3412 | "1933", 3413 | "1938" 3414 | ], 3415 | [ 3416 | "1A19", 3417 | "1A1A" 3418 | ], 3419 | [ 3420 | "1A6D", 3421 | "1A72" 3422 | ], 3423 | [ 3424 | "1B3D", 3425 | "1B41" 3426 | ], 3427 | [ 3428 | "1B43", 3429 | "1B44" 3430 | ], 3431 | [ 3432 | "1BA6", 3433 | "1BA7" 3434 | ], 3435 | [ 3436 | "1BEA", 3437 | "1BEC" 3438 | ], 3439 | [ 3440 | "1BF2", 3441 | "1BF3" 3442 | ], 3443 | [ 3444 | "1C24", 3445 | "1C2B" 3446 | ], 3447 | [ 3448 | "1C34", 3449 | "1C35" 3450 | ], 3451 | [ 3452 | "1CF2", 3453 | "1CF3" 3454 | ], 3455 | [ 3456 | "A823", 3457 | "A824" 3458 | ], 3459 | [ 3460 | "A880", 3461 | "A881" 3462 | ], 3463 | [ 3464 | "A8B4", 3465 | "A8C3" 3466 | ], 3467 | [ 3468 | "A952", 3469 | "A953" 3470 | ], 3471 | [ 3472 | "A9B4", 3473 | "A9B5" 3474 | ], 3475 | [ 3476 | "A9BA", 3477 | "A9BB" 3478 | ], 3479 | [ 3480 | "A9BD", 3481 | "A9C0" 3482 | ], 3483 | [ 3484 | "AA2F", 3485 | "AA30" 3486 | ], 3487 | [ 3488 | "AA33", 3489 | "AA34" 3490 | ], 3491 | [ 3492 | "AAEE", 3493 | "AAEF" 3494 | ], 3495 | [ 3496 | "ABE3", 3497 | "ABE4" 3498 | ], 3499 | [ 3500 | "ABE6", 3501 | "ABE7" 3502 | ], 3503 | [ 3504 | "ABE9", 3505 | "ABEA" 3506 | ], 3507 | [ 3508 | "110B0", 3509 | "110B2" 3510 | ], 3511 | [ 3512 | "110B7", 3513 | "110B8" 3514 | ], 3515 | [ 3516 | "111B3", 3517 | "111B5" 
3518 | ], 3519 | [ 3520 | "111BF", 3521 | "111C0" 3522 | ], 3523 | [ 3524 | "1122C", 3525 | "1122E" 3526 | ], 3527 | [ 3528 | "11232", 3529 | "11233" 3530 | ], 3531 | [ 3532 | "112E0", 3533 | "112E2" 3534 | ], 3535 | [ 3536 | "11302", 3537 | "11303" 3538 | ], 3539 | [ 3540 | "11341", 3541 | "11344" 3542 | ], 3543 | [ 3544 | "11347", 3545 | "11348" 3546 | ], 3547 | [ 3548 | "1134B", 3549 | "1134D" 3550 | ], 3551 | [ 3552 | "11362", 3553 | "11363" 3554 | ], 3555 | [ 3556 | "11435", 3557 | "11437" 3558 | ], 3559 | [ 3560 | "11440", 3561 | "11441" 3562 | ], 3563 | [ 3564 | "114B1", 3565 | "114B2" 3566 | ], 3567 | [ 3568 | "114BB", 3569 | "114BC" 3570 | ], 3571 | [ 3572 | "115B0", 3573 | "115B1" 3574 | ], 3575 | [ 3576 | "115B8", 3577 | "115BB" 3578 | ], 3579 | [ 3580 | "11630", 3581 | "11632" 3582 | ], 3583 | [ 3584 | "1163B", 3585 | "1163C" 3586 | ], 3587 | [ 3588 | "116AE", 3589 | "116AF" 3590 | ], 3591 | [ 3592 | "11720", 3593 | "11721" 3594 | ], 3595 | [ 3596 | "11A07", 3597 | "11A08" 3598 | ], 3599 | [ 3600 | "11A57", 3601 | "11A58" 3602 | ], 3603 | [ 3604 | "16F51", 3605 | "16F7E" 3606 | ] 3607 | ], 3608 | "single_chars": [ 3609 | "0903", 3610 | "093B", 3611 | "0A03", 3612 | "0A83", 3613 | "0AC9", 3614 | "0B40", 3615 | "0BBF", 3616 | "0CBE", 3617 | "0E33", 3618 | "0EB3", 3619 | "0F7F", 3620 | "1031", 3621 | "1084", 3622 | "17B6", 3623 | "1A55", 3624 | "1A57", 3625 | "1B04", 3626 | "1B35", 3627 | "1B3B", 3628 | "1B82", 3629 | "1BA1", 3630 | "1BAA", 3631 | "1BE7", 3632 | "1BEE", 3633 | "1CE1", 3634 | "1CF7", 3635 | "A827", 3636 | "A983", 3637 | "AA4D", 3638 | "AAEB", 3639 | "AAF5", 3640 | "ABEC", 3641 | "11000", 3642 | "11002", 3643 | "11082", 3644 | "1112C", 3645 | "11182", 3646 | "11235", 3647 | "1133F", 3648 | "11445", 3649 | "114B9", 3650 | "114BE", 3651 | "114C1", 3652 | "115BE", 3653 | "1163E", 3654 | "116AC", 3655 | "116B6", 3656 | "11726", 3657 | "11A39", 3658 | "11A97", 3659 | "11C2F", 3660 | "11C3E", 3661 | "11CA9", 3662 | "11CB1", 3663 | "11CB4", 3664 | 
"1D166", 3665 | "1D16D" 3666 | ] 3667 | }, 3668 | "CR": { 3669 | "ranges": [], 3670 | "single_chars": [ 3671 | "000D" 3672 | ] 3673 | }, 3674 | "Prepend": { 3675 | "ranges": [ 3676 | [ 3677 | "0600", 3678 | "0605" 3679 | ], 3680 | [ 3681 | "111C2", 3682 | "111C3" 3683 | ], 3684 | [ 3685 | "11A86", 3686 | "11A89" 3687 | ] 3688 | ], 3689 | "single_chars": [ 3690 | "06DD", 3691 | "070F", 3692 | "08E2", 3693 | "0D4E", 3694 | "110BD", 3695 | "11A3A", 3696 | "11D46" 3697 | ] 3698 | }, 3699 | "E_Base_GAZ": { 3700 | "ranges": [ 3701 | [ 3702 | "1F466", 3703 | "1F469" 3704 | ] 3705 | ], 3706 | "single_chars": [] 3707 | } 3708 | } -------------------------------------------------------------------------------- /bokepy/grapheme/finder.py: -------------------------------------------------------------------------------- 1 | from enum import Enum 2 | 3 | from grapheme.grapheme_property_group import GraphemePropertyGroup as G 4 | from grapheme.grapheme_property_group import get_group 5 | 6 | 7 | class FSM: 8 | @classmethod 9 | def default(cls, n): 10 | if n is G.OTHER: 11 | return True, cls.default 12 | 13 | if n is G.CR: 14 | return True, cls.cr 15 | 16 | if n in [G.LF, G.CONTROL]: 17 | return True, cls.lf_or_control 18 | 19 | if n is G.ZWJ: 20 | return False, cls.zwj 21 | 22 | if n in [G.EXTEND, G.SPACING_MARK]: 23 | return False, cls.default 24 | 25 | if n in [G.E_BASE, G.E_BASE_GAZ]: 26 | return True, cls.emoji 27 | 28 | if n is G.REGIONAL_INDICATOR: 29 | return True, cls.ri 30 | 31 | if n is G.L: 32 | return True, cls.hangul_l 33 | 34 | if n in [G.LV, G.V]: 35 | return True, cls.hangul_lv_or_v 36 | 37 | if n in [G.LVT, G.T]: 38 | return True, cls.hangul_lvt_or_t 39 | 40 | if n is G.PREPEND: 41 | return True, cls.prepend 42 | 43 | return True, cls.default 44 | 45 | @classmethod 46 | def default_next_state(cls, n, should_break): 47 | _, next_state = cls.default(n) 48 | return should_break, next_state 49 | 50 | @classmethod 51 | def cr(cls, n): 52 | if n is G.LF: 53 | return 
False, cls.lf_or_control 54 | return cls.default_next_state(n, should_break=True) 55 | 56 | @classmethod 57 | def lf_or_control(cls, n): 58 | return cls.default_next_state(n, should_break=True) 59 | 60 | @classmethod 61 | def prepend(cls, n): 62 | if n in [G.CONTROL, G.LF]: 63 | return True, cls.default 64 | if n is G.CR: 65 | return True, cls.cr 66 | return cls.default_next_state(n, should_break=False) 67 | 68 | # Hangul 69 | @classmethod 70 | def hangul_l(cls, n): 71 | if n in [G.V, G.LV]: 72 | return False, cls.hangul_lv_or_v 73 | if n is G.LVT: 74 | return False, cls.hangul_lvt_or_t 75 | if n is G.L: 76 | return False, cls.hangul_l 77 | return cls.default(n) 78 | 79 | @classmethod 80 | def hangul_lv_or_v(cls, n): 81 | if n is G.V: 82 | return False, cls.hangul_lv_or_v 83 | if n is G.T: 84 | return False, cls.hangul_lvt_or_t 85 | return cls.default(n) 86 | 87 | @classmethod 88 | def hangul_lvt_or_t(cls, n): 89 | if n is G.T: 90 | return False, cls.hangul_lvt_or_t 91 | return cls.default(n) 92 | 93 | # Emoji 94 | @classmethod 95 | def emoji(cls, n): 96 | if n is G.EXTEND: 97 | return False, cls.emoji 98 | if n is G.E_MODIFIER: 99 | return False, cls.default 100 | return cls.default(n) 101 | 102 | @classmethod 103 | def zwj(cls, n): 104 | if n is G.GLUE_AFTER_ZWJ: 105 | return False, cls.default 106 | if n is G.E_BASE_GAZ: 107 | return False, cls.emoji 108 | return cls.default(n) 109 | 110 | # Regional indicators (flags) 111 | @classmethod 112 | def ri(cls, n): 113 | if n is G.REGIONAL_INDICATOR: 114 | return False, cls.default 115 | return cls.default(n) 116 | 117 | class BreakPossibility(Enum): 118 | CERTAIN = "certain" 119 | POSSIBLE = "possible" 120 | NO_BREAK = "nobreak" 121 | 122 | 123 | def get_break_possibility(a, b): 124 | # Probably most common, included as a short circuit before checking all else 125 | if a is G.OTHER and b is G.OTHER: 126 | return BreakPossibility.CERTAIN 127 | 128 | assert isinstance(a, G) 129 | assert isinstance(b, G) 130 | 131 | #
Only break if preceded by an odd number of REGIONAL_INDICATORS 132 | if a is G.REGIONAL_INDICATOR and b is G.REGIONAL_INDICATOR: 133 | return BreakPossibility.POSSIBLE 134 | 135 | # Only if preceded by E_BASE or E_BASE_GAZ 136 | if a is G.EXTEND and b is G.E_MODIFIER: 137 | return BreakPossibility.POSSIBLE 138 | 139 | if a is G.CR and b is G.LF: 140 | return BreakPossibility.NO_BREAK 141 | 142 | if a in [G.CONTROL, G.CR, G.LF] or b in [G.CONTROL, G.CR, G.LF]: 143 | return BreakPossibility.CERTAIN 144 | 145 | if a is G.L and b in [G.L, G.V, G.LV, G.LVT]: 146 | return BreakPossibility.NO_BREAK 147 | 148 | if a in [G.LV, G.V] and b in [G.V, G.T]: 149 | return BreakPossibility.NO_BREAK 150 | 151 | if a in [G.LVT, G.T] and b is G.T: 152 | return BreakPossibility.NO_BREAK 153 | 154 | if b in [G.EXTEND, G.ZWJ, G.SPACING_MARK] or a is G.PREPEND: 155 | return BreakPossibility.NO_BREAK 156 | 157 | if a in [G.E_BASE, G.E_BASE_GAZ] and b is G.E_MODIFIER: 158 | return BreakPossibility.NO_BREAK 159 | 160 | if a is G.ZWJ and b in [G.GLUE_AFTER_ZWJ, G.E_BASE_GAZ]: 161 | return BreakPossibility.NO_BREAK 162 | 163 | # everything else, assumes all other rules are included above 164 | return BreakPossibility.CERTAIN 165 | 166 | 167 | def get_last_certain_break_index(string, index): 168 | if index >= len(string): 169 | return len(string) 170 | 171 | prev = get_group(string[index]) 172 | while True: 173 | if index <= 0: 174 | return 0 175 | index -= 1 176 | cur = get_group(string[index]) 177 | if get_break_possibility(cur, prev) == BreakPossibility.CERTAIN: 178 | return index + 1 179 | prev = cur 180 | 181 | 182 | class GraphemeIterator: 183 | def __init__(self, string): 184 | self.str_iter = iter(string) 185 | try: 186 | self.buffer = next(self.str_iter) 187 | except StopIteration: 188 | self.buffer = None 189 | else: 190 | _, state = FSM.default(get_group(self.buffer)) 191 | self.state = state 192 | 193 | def __iter__(self): 194 | return self 195 | 196 | def __next__(self): 197 | for
codepoint in self.str_iter: 198 | should_break, state = self.state(get_group(codepoint)) 199 | self.state = state 200 | 201 | if should_break: 202 | return self._break(codepoint) 203 | self.buffer += codepoint 204 | 205 | if self.buffer: 206 | return self._break(None) 207 | 208 | raise StopIteration() 209 | 210 | def _break(self, new): 211 | old_buffer = self.buffer 212 | self.buffer = new 213 | return old_buffer 214 | -------------------------------------------------------------------------------- /bokepy/grapheme/grapheme_property_group.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import string 4 | from enum import Enum 5 | 6 | 7 | class GraphemePropertyGroup(Enum): 8 | PREPEND = "Prepend" 9 | CR = "CR" 10 | LF = "LF" 11 | CONTROL = "Control" 12 | EXTEND = "Extend" 13 | REGIONAL_INDICATOR = "Regional_Indicator" 14 | SPACING_MARK = "SpacingMark" 15 | L = "L" 16 | V = "V" 17 | T = "T" 18 | LV = "LV" 19 | LVT = "LVT" 20 | E_BASE = "E_Base" 21 | E_MODIFIER = "E_Modifier" 22 | ZWJ = "ZWJ" 23 | GLUE_AFTER_ZWJ = "Glue_After_Zwj" 24 | E_BASE_GAZ = "E_Base_GAZ" 25 | 26 | OTHER = "Other" 27 | 28 | COMMON_OTHER_GROUP_CHARS = "" 29 | 30 | def get_group(char): 31 | if char in COMMON_OTHER_GROUP_CHARS: 32 | return GraphemePropertyGroup.OTHER 33 | else: 34 | return get_group_ord(ord(char)) 35 | 36 | 37 | def get_group_ord(char): 38 | group = SINGLE_CHAR_MAPPINGS.get(char, None) 39 | if group: 40 | return group 41 | 42 | return RANGE_TREE.get_value(char) or GraphemePropertyGroup.OTHER 43 | 44 | 45 | class ContainerNode: 46 | """ 47 | Simple implementation of an interval-based B-tree with no support for deletion. 48 | """ 49 | def __init__(self, children): 50 | self.children = self._sorted(children) 51 | self._set_min_max() 52 | 53 | def _set_min_max(self): 54 | self.min = self.children[0].min 55 | self.max = self.children[-1].max 56 | 57 | # Adds an item to the node or its subnodes.
Returns a new node if this node is split, or None. 58 | def add(self, item): 59 | for child in self.children: 60 | if child.min <= item.min <= child.max: 61 | assert child.min <= item.max <= child.max 62 | new_child = child.add(item) 63 | if new_child: 64 | return self._add_child(new_child) 65 | else: 66 | self._set_min_max() 67 | return None 68 | return self._add_child(item) 69 | 70 | def get_value(self, key): 71 | for child in self.children: 72 | if child.min <= key <= child.max: 73 | return child.get_value(key) 74 | return None 75 | 76 | def _add_child(self, child): 77 | self.children.append(child) 78 | self.children = self._sorted(self.children) 79 | other = None 80 | if len(self.children) >= 4: 81 | other = ContainerNode(self.children[2:]) 82 | self.children = self.children[0:2] 83 | self._set_min_max() 84 | return other 85 | 86 | def _sorted(self, children): 87 | return sorted(children, key=lambda c: c.min) 88 | 89 | 90 | class LeafNode: 91 | def __init__(self, range_min, range_max, group): 92 | self.min = range_min 93 | self.max = range_max 94 | self.group = group 95 | 96 | # Assumes range check has already been done 97 | def get_value(self, _key): 98 | return self.group 99 | 100 | with open(os.path.join(os.path.dirname(__file__), "data/grapheme_break_property.json"), 'r') as f: 101 | data = json.load(f) 102 | 103 | assert len(data) == len(GraphemePropertyGroup) - 1 104 | 105 | SINGLE_CHAR_MAPPINGS = {} 106 | 107 | for key, value in data.items(): 108 | group = GraphemePropertyGroup(key) 109 | for char in value["single_chars"]: 110 | SINGLE_CHAR_MAPPINGS[int(char, 16)] = group 111 | 112 | RANGE_TREE = None 113 | for key, value in data.items(): 114 | for range_ in value["ranges"]: 115 | min_ = int(range_[0], 16) 116 | max_ = int(range_[1], 16) 117 | group = GraphemePropertyGroup(key) 118 | if max_ - min_ < 20: 119 | for i in range(min_, max_ + 1): 120 | SINGLE_CHAR_MAPPINGS[i] = group 121 | continue 122 | new_node = LeafNode( min_, max_, group) 123 | if 
RANGE_TREE: 124 | new_subtree = RANGE_TREE.add(new_node) 125 | if new_subtree: 126 | RANGE_TREE = ContainerNode([RANGE_TREE, new_subtree]) 127 | else: 128 | RANGE_TREE = ContainerNode([new_node]) 129 | 130 | common_ascii = string.ascii_letters + string.digits + string.punctuation 131 | COMMON_OTHER_GROUP_CHARS = "".join(c for c in common_ascii if get_group(c) == GraphemePropertyGroup.OTHER) 132 | del data 133 | -------------------------------------------------------------------------------- /bokepy/utils.py: -------------------------------------------------------------------------------- 1 | import boke 2 | import timeit 3 | 4 | 5 | def time_estimator(data, 6 | function, 7 | units='seconds', 8 | sampling_fraction=1000): 9 | 10 | ''' 11 | 12 | Estimates the time it takes to perform a given function 13 | with a given dataset. 14 | 15 | ''' 16 | 17 | # look up the function on the boke module by name 18 | method_to_call = getattr(boke, function) 19 | 20 | # pad the extrapolated estimate by 10% 21 | estimate_corrector = 1.1 22 | 23 | sample_size = round(len(data) / sampling_fraction) 24 | start_time = timeit.default_timer() 25 | 26 | method_to_call(data[:sample_size]) 27 | end_time = timeit.default_timer() 28 | estimate_seconds = (end_time - start_time) * sampling_fraction 29 | estimated_time = round(estimate_seconds * estimate_corrector, -1) 30 | 31 | if units == 'minutes': 32 | estimated_time = estimated_time / 60 33 | print("It will take roughly %d minutes" % estimated_time) 34 | else: 35 | print("It will take roughly %d seconds" % estimated_time) 36 | -------------------------------------------------------------------------------- /extras/rt_index.txt: -------------------------------------------------------------------------------- 1 | # biographies = 0:10 2 | # indexes and lineage histories = 11:17 3 | # maha-yoga = 17:1417 4 | # anu-yoga = 1418:1444 5 | # ati-yoga = 1445:1861 6 | # supplementary = 1861:2122 7 | --------------------------------------------------------------------------------
/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # 3 | # Copyright (C) 2018 Mikko Kotila 4 | 5 | 6 | DESCRIPTION = "Tibetan Language Workbench" 7 | LONG_DESCRIPTION = """\ 8 | bö (tib. བོད་) means Tibet, and ke (tib. སྐད) means language, 9 | so together böke means Tibetan Language. bokepy is a shorthand 10 | for boke and python, and is a Tibetan Language Processing Library 11 | built for handling the most common language processing tasks in a 12 | straightforward way. Bokepy is built from the ground up to support 13 | a wide range of research challenges, including those far beyond the 14 | scope of typical scholarly interest. This includes rapid testing of ideas 15 | and prototyping of completely new technology solutions.
16 | """ 17 | 18 | DISTNAME = 'bokepy' 19 | MAINTAINER = 'Mikko Kotila' 20 | MAINTAINER_EMAIL = 'mailme@mikkokotila.com' 21 | URL = 'https://github.com/mikkokotila/bokepy' 22 | LICENSE = 'MIT' 23 | DOWNLOAD_URL = 'https://github.com/mikkokotila/bokepy.git' 24 | VERSION = '0.1' 25 | 26 | try: 27 | from setuptools import setup 28 | _has_setuptools = True 29 | except ImportError: 30 | from distutils.core import setup 31 | 32 | def check_dependencies(): 33 | 34 | install_requires = [] 35 | 36 | try: 37 | import pandas 38 | except ImportError: 39 | install_requires.append('pandas') 40 | 41 | return install_requires 42 | 43 | if __name__ == "__main__": 44 | 45 | install_requires = check_dependencies() 46 | 47 | setup(name=DISTNAME, 48 | author=MAINTAINER, 49 | author_email=MAINTAINER_EMAIL, 50 | maintainer=MAINTAINER, 51 | maintainer_email=MAINTAINER_EMAIL, 52 | description=DESCRIPTION, 53 | long_description=LONG_DESCRIPTION, 54 | license=LICENSE, 55 | url=URL, 56 | version=VERSION, 57 | download_url=DOWNLOAD_URL, 58 | install_requires=install_requires, 59 | packages=['bokepy', 60 | 'bokepy.grapheme'], 61 | 62 | classifiers=[ 63 | 'Intended Audience :: Science/Research', 64 | 'Programming Language :: Python :: 2.7', 65 | 'License :: OSI Approved :: MIT License', 66 | 'Topic :: Scientific/Engineering :: Human Machine Interfaces', 67 | 'Topic :: Scientific/Engineering :: Artificial Intelligence', 68 | 'Topic :: Scientific/Engineering :: Linguistics', 69 | 'Operating System :: POSIX', 70 | 'Operating System :: Unix', 71 | 'Operating System :: MacOS']) 72 | -------------------------------------------------------------------------------- /tests/basic_tests.py: -------------------------------------------------------------------------------- 1 | import boke 2 | 3 | test_sample = 100000 4 | 5 | text = boke.ingest_text('rt_texts_raw.txt') 6 | 7 | chars = boke.text_to_chars(text[:test_sample]) 8 | words =
boke.text_to_words(text[:test_sample]) 9 | syllables = boke.text_to_syllables(text[:test_sample]) 10 | sentences = boke.text_to_sentence(text[:test_sample]) 11 | 12 | print(len(chars)) 13 | print(len(syllables)) 14 | print(len(words)) 15 | print(len(sentences)) 16 | 17 | print(chars[100:110]) 18 | print(syllables[100:110]) 19 | print(words[100:110]) 20 | print(sentences[100:110]) 21 | 22 | print(boke.syllable_counts(chars).head(5)) 23 | print(boke.syllable_counts(syllables).head(5)) 24 | print(boke.syllable_counts(words).head(5)) 25 | print(boke.syllable_counts(sentences).head(5)) 26 | 27 | chars_freq = boke.syllable_counts(chars) 28 | syllables_freq = boke.syllable_counts(syllables) 29 | words_freq = boke.syllable_counts(words) 30 | sentences_freq = boke.syllable_counts(sentences) 31 | 32 | boke.share_by_order(chars_freq) 33 | boke.share_by_order(syllables_freq) 34 | boke.share_by_order(words_freq) 35 | boke.share_by_order(sentences_freq) 36 | 37 | chars_pairs = boke.syllable_grams(chars) 38 | syllable_pairs = boke.syllable_grams(syllables) 39 | words_pairs = boke.syllable_grams(words) 40 | sentences_pairs = boke.syllable_grams(sentences) 41 | 42 | print(chars_pairs[:10]) 43 | print(syllable_pairs[:10]) 44 | print(words_pairs[:10]) 45 | print(sentences_pairs[:10]) 46 | --------------------------------------------------------------------------------
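The state functions in /bokepy/grapheme/finder.py each take the property group of the next code point and return a `(should_break, next_state)` pair. A minimal standalone sketch of that pattern follows, reduced to the CR/LF and regional-indicator rules only. It does not import bokepy: the names `Group` and `split_clusters`, and the hard-coded code point checks in `get_group`, are assumptions made for illustration (the real module derives groups from data/grapheme_break_property.json and handles many more groups).

```python
from enum import Enum


class Group(Enum):
    # Hypothetical subset of the property groups, for illustration only.
    CR = "CR"
    LF = "LF"
    REGIONAL_INDICATOR = "Regional_Indicator"
    OTHER = "Other"


def get_group(char):
    # Assumption: hard-coded checks stand in for the JSON-driven lookup.
    code = ord(char)
    if code == 0x000D:
        return Group.CR
    if code == 0x000A:
        return Group.LF
    if 0x1F1E6 <= code <= 0x1F1FF:
        return Group.REGIONAL_INDICATOR
    return Group.OTHER


def default(group):
    # Break before every code point; only the follow-up state differs.
    if group is Group.CR:
        return True, cr
    if group is Group.REGIONAL_INDICATOR:
        return True, ri
    return True, default


def cr(group):
    # CR followed by LF stays in the same cluster.
    if group is Group.LF:
        return False, default
    return default(group)


def ri(group):
    # A second regional indicator completes the pair (one flag).
    if group is Group.REGIONAL_INDICATOR:
        return False, default
    return default(group)


def split_clusters(text):
    """Split text into clusters using the simplified state machine."""
    clusters = []
    state = default
    for char in text:
        should_break, state = state(get_group(char))
        if should_break or not clusters:
            clusters.append(char)
        else:
            clusters[-1] += char
    return clusters


if __name__ == "__main__":
    print(split_clusters("a\r\nb"))
    print(len(split_clusters("\U0001F1F8\U0001F1EA\U0001F1EB\U0001F1EE")))
```

In this sketch, four consecutive regional indicators come out as two clusters of two code points each, because the `ri` state falls back to `default` after absorbing the second indicator.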