├── .github └── workflows │ └── tests.yml ├── .gitignore ├── LICENSE ├── README.md ├── doc.go ├── export_test.go ├── go.mod ├── maketesttables.go ├── ragel ├── unicode2ragel.rb ├── uscript.rl └── uwb.rl ├── segment.go ├── segment_fuzz.go ├── segment_fuzz_test.go ├── segment_test.go ├── segment_words.go ├── segment_words.rl ├── segment_words_prod.go ├── segment_words_test.go └── tables_test.go /.github/workflows/tests.yml: -------------------------------------------------------------------------------- 1 | on: 2 | push: 3 | branches: 4 | - master 5 | pull_request: 6 | name: Tests 7 | jobs: 8 | test: 9 | strategy: 10 | matrix: 11 | go-version: [1.17.x, 1.18.x, 1.19.x] 12 | platform: [ubuntu-latest, macos-latest, windows-latest] 13 | runs-on: ${{ matrix.platform }} 14 | steps: 15 | - name: Install Go 16 | uses: actions/setup-go@v1 17 | with: 18 | go-version: ${{ matrix.go-version }} 19 | - name: Checkout code 20 | uses: actions/checkout@v2 21 | - name: Test 22 | run: | 23 | go version 24 | go test -race ./... 25 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | #* 2 | *.sublime-* 3 | *~ 4 | .#* 5 | .project 6 | .settings 7 | .DS_Store 8 | /maketesttables 9 | /workdir 10 | /segment-fuzz.zip -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 
15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 
48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. 
Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 
123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. 
In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. 
We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # segment 2 | 3 | [![Tests](https://github.com/blevesearch/segment/workflows/Tests/badge.svg?branch=master&event=push)](https://github.com/blevesearch/segment/actions?query=workflow%3ATests+event%3Apush+branch%3Amaster) 4 | 5 | A Go library for performing Unicode Text Segmentation 6 | as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/) 7 | 8 | ## Features 9 | 10 | * Currently only segmentation at Word Boundaries is supported. 11 | 12 | ## License 13 | 14 | Apache License Version 2.0 15 | 16 | ## Usage 17 | 18 | The functionality is exposed in two ways: 19 | 20 | 1. You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. 21 | The SplitWords function will identify the appropriate word boundaries in the input 22 | text and the Scanner will return tokens at the appropriate place. 23 | 24 | scanner := bufio.NewScanner(...) 
25 | scanner.Split(segment.SplitWords) 26 | for scanner.Scan() { 27 | tokenBytes := scanner.Bytes() 28 | } 29 | if err := scanner.Err(); err != nil { 30 | t.Fatal(err) 31 | } 32 | 33 | 2. Sometimes you would also like information returned about the type of token. 34 | To do this we have introduced a new type named Segmenter. It works just like Scanner 35 | but additionally a token type is returned. 36 | 37 | segmenter := segment.NewWordSegmenter(...) 38 | for segmenter.Segment() { 39 | tokenBytes := segmenter.Bytes() 40 | tokenType := segmenter.Type() 41 | } 42 | if err := segmenter.Err(); err != nil { 43 | t.Fatal(err) 44 | } 45 | 46 | ## Choosing Implementation 47 | 48 | By default segment does NOT use the fastest runtime implementation. The reason is that it adds approximately 5s to compilation time and may require more than 1 GB of RAM on the machine performing compilation. 49 | 50 | However, you can choose to build with the fastest runtime implementation by passing the build tag as follows: 51 | 52 | -tags 'prod' 53 | 54 | ## Generating Code 55 | 56 | Several components in this package are generated. 57 | 58 | 1. Several Ragel rules files are generated from Unicode properties files. 59 | 2. The Ragel machine is generated from the Ragel rules. 60 | 3. Test tables are generated from the Unicode test files. 61 | 62 | All of these can be generated by running: 63 | 64 | go generate 65 | 66 | ## Fuzzing 67 | 68 | There is support for fuzzing the segment library with [go-fuzz](https://github.com/dvyukov/go-fuzz). 69 | 70 | 1. Install go-fuzz if you haven't already: 71 | 72 | go get github.com/dvyukov/go-fuzz/go-fuzz 73 | go get github.com/dvyukov/go-fuzz/go-fuzz-build 74 | 75 | 2. Build the package with go-fuzz: 76 | 77 | go-fuzz-build github.com/blevesearch/segment 78 | 79 | 3. Convert the Unicode-provided test cases into the initial corpus for go-fuzz: 80 | 81 | go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate 82 | 83 | 4. 
Run go-fuzz: 84 | 85 | go-fuzz -bin=segment-fuzz.zip -workdir=workdir 86 | 87 | ## Status 88 | 89 | 90 | [![Build Status](https://travis-ci.org/blevesearch/segment.svg?branch=master)](https://travis-ci.org/blevesearch/segment) 91 | 92 | [![Coverage Status](https://img.shields.io/coveralls/blevesearch/segment.svg)](https://coveralls.io/r/blevesearch/segment?branch=master) 93 | 94 | [![GoDoc](https://godoc.org/github.com/blevesearch/segment?status.svg)](https://godoc.org/github.com/blevesearch/segment) -------------------------------------------------------------------------------- /doc.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2014 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 9 | 10 | /* 11 | Package segment is a library for performing Unicode Text Segmentation 12 | as described in Unicode Standard Annex #29 http://www.unicode.org/reports/tr29/ 13 | 14 | Currently only segmentation at Word Boundaries is supported. 15 | 16 | The functionality is exposed in two ways: 17 | 18 | 1. You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. 19 | The SplitWords function will identify the appropriate word boundaries in the input 20 | text and the Scanner will return tokens at the appropriate place. 21 | 22 | scanner := bufio.NewScanner(...) 
23 | scanner.Split(segment.SplitWords) 24 | for scanner.Scan() { 25 | tokenBytes := scanner.Bytes() 26 | } 27 | if err := scanner.Err(); err != nil { 28 | t.Fatal(err) 29 | } 30 | 31 | 2. Sometimes you would also like information returned about the type of token. 32 | To do this we have introduced a new type named Segmenter. It works just like Scanner 33 | but additionally a token type is returned. 34 | 35 | segmenter := segment.NewWordSegmenter(...) 36 | for segmenter.Segment() { 37 | tokenBytes := segmenter.Bytes() 38 | tokenType := segmenter.Type() 39 | } 40 | if err := segmenter.Err(); err != nil { 41 | t.Fatal(err) 42 | } 43 | 44 | */ 45 | package segment 46 | -------------------------------------------------------------------------------- /export_test.go: -------------------------------------------------------------------------------- 1 | // Copyright 2013 The Go Authors. All rights reserved. 2 | // Use of this source code is governed by a BSD-style 3 | // license that can be found in the LICENSE file. 4 | 5 | package segment 6 | 7 | // Exported for testing only. 8 | import ( 9 | "unicode/utf8" 10 | ) 11 | 12 | func (s *Segmenter) MaxTokenSize(n int) { 13 | if n < utf8.UTFMax || n > 1e9 { 14 | panic("bad max token size") 15 | } 16 | if n < len(s.buf) { 17 | s.buf = make([]byte, n) 18 | } 19 | s.maxTokenSize = n 20 | } 21 | -------------------------------------------------------------------------------- /go.mod: -------------------------------------------------------------------------------- 1 | module github.com/blevesearch/segment 2 | 3 | go 1.18 4 | -------------------------------------------------------------------------------- /maketesttables.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2015 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. 
You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 9 | 10 | // +build ignore 11 | 12 | package main 13 | 14 | import ( 15 | "bufio" 16 | "bytes" 17 | "flag" 18 | "fmt" 19 | "io" 20 | "log" 21 | "net/http" 22 | "os" 23 | "os/exec" 24 | "strconv" 25 | "strings" 26 | "unicode" 27 | ) 28 | 29 | var url = flag.String("url", 30 | "http://www.unicode.org/Public/"+unicode.Version+"/ucd/auxiliary/", 31 | "URL of Unicode database directory") 32 | var verbose = flag.Bool("verbose", 33 | false, 34 | "write data to stdout as it is parsed") 35 | var localFiles = flag.Bool("local", 36 | false, 37 | "data files have been copied to the current directory; for debugging only") 38 | 39 | var outputFile = flag.String("output", 40 | "", 41 | "output file for generated tables; default stdout") 42 | 43 | var output *bufio.Writer 44 | 45 | func main() { 46 | flag.Parse() 47 | setupOutput() 48 | 49 | graphemeTests := make([]test, 0) 50 | graphemeComments := make([]string, 0) 51 | graphemeTests, graphemeComments = loadUnicodeData("GraphemeBreakTest.txt", graphemeTests, graphemeComments) 52 | wordTests := make([]test, 0) 53 | wordComments := make([]string, 0) 54 | wordTests, wordComments = loadUnicodeData("WordBreakTest.txt", wordTests, wordComments) 55 | sentenceTests := make([]test, 0) 56 | sentenceComments := make([]string, 0) 57 | sentenceTests, sentenceComments = loadUnicodeData("SentenceBreakTest.txt", sentenceTests, sentenceComments) 58 | 59 | fmt.Fprintf(output, fileHeader, *url) 60 | generateTestTables("Grapheme", graphemeTests, graphemeComments) 61 | generateTestTables("Word", wordTests, wordComments) 62 | 
generateTestTables("Sentence", sentenceTests, sentenceComments) 63 | 64 | flushOutput() 65 | } 66 | 67 | // WordBreakProperty.txt has the form: 68 | // 05F0..05F2 ; Hebrew_Letter # Lo [3] HEBREW LIGATURE YIDDISH DOUBLE VAV..HEBREW LIGATURE YIDDISH DOUBLE YOD 69 | // FB1D ; Hebrew_Letter # Lo HEBREW LETTER YOD WITH HIRIQ 70 | func openReader(file string) (input io.ReadCloser) { 71 | if *localFiles { 72 | f, err := os.Open(file) 73 | if err != nil { 74 | log.Fatal(err) 75 | } 76 | input = f 77 | } else { 78 | path := *url + file 79 | resp, err := http.Get(path) 80 | if err != nil { 81 | log.Fatal(err) 82 | } 83 | if resp.StatusCode != 200 { 84 | log.Fatal("bad GET status for "+file, resp.Status) 85 | } 86 | input = resp.Body 87 | } 88 | return 89 | } 90 | 91 | func loadUnicodeData(filename string, tests []test, comments []string) ([]test, []string) { 92 | f := openReader(filename) 93 | defer f.Close() 94 | bufioReader := bufio.NewReader(f) 95 | line, err := bufioReader.ReadString('\n') 96 | for err == nil { 97 | tests, comments = parseLine(line, tests, comments) 98 | line, err = bufioReader.ReadString('\n') 99 | } 100 | // if the err was EOF still need to process last value 101 | if err == io.EOF { 102 | tests, comments = parseLine(line, tests, comments) 103 | } 104 | return tests, comments 105 | } 106 | 107 | const comment = "#" 108 | const brk = "÷" 109 | const nbrk = "×" 110 | 111 | type test [][]byte 112 | 113 | func parseLine(line string, tests []test, comments []string) ([]test, []string) { 114 | if strings.HasPrefix(line, comment) { 115 | return tests, comments 116 | } 117 | line = strings.TrimSpace(line) 118 | if len(line) == 0 { 119 | return tests, comments 120 | } 121 | commentStart := strings.Index(line, comment) 122 | comment := strings.TrimSpace(line[commentStart+1:]) 123 | if commentStart > 0 { 124 | line = line[0:commentStart] 125 | } 126 | pieces := strings.Split(line, brk) 127 | t := make(test, 0) 128 | for _, piece := range pieces { 129 | piece = 
strings.TrimSpace(piece) 130 | if len(piece) > 0 { 131 | codePoints := strings.Split(piece, nbrk) 132 | word := "" 133 | for _, codePoint := range codePoints { 134 | codePoint = strings.TrimSpace(codePoint) 135 | r, err := strconv.ParseInt(codePoint, 16, 64) 136 | if err != nil { 137 | log.Printf("err: %v for '%s'", err, codePoint) 138 | return tests, comments 139 | } 140 | 141 | word += string(rune(r)) 142 | } 143 | t = append(t, []byte(word)) 144 | } 145 | } 146 | tests = append(tests, t) 147 | comments = append(comments, comment) 148 | return tests, comments 149 | } 150 | 151 | func generateTestTables(prefix string, tests []test, comments []string) { 152 | fmt.Fprintf(output, testHeader, prefix) 153 | for i, t := range tests { 154 | fmt.Fprintf(output, "\t\t{\n") 155 | fmt.Fprintf(output, "\t\t\tinput: %#v,\n", bytes.Join(t, []byte{})) 156 | fmt.Fprintf(output, "\t\t\toutput: %s,\n", generateTest(t)) 157 | fmt.Fprintf(output, "\t\t\tcomment: `%s`,\n", comments[i]) 158 | fmt.Fprintf(output, "\t\t},\n") 159 | } 160 | fmt.Fprintf(output, "}\n") 161 | } 162 | 163 | func generateTest(t test) string { 164 | rv := "[][]byte{" 165 | for _, te := range t { 166 | rv += fmt.Sprintf("%#v,", te) 167 | } 168 | rv += "}" 169 | return rv 170 | } 171 | 172 | const fileHeader = `// Generated by running 173 | // maketesttables --url=%s 174 | // DO NOT EDIT 175 | 176 | package segment 177 | ` 178 | 179 | const testHeader = `var unicode%sTests = []struct { 180 | input []byte 181 | output [][]byte 182 | comment string 183 | }{ 184 | ` 185 | 186 | func setupOutput() { 187 | output = bufio.NewWriter(startGofmt()) 188 | } 189 | 190 | // startGofmt connects output to a gofmt process if -output is set. 191 | func startGofmt() io.Writer { 192 | if *outputFile == "" { 193 | return os.Stdout 194 | } 195 | stdout, err := os.Create(*outputFile) 196 | if err != nil { 197 | log.Fatal(err) 198 | } 199 | // Pipe output to gofmt. 
200 | gofmt := exec.Command("gofmt") 201 | fd, err := gofmt.StdinPipe() 202 | if err != nil { 203 | log.Fatal(err) 204 | } 205 | gofmt.Stdout = stdout 206 | gofmt.Stderr = os.Stderr 207 | err = gofmt.Start() 208 | if err != nil { 209 | log.Fatal(err) 210 | } 211 | return fd 212 | } 213 | 214 | func flushOutput() { 215 | err := output.Flush() 216 | if err != nil { 217 | log.Fatal(err) 218 | } 219 | } 220 | -------------------------------------------------------------------------------- /ragel/unicode2ragel.rb: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env ruby 2 | # 3 | # This script has been updated to accept more command-line arguments: 4 | # 5 | # -u, --url URL to process 6 | # -m, --machine Machine name 7 | # -p, --properties Properties to add to the machine 8 | # -o, --output Write output to file 9 | # 10 | # Updated by: Marty Schoch 11 | # 12 | # This script uses the unicode spec to generate a Ragel state machine 13 | # that recognizes unicode alphanumeric characters. It generates 5 14 | # character classes: uupper, ulower, ualpha, udigit, and ualnum. 15 | # Currently supported encodings are UTF-8 [default] and UCS-4. 16 | # 17 | # Usage: unicode2ragel.rb [options] 18 | # -e, --encoding [ucs4 | utf8] Data encoding 19 | # -h, --help Show this message 20 | # 21 | # This script was originally written as part of the Ferret search 22 | # engine library. 
23 | # 24 | # Author: Rakan El-Khalil 25 | 26 | require 'optparse' 27 | require 'open-uri' 28 | 29 | ENCODINGS = [ :utf8, :ucs4 ] 30 | ALPHTYPES = { :utf8 => "unsigned char", :ucs4 => "unsigned int" } 31 | DEFAULT_CHART_URL = "http://www.unicode.org/Public/5.1.0/ucd/DerivedCoreProperties.txt" 32 | DEFAULT_MACHINE_NAME= "WChar" 33 | 34 | ### 35 | # Display vars & default option 36 | 37 | TOTAL_WIDTH = 80 38 | RANGE_WIDTH = 23 39 | @encoding = :utf8 40 | @chart_url = DEFAULT_CHART_URL 41 | machine_name = DEFAULT_MACHINE_NAME 42 | properties = [] 43 | @output = $stdout 44 | 45 | ### 46 | # Option parsing 47 | 48 | cli_opts = OptionParser.new do |opts| 49 | opts.on("-e", "--encoding [ucs4 | utf8]", "Data encoding") do |o| 50 | @encoding = o.downcase.to_sym 51 | end 52 | opts.on("-h", "--help", "Show this message") do 53 | puts opts 54 | exit 55 | end 56 | opts.on("-u", "--url URL", "URL to process") do |o| 57 | @chart_url = o 58 | end 59 | opts.on("-m", "--machine MACHINE_NAME", "Machine name") do |o| 60 | machine_name = o 61 | end 62 | opts.on("-p", "--properties x,y,z", Array, "Properties to add to machine") do |o| 63 | properties = o 64 | end 65 | opts.on("-o", "--output FILE", "output file") do |o| 66 | @output = File.new(o, "w+") 67 | end 68 | end 69 | 70 | cli_opts.parse(ARGV) 71 | unless ENCODINGS.member? @encoding 72 | puts "Invalid encoding: #{@encoding}" 73 | puts cli_opts 74 | exit 75 | end 76 | 77 | ## 78 | # Downloads the document at url and yields every alpha line's hex 79 | # range and description. 80 | 81 | def each_alpha( url, property ) 82 | open( url ) do |file| 83 | file.each_line do |line| 84 | next if line =~ /^#/; 85 | next if line !~ /; #{property} #/; 86 | 87 | range, description = line.split(/;/) 88 | range.strip! 89 | description.gsub!(/.*#/, '').strip! 90 | 91 | if range =~ /\.\./ 92 | start, stop = range.split '..' 93 | else start = stop = range 94 | end 95 | 96 | yield start.hex .. 
stop.hex, description 97 | end 98 | end 99 | end 100 | 101 | ### 102 | # Formats to hex at minimum width 103 | 104 | def to_hex( n ) 105 | r = "%0X" % n 106 | r = "0#{r}" unless (r.length % 2).zero? 107 | r 108 | end 109 | 110 | ### 111 | # UCS4 is just a straight hex conversion of the unicode codepoint. 112 | 113 | def to_ucs4( range ) 114 | rangestr = "0x" + to_hex(range.begin) 115 | rangestr << "..0x" + to_hex(range.end) if range.begin != range.end 116 | [ rangestr ] 117 | end 118 | 119 | ## 120 | # 0x00 - 0x7f -> 0zzzzzzz[7] 121 | # 0x80 - 0x7ff -> 110yyyyy[5] 10zzzzzz[6] 122 | # 0x800 - 0xffff -> 1110xxxx[4] 10yyyyyy[6] 10zzzzzz[6] 123 | # 0x010000 - 0x10ffff -> 11110www[3] 10xxxxxx[6] 10yyyyyy[6] 10zzzzzz[6] 124 | 125 | UTF8_BOUNDARIES = [0x7f, 0x7ff, 0xffff, 0x10ffff] 126 | 127 | def to_utf8_enc( n ) 128 | r = 0 129 | if n <= 0x7f 130 | r = n 131 | elsif n <= 0x7ff 132 | y = 0xc0 | (n >> 6) 133 | z = 0x80 | (n & 0x3f) 134 | r = y << 8 | z 135 | elsif n <= 0xffff 136 | x = 0xe0 | (n >> 12) 137 | y = 0x80 | (n >> 6) & 0x3f 138 | z = 0x80 | n & 0x3f 139 | r = x << 16 | y << 8 | z 140 | elsif n <= 0x10ffff 141 | w = 0xf0 | (n >> 18) 142 | x = 0x80 | (n >> 12) & 0x3f 143 | y = 0x80 | (n >> 6) & 0x3f 144 | z = 0x80 | n & 0x3f 145 | r = w << 24 | x << 16 | y << 8 | z 146 | end 147 | 148 | to_hex(r) 149 | end 150 | 151 | def from_utf8_enc( n ) 152 | n = n.hex 153 | r = 0 154 | if n <= 0x7f 155 | r = n 156 | elsif n <= 0xdfff 157 | y = (n >> 8) & 0x1f 158 | z = n & 0x3f 159 | r = y << 6 | z 160 | elsif n <= 0xefffff 161 | x = (n >> 16) & 0x0f 162 | y = (n >> 8) & 0x3f 163 | z = n & 0x3f 164 | r = x << 10 | y << 6 | z 165 | elsif n <= 0xf7ffffff 166 | w = (n >> 24) & 0x07 167 | x = (n >> 16) & 0x3f 168 | y = (n >> 8) & 0x3f 169 | z = n & 0x3f 170 | r = w << 18 | x << 12 | y << 6 | z 171 | end 172 | r 173 | end 174 | 175 | ### 176 | # Given a range, splits it up into ranges that can be continuously 177 | # encoded into utf8. Eg: 0x00 .. 
0xff => [0x00..0x7f, 0x80..0xff] 178 | # This is not strictly needed since the current [5.1] unicode standard 179 | # doesn't have ranges that straddle utf8 boundaries. This is included 180 | # for completeness as there is no telling if that will ever change. 181 | 182 | def utf8_ranges( range ) 183 | ranges = [] 184 | UTF8_BOUNDARIES.each do |max| 185 | if range.begin <= max 186 | return ranges << range if range.end <= max 187 | 188 | ranges << range.begin .. max 189 | range = (max + 1) .. range.end 190 | end 191 | end 192 | ranges 193 | end 194 | 195 | def build_range( start, stop ) 196 | size = start.size/2 197 | left = size - 1 198 | return [""] if size < 1 199 | 200 | a = start[0..1] 201 | b = stop[0..1] 202 | 203 | ### 204 | # Shared prefix 205 | 206 | if a == b 207 | return build_range(start[2..-1], stop[2..-1]).map do |elt| 208 | "0x#{a} " + elt 209 | end 210 | end 211 | 212 | ### 213 | # Unshared prefix, end of run 214 | 215 | return ["0x#{a}..0x#{b} "] if left.zero? 216 | 217 | ### 218 | # Unshared prefix, not end of run 219 | # Range can be 0x123456..0x56789A 220 | # Which is equivalent to: 221 | # 0x123456 .. 0x12FFFF 222 | # 0x130000 .. 0x55FFFF 223 | # 0x560000 .. 0x56789A 224 | 225 | ret = [] 226 | ret << build_range(start, a + "FF" * left) 227 | 228 | ### 229 | # Only generate middle range if need be. 230 | 231 | if a.hex+1 != b.hex 232 | max = to_hex(b.hex - 1) 233 | max = "FF" if b == "FF" 234 | ret << "0x#{to_hex(a.hex+1)}..0x#{max} " + "0x00..0xFF " * left 235 | end 236 | 237 | ### 238 | # Don't generate last range if it is covered by first range 239 | 240 | ret << build_range(b + "00" * left, stop) unless b == "FF" 241 | ret.flatten! 242 | end 243 | 244 | def to_utf8( range ) 245 | utf8_ranges( range ).map do |r| 246 | build_range to_utf8_enc(r.begin), to_utf8_enc(r.end) 247 | end.flatten! 
248 | end 249 | 250 | ## 251 | # Perform a 3-way comparison of the number of codepoints advertised by 252 | # the unicode spec for the given range, the originally parsed range, 253 | # and the resulting utf8 encoded range. 254 | 255 | def count_codepoints( code ) 256 | code.split(' ').inject(1) do |acc, elt| 257 | if elt =~ /0x(.+)\.\.0x(.+)/ 258 | if @encoding == :utf8 259 | acc * (from_utf8_enc($2) - from_utf8_enc($1) + 1) 260 | else 261 | acc * ($2.hex - $1.hex + 1) 262 | end 263 | else 264 | acc 265 | end 266 | end 267 | end 268 | 269 | def is_valid?( range, desc, codes ) 270 | spec_count = 1 271 | spec_count = $1.to_i if desc =~ /\[(\d+)\]/ 272 | range_count = range.end - range.begin + 1 273 | 274 | sum = codes.inject(0) { |acc, elt| acc + count_codepoints(elt) } 275 | sum == spec_count and sum == range_count 276 | end 277 | 278 | ## 279 | # Generate the state machine to stdout 280 | 281 | def generate_machine( name, property ) 282 | pipe = " " 283 | @output.puts " #{name} = " 284 | each_alpha( @chart_url, property ) do |range, desc| 285 | 286 | codes = (@encoding == :ucs4) ? to_ucs4(range) : to_utf8(range) 287 | 288 | raise "Invalid encoding of range #{range}: #{codes.inspect}" unless 289 | is_valid? range, desc, codes 290 | 291 | range_width = codes.map { |a| a.size }.max 292 | range_width = RANGE_WIDTH if range_width < RANGE_WIDTH 293 | 294 | desc_width = TOTAL_WIDTH - RANGE_WIDTH - 11 295 | desc_width -= (range_width - RANGE_WIDTH) if range_width > RANGE_WIDTH 296 | 297 | if desc.size > desc_width 298 | desc = desc[0..desc_width - 4] + "..." 299 | end 300 | 301 | codes.each_with_index do |r, idx| 302 | desc = "" unless idx.zero? 303 | code = "%-#{range_width}s" % r 304 | @output.puts " #{pipe} #{code} ##{desc}" 305 | pipe = "|" 306 | end 307 | end 308 | @output.puts " ;" 309 | @output.puts "" 310 | end 311 | 312 | @output.puts < 35 | ; 36 | 37 | LF = 38 | 0x0A #Cc 39 | ; 40 | 41 | Newline = 42 | 0x0B..0x0C #Cc [2] ..
43 | | 0xC2 0x85 #Cc 44 | | 0xE2 0x80 0xA8 #Zl LINE SEPARATOR 45 | | 0xE2 0x80 0xA9 #Zp PARAGRAPH SEPARATOR 46 | ; 47 | 48 | Extend = 49 | 0xCC 0x80..0xFF #Mn [112] COMBINING GRAVE ACCENT..COMBINING ... 50 | | 0xCD 0x00..0xAF # 51 | | 0xD2 0x83..0x87 #Mn [5] COMBINING CYRILLIC TITLO..COMBININ... 52 | | 0xD2 0x88..0x89 #Me [2] COMBINING CYRILLIC HUNDRED THOUSAN... 53 | | 0xD6 0x91..0xBD #Mn [45] HEBREW ACCENT ETNAHTA..HEBREW POIN... 54 | | 0xD6 0xBF #Mn HEBREW POINT RAFE 55 | | 0xD7 0x81..0x82 #Mn [2] HEBREW POINT SHIN DOT..HEBREW POIN... 56 | | 0xD7 0x84..0x85 #Mn [2] HEBREW MARK UPPER DOT..HEBREW MARK... 57 | | 0xD7 0x87 #Mn HEBREW POINT QAMATS QATAN 58 | | 0xD8 0x90..0x9A #Mn [11] ARABIC SIGN SALLALLAHOU ALAYHE WAS... 59 | | 0xD9 0x8B..0x9F #Mn [21] ARABIC FATHATAN..ARABIC WAVY HAMZA... 60 | | 0xD9 0xB0 #Mn ARABIC LETTER SUPERSCRIPT ALEF 61 | | 0xDB 0x96..0x9C #Mn [7] ARABIC SMALL HIGH LIGATURE SAD WIT... 62 | | 0xDB 0x9F..0xA4 #Mn [6] ARABIC SMALL HIGH ROUNDED ZERO..AR... 63 | | 0xDB 0xA7..0xA8 #Mn [2] ARABIC SMALL HIGH YEH..ARABIC SMAL... 64 | | 0xDB 0xAA..0xAD #Mn [4] ARABIC EMPTY CENTRE LOW STOP..ARAB... 65 | | 0xDC 0x91 #Mn SYRIAC LETTER SUPERSCRIPT ALAPH 66 | | 0xDC 0xB0..0xFF #Mn [27] SYRIAC PTHAHA ABOVE..SYRIAC BARREKH 67 | | 0xDD 0x00..0x8A # 68 | | 0xDE 0xA6..0xB0 #Mn [11] THAANA ABAFILI..THAANA SUKUN 69 | | 0xDF 0xAB..0xB3 #Mn [9] NKO COMBINING SHORT HIGH TONE..NKO... 70 | | 0xE0 0xA0 0x96..0x99 #Mn [4] SAMARITAN MARK IN..SAMARITAN MARK ... 71 | | 0xE0 0xA0 0x9B..0xA3 #Mn [9] SAMARITAN MARK EPENTHETIC YUT..SAM... 72 | | 0xE0 0xA0 0xA5..0xA7 #Mn [3] SAMARITAN VOWEL SIGN SHORT A..SAMA... 73 | | 0xE0 0xA0 0xA9..0xAD #Mn [5] SAMARITAN VOWEL SIGN LONG I..SAMAR... 74 | | 0xE0 0xA1 0x99..0x9B #Mn [3] MANDAIC AFFRICATION MARK..MANDAIC ... 75 | | 0xE0 0xA3 0xA3..0xFF #Mn [32] ARABIC TURNED DAMMA BELOW..DEVANAG... 
76 | | 0xE0 0xA4 0x00..0x82 # 77 | | 0xE0 0xA4 0x83 #Mc DEVANAGARI SIGN VISARGA 78 | | 0xE0 0xA4 0xBA #Mn DEVANAGARI VOWEL SIGN OE 79 | | 0xE0 0xA4 0xBB #Mc DEVANAGARI VOWEL SIGN OOE 80 | | 0xE0 0xA4 0xBC #Mn DEVANAGARI SIGN NUKTA 81 | | 0xE0 0xA4 0xBE..0xFF #Mc [3] DEVANAGARI VOWEL SIGN AA..DEVANAGA... 82 | | 0xE0 0xA5 0x00..0x80 # 83 | | 0xE0 0xA5 0x81..0x88 #Mn [8] DEVANAGARI VOWEL SIGN U..DEVANAGAR... 84 | | 0xE0 0xA5 0x89..0x8C #Mc [4] DEVANAGARI VOWEL SIGN CANDRA O..DE... 85 | | 0xE0 0xA5 0x8D #Mn DEVANAGARI SIGN VIRAMA 86 | | 0xE0 0xA5 0x8E..0x8F #Mc [2] DEVANAGARI VOWEL SIGN PRISHTHAMATR... 87 | | 0xE0 0xA5 0x91..0x97 #Mn [7] DEVANAGARI STRESS SIGN UDATTA..DEV... 88 | | 0xE0 0xA5 0xA2..0xA3 #Mn [2] DEVANAGARI VOWEL SIGN VOCALIC L..D... 89 | | 0xE0 0xA6 0x81 #Mn BENGALI SIGN CANDRABINDU 90 | | 0xE0 0xA6 0x82..0x83 #Mc [2] BENGALI SIGN ANUSVARA..BENGALI SIG... 91 | | 0xE0 0xA6 0xBC #Mn BENGALI SIGN NUKTA 92 | | 0xE0 0xA6 0xBE..0xFF #Mc [3] BENGALI VOWEL SIGN AA..BENGALI VOW... 93 | | 0xE0 0xA7 0x00..0x80 # 94 | | 0xE0 0xA7 0x81..0x84 #Mn [4] BENGALI VOWEL SIGN U..BENGALI VOWE... 95 | | 0xE0 0xA7 0x87..0x88 #Mc [2] BENGALI VOWEL SIGN E..BENGALI VOWE... 96 | | 0xE0 0xA7 0x8B..0x8C #Mc [2] BENGALI VOWEL SIGN O..BENGALI VOWE... 97 | | 0xE0 0xA7 0x8D #Mn BENGALI SIGN VIRAMA 98 | | 0xE0 0xA7 0x97 #Mc BENGALI AU LENGTH MARK 99 | | 0xE0 0xA7 0xA2..0xA3 #Mn [2] BENGALI VOWEL SIGN VOCALIC L..BENG... 100 | | 0xE0 0xA8 0x81..0x82 #Mn [2] GURMUKHI SIGN ADAK BINDI..GURMUKHI... 101 | | 0xE0 0xA8 0x83 #Mc GURMUKHI SIGN VISARGA 102 | | 0xE0 0xA8 0xBC #Mn GURMUKHI SIGN NUKTA 103 | | 0xE0 0xA8 0xBE..0xFF #Mc [3] GURMUKHI VOWEL SIGN AA..GURMUKHI V... 104 | | 0xE0 0xA9 0x00..0x80 # 105 | | 0xE0 0xA9 0x81..0x82 #Mn [2] GURMUKHI VOWEL SIGN U..GURMUKHI VO... 106 | | 0xE0 0xA9 0x87..0x88 #Mn [2] GURMUKHI VOWEL SIGN EE..GURMUKHI V... 107 | | 0xE0 0xA9 0x8B..0x8D #Mn [3] GURMUKHI VOWEL SIGN OO..GURMUKHI S... 
108 | | 0xE0 0xA9 0x91 #Mn GURMUKHI SIGN UDAAT 109 | | 0xE0 0xA9 0xB0..0xB1 #Mn [2] GURMUKHI TIPPI..GURMUKHI ADDAK 110 | | 0xE0 0xA9 0xB5 #Mn GURMUKHI SIGN YAKASH 111 | | 0xE0 0xAA 0x81..0x82 #Mn [2] GUJARATI SIGN CANDRABINDU..GUJARAT... 112 | | 0xE0 0xAA 0x83 #Mc GUJARATI SIGN VISARGA 113 | | 0xE0 0xAA 0xBC #Mn GUJARATI SIGN NUKTA 114 | | 0xE0 0xAA 0xBE..0xFF #Mc [3] GUJARATI VOWEL SIGN AA..GUJARATI V... 115 | | 0xE0 0xAB 0x00..0x80 # 116 | | 0xE0 0xAB 0x81..0x85 #Mn [5] GUJARATI VOWEL SIGN U..GUJARATI VO... 117 | | 0xE0 0xAB 0x87..0x88 #Mn [2] GUJARATI VOWEL SIGN E..GUJARATI VO... 118 | | 0xE0 0xAB 0x89 #Mc GUJARATI VOWEL SIGN CANDRA O 119 | | 0xE0 0xAB 0x8B..0x8C #Mc [2] GUJARATI VOWEL SIGN O..GUJARATI VO... 120 | | 0xE0 0xAB 0x8D #Mn GUJARATI SIGN VIRAMA 121 | | 0xE0 0xAB 0xA2..0xA3 #Mn [2] GUJARATI VOWEL SIGN VOCALIC L..GUJ... 122 | | 0xE0 0xAC 0x81 #Mn ORIYA SIGN CANDRABINDU 123 | | 0xE0 0xAC 0x82..0x83 #Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VI... 124 | | 0xE0 0xAC 0xBC #Mn ORIYA SIGN NUKTA 125 | | 0xE0 0xAC 0xBE #Mc ORIYA VOWEL SIGN AA 126 | | 0xE0 0xAC 0xBF #Mn ORIYA VOWEL SIGN I 127 | | 0xE0 0xAD 0x80 #Mc ORIYA VOWEL SIGN II 128 | | 0xE0 0xAD 0x81..0x84 #Mn [4] ORIYA VOWEL SIGN U..ORIYA VOWEL SI... 129 | | 0xE0 0xAD 0x87..0x88 #Mc [2] ORIYA VOWEL SIGN E..ORIYA VOWEL SI... 130 | | 0xE0 0xAD 0x8B..0x8C #Mc [2] ORIYA VOWEL SIGN O..ORIYA VOWEL SI... 131 | | 0xE0 0xAD 0x8D #Mn ORIYA SIGN VIRAMA 132 | | 0xE0 0xAD 0x96 #Mn ORIYA AI LENGTH MARK 133 | | 0xE0 0xAD 0x97 #Mc ORIYA AU LENGTH MARK 134 | | 0xE0 0xAD 0xA2..0xA3 #Mn [2] ORIYA VOWEL SIGN VOCALIC L..ORIYA ... 135 | | 0xE0 0xAE 0x82 #Mn TAMIL SIGN ANUSVARA 136 | | 0xE0 0xAE 0xBE..0xBF #Mc [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL S... 137 | | 0xE0 0xAF 0x80 #Mn TAMIL VOWEL SIGN II 138 | | 0xE0 0xAF 0x81..0x82 #Mc [2] TAMIL VOWEL SIGN U..TAMIL VOWEL SI... 139 | | 0xE0 0xAF 0x86..0x88 #Mc [3] TAMIL VOWEL SIGN E..TAMIL VOWEL SI... 140 | | 0xE0 0xAF 0x8A..0x8C #Mc [3] TAMIL VOWEL SIGN O..TAMIL VOWEL SI... 
141 | | 0xE0 0xAF 0x8D #Mn TAMIL SIGN VIRAMA 142 | | 0xE0 0xAF 0x97 #Mc TAMIL AU LENGTH MARK 143 | | 0xE0 0xB0 0x80 #Mn TELUGU SIGN COMBINING CANDRABINDU ... 144 | | 0xE0 0xB0 0x81..0x83 #Mc [3] TELUGU SIGN CANDRABINDU..TELUGU SI... 145 | | 0xE0 0xB0 0xBE..0xFF #Mn [3] TELUGU VOWEL SIGN AA..TELUGU VOWEL... 146 | | 0xE0 0xB1 0x00..0x80 # 147 | | 0xE0 0xB1 0x81..0x84 #Mc [4] TELUGU VOWEL SIGN U..TELUGU VOWEL ... 148 | | 0xE0 0xB1 0x86..0x88 #Mn [3] TELUGU VOWEL SIGN E..TELUGU VOWEL ... 149 | | 0xE0 0xB1 0x8A..0x8D #Mn [4] TELUGU VOWEL SIGN O..TELUGU SIGN V... 150 | | 0xE0 0xB1 0x95..0x96 #Mn [2] TELUGU LENGTH MARK..TELUGU AI LENG... 151 | | 0xE0 0xB1 0xA2..0xA3 #Mn [2] TELUGU VOWEL SIGN VOCALIC L..TELUG... 152 | | 0xE0 0xB2 0x81 #Mn KANNADA SIGN CANDRABINDU 153 | | 0xE0 0xB2 0x82..0x83 #Mc [2] KANNADA SIGN ANUSVARA..KANNADA SIG... 154 | | 0xE0 0xB2 0xBC #Mn KANNADA SIGN NUKTA 155 | | 0xE0 0xB2 0xBE #Mc KANNADA VOWEL SIGN AA 156 | | 0xE0 0xB2 0xBF #Mn KANNADA VOWEL SIGN I 157 | | 0xE0 0xB3 0x80..0x84 #Mc [5] KANNADA VOWEL SIGN II..KANNADA VOW... 158 | | 0xE0 0xB3 0x86 #Mn KANNADA VOWEL SIGN E 159 | | 0xE0 0xB3 0x87..0x88 #Mc [2] KANNADA VOWEL SIGN EE..KANNADA VOW... 160 | | 0xE0 0xB3 0x8A..0x8B #Mc [2] KANNADA VOWEL SIGN O..KANNADA VOWE... 161 | | 0xE0 0xB3 0x8C..0x8D #Mn [2] KANNADA VOWEL SIGN AU..KANNADA SIG... 162 | | 0xE0 0xB3 0x95..0x96 #Mc [2] KANNADA LENGTH MARK..KANNADA AI LE... 163 | | 0xE0 0xB3 0xA2..0xA3 #Mn [2] KANNADA VOWEL SIGN VOCALIC L..KANN... 164 | | 0xE0 0xB4 0x81 #Mn MALAYALAM SIGN CANDRABINDU 165 | | 0xE0 0xB4 0x82..0x83 #Mc [2] MALAYALAM SIGN ANUSVARA..MALAYALAM... 166 | | 0xE0 0xB4 0xBE..0xFF #Mc [3] MALAYALAM VOWEL SIGN AA..MALAYALAM... 167 | | 0xE0 0xB5 0x00..0x80 # 168 | | 0xE0 0xB5 0x81..0x84 #Mn [4] MALAYALAM VOWEL SIGN U..MALAYALAM ... 169 | | 0xE0 0xB5 0x86..0x88 #Mc [3] MALAYALAM VOWEL SIGN E..MALAYALAM ... 170 | | 0xE0 0xB5 0x8A..0x8C #Mc [3] MALAYALAM VOWEL SIGN O..MALAYALAM ... 
171 | | 0xE0 0xB5 0x8D #Mn MALAYALAM SIGN VIRAMA 172 | | 0xE0 0xB5 0x97 #Mc MALAYALAM AU LENGTH MARK 173 | | 0xE0 0xB5 0xA2..0xA3 #Mn [2] MALAYALAM VOWEL SIGN VOCALIC L..MA... 174 | | 0xE0 0xB6 0x82..0x83 #Mc [2] SINHALA SIGN ANUSVARAYA..SINHALA S... 175 | | 0xE0 0xB7 0x8A #Mn SINHALA SIGN AL-LAKUNA 176 | | 0xE0 0xB7 0x8F..0x91 #Mc [3] SINHALA VOWEL SIGN AELA-PILLA..SIN... 177 | | 0xE0 0xB7 0x92..0x94 #Mn [3] SINHALA VOWEL SIGN KETTI IS-PILLA.... 178 | | 0xE0 0xB7 0x96 #Mn SINHALA VOWEL SIGN DIGA PAA-PILLA 179 | | 0xE0 0xB7 0x98..0x9F #Mc [8] SINHALA VOWEL SIGN GAETTA-PILLA..S... 180 | | 0xE0 0xB7 0xB2..0xB3 #Mc [2] SINHALA VOWEL SIGN DIGA GAETTA-PIL... 181 | | 0xE0 0xB8 0xB1 #Mn THAI CHARACTER MAI HAN-AKAT 182 | | 0xE0 0xB8 0xB4..0xBA #Mn [7] THAI CHARACTER SARA I..THAI CHARAC... 183 | | 0xE0 0xB9 0x87..0x8E #Mn [8] THAI CHARACTER MAITAIKHU..THAI CHA... 184 | | 0xE0 0xBA 0xB1 #Mn LAO VOWEL SIGN MAI KAN 185 | | 0xE0 0xBA 0xB4..0xB9 #Mn [6] LAO VOWEL SIGN I..LAO VOWEL SIGN UU 186 | | 0xE0 0xBA 0xBB..0xBC #Mn [2] LAO VOWEL SIGN MAI KON..LAO SEMIVO... 187 | | 0xE0 0xBB 0x88..0x8D #Mn [6] LAO TONE MAI EK..LAO NIGGAHITA 188 | | 0xE0 0xBC 0x98..0x99 #Mn [2] TIBETAN ASTROLOGICAL SIGN -KHYUD P... 189 | | 0xE0 0xBC 0xB5 #Mn TIBETAN MARK NGAS BZUNG NYI ZLA 190 | | 0xE0 0xBC 0xB7 #Mn TIBETAN MARK NGAS BZUNG SGOR RTAGS 191 | | 0xE0 0xBC 0xB9 #Mn TIBETAN MARK TSA -PHRU 192 | | 0xE0 0xBC 0xBE..0xBF #Mc [2] TIBETAN SIGN YAR TSHES..TIBETAN SI... 193 | | 0xE0 0xBD 0xB1..0xBE #Mn [14] TIBETAN VOWEL SIGN AA..TIBETAN SIG... 194 | | 0xE0 0xBD 0xBF #Mc TIBETAN SIGN RNAM BCAD 195 | | 0xE0 0xBE 0x80..0x84 #Mn [5] TIBETAN VOWEL SIGN REVERSED I..TIB... 196 | | 0xE0 0xBE 0x86..0x87 #Mn [2] TIBETAN SIGN LCI RTAGS..TIBETAN SI... 197 | | 0xE0 0xBE 0x8D..0x97 #Mn [11] TIBETAN SUBJOINED SIGN LCE TSA CAN... 198 | | 0xE0 0xBE 0x99..0xBC #Mn [36] TIBETAN SUBJOINED LETTER NYA..TIBE... 
199 | | 0xE0 0xBF 0x86 #Mn TIBETAN SYMBOL PADMA GDAN 200 | | 0xE1 0x80 0xAB..0xAC #Mc [2] MYANMAR VOWEL SIGN TALL AA..MYANMA... 201 | | 0xE1 0x80 0xAD..0xB0 #Mn [4] MYANMAR VOWEL SIGN I..MYANMAR VOWE... 202 | | 0xE1 0x80 0xB1 #Mc MYANMAR VOWEL SIGN E 203 | | 0xE1 0x80 0xB2..0xB7 #Mn [6] MYANMAR VOWEL SIGN AI..MYANMAR SIG... 204 | | 0xE1 0x80 0xB8 #Mc MYANMAR SIGN VISARGA 205 | | 0xE1 0x80 0xB9..0xBA #Mn [2] MYANMAR SIGN VIRAMA..MYANMAR SIGN ... 206 | | 0xE1 0x80 0xBB..0xBC #Mc [2] MYANMAR CONSONANT SIGN MEDIAL YA..... 207 | | 0xE1 0x80 0xBD..0xBE #Mn [2] MYANMAR CONSONANT SIGN MEDIAL WA..... 208 | | 0xE1 0x81 0x96..0x97 #Mc [2] MYANMAR VOWEL SIGN VOCALIC R..MYAN... 209 | | 0xE1 0x81 0x98..0x99 #Mn [2] MYANMAR VOWEL SIGN VOCALIC L..MYAN... 210 | | 0xE1 0x81 0x9E..0xA0 #Mn [3] MYANMAR CONSONANT SIGN MON MEDIAL ... 211 | | 0xE1 0x81 0xA2..0xA4 #Mc [3] MYANMAR VOWEL SIGN SGAW KAREN EU..... 212 | | 0xE1 0x81 0xA7..0xAD #Mc [7] MYANMAR VOWEL SIGN WESTERN PWO KAR... 213 | | 0xE1 0x81 0xB1..0xB4 #Mn [4] MYANMAR VOWEL SIGN GEBA KAREN I..M... 214 | | 0xE1 0x82 0x82 #Mn MYANMAR CONSONANT SIGN SHAN MEDIAL WA 215 | | 0xE1 0x82 0x83..0x84 #Mc [2] MYANMAR VOWEL SIGN SHAN AA..MYANMA... 216 | | 0xE1 0x82 0x85..0x86 #Mn [2] MYANMAR VOWEL SIGN SHAN E ABOVE..M... 217 | | 0xE1 0x82 0x87..0x8C #Mc [6] MYANMAR SIGN SHAN TONE-2..MYANMAR ... 218 | | 0xE1 0x82 0x8D #Mn MYANMAR SIGN SHAN COUNCIL EMPHATIC... 219 | | 0xE1 0x82 0x8F #Mc MYANMAR SIGN RUMAI PALAUNG TONE-5 220 | | 0xE1 0x82 0x9A..0x9C #Mc [3] MYANMAR SIGN KHAMTI TONE-1..MYANMA... 221 | | 0xE1 0x82 0x9D #Mn MYANMAR VOWEL SIGN AITON AI 222 | | 0xE1 0x8D 0x9D..0x9F #Mn [3] ETHIOPIC COMBINING GEMINATION AND ... 223 | | 0xE1 0x9C 0x92..0x94 #Mn [3] TAGALOG VOWEL SIGN I..TAGALOG SIGN... 224 | | 0xE1 0x9C 0xB2..0xB4 #Mn [3] HANUNOO VOWEL SIGN I..HANUNOO SIGN... 225 | | 0xE1 0x9D 0x92..0x93 #Mn [2] BUHID VOWEL SIGN I..BUHID VOWEL SI... 226 | | 0xE1 0x9D 0xB2..0xB3 #Mn [2] TAGBANWA VOWEL SIGN I..TAGBANWA VO... 
227 | | 0xE1 0x9E 0xB4..0xB5 #Mn [2] KHMER VOWEL INHERENT AQ..KHMER VOW... 228 | | 0xE1 0x9E 0xB6 #Mc KHMER VOWEL SIGN AA 229 | | 0xE1 0x9E 0xB7..0xBD #Mn [7] KHMER VOWEL SIGN I..KHMER VOWEL SI... 230 | | 0xE1 0x9E 0xBE..0xFF #Mc [8] KHMER VOWEL SIGN OE..KHMER VOWEL S... 231 | | 0xE1 0x9F 0x00..0x85 # 232 | | 0xE1 0x9F 0x86 #Mn KHMER SIGN NIKAHIT 233 | | 0xE1 0x9F 0x87..0x88 #Mc [2] KHMER SIGN REAHMUK..KHMER SIGN YUU... 234 | | 0xE1 0x9F 0x89..0x93 #Mn [11] KHMER SIGN MUUSIKATOAN..KHMER SIGN... 235 | | 0xE1 0x9F 0x9D #Mn KHMER SIGN ATTHACAN 236 | | 0xE1 0xA0 0x8B..0x8D #Mn [3] MONGOLIAN FREE VARIATION SELECTOR ... 237 | | 0xE1 0xA2 0xA9 #Mn MONGOLIAN LETTER ALI GALI DAGALGA 238 | | 0xE1 0xA4 0xA0..0xA2 #Mn [3] LIMBU VOWEL SIGN A..LIMBU VOWEL SI... 239 | | 0xE1 0xA4 0xA3..0xA6 #Mc [4] LIMBU VOWEL SIGN EE..LIMBU VOWEL S... 240 | | 0xE1 0xA4 0xA7..0xA8 #Mn [2] LIMBU VOWEL SIGN E..LIMBU VOWEL SI... 241 | | 0xE1 0xA4 0xA9..0xAB #Mc [3] LIMBU SUBJOINED LETTER YA..LIMBU S... 242 | | 0xE1 0xA4 0xB0..0xB1 #Mc [2] LIMBU SMALL LETTER KA..LIMBU SMALL... 243 | | 0xE1 0xA4 0xB2 #Mn LIMBU SMALL LETTER ANUSVARA 244 | | 0xE1 0xA4 0xB3..0xB8 #Mc [6] LIMBU SMALL LETTER TA..LIMBU SMALL... 245 | | 0xE1 0xA4 0xB9..0xBB #Mn [3] LIMBU SIGN MUKPHRENG..LIMBU SIGN SA-I 246 | | 0xE1 0xA8 0x97..0x98 #Mn [2] BUGINESE VOWEL SIGN I..BUGINESE VO... 247 | | 0xE1 0xA8 0x99..0x9A #Mc [2] BUGINESE VOWEL SIGN E..BUGINESE VO... 248 | | 0xE1 0xA8 0x9B #Mn BUGINESE VOWEL SIGN AE 249 | | 0xE1 0xA9 0x95 #Mc TAI THAM CONSONANT SIGN MEDIAL RA 250 | | 0xE1 0xA9 0x96 #Mn TAI THAM CONSONANT SIGN MEDIAL LA 251 | | 0xE1 0xA9 0x97 #Mc TAI THAM CONSONANT SIGN LA TANG LAI 252 | | 0xE1 0xA9 0x98..0x9E #Mn [7] TAI THAM SIGN MAI KANG LAI..TAI TH... 253 | | 0xE1 0xA9 0xA0 #Mn TAI THAM SIGN SAKOT 254 | | 0xE1 0xA9 0xA1 #Mc TAI THAM VOWEL SIGN A 255 | | 0xE1 0xA9 0xA2 #Mn TAI THAM VOWEL SIGN MAI SAT 256 | | 0xE1 0xA9 0xA3..0xA4 #Mc [2] TAI THAM VOWEL SIGN AA..TAI THAM V... 
257 | | 0xE1 0xA9 0xA5..0xAC #Mn [8] TAI THAM VOWEL SIGN I..TAI THAM VO... 258 | | 0xE1 0xA9 0xAD..0xB2 #Mc [6] TAI THAM VOWEL SIGN OY..TAI THAM V... 259 | | 0xE1 0xA9 0xB3..0xBC #Mn [10] TAI THAM VOWEL SIGN OA ABOVE..TAI ... 260 | | 0xE1 0xA9 0xBF #Mn TAI THAM COMBINING CRYPTOGRAMMIC DOT 261 | | 0xE1 0xAA 0xB0..0xBD #Mn [14] COMBINING DOUBLED CIRCUMFLEX ACCEN... 262 | | 0xE1 0xAA 0xBE #Me COMBINING PARENTHESES OVERLAY 263 | | 0xE1 0xAC 0x80..0x83 #Mn [4] BALINESE SIGN ULU RICEM..BALINESE ... 264 | | 0xE1 0xAC 0x84 #Mc BALINESE SIGN BISAH 265 | | 0xE1 0xAC 0xB4 #Mn BALINESE SIGN REREKAN 266 | | 0xE1 0xAC 0xB5 #Mc BALINESE VOWEL SIGN TEDUNG 267 | | 0xE1 0xAC 0xB6..0xBA #Mn [5] BALINESE VOWEL SIGN ULU..BALINESE ... 268 | | 0xE1 0xAC 0xBB #Mc BALINESE VOWEL SIGN RA REPA TEDUNG 269 | | 0xE1 0xAC 0xBC #Mn BALINESE VOWEL SIGN LA LENGA 270 | | 0xE1 0xAC 0xBD..0xFF #Mc [5] BALINESE VOWEL SIGN LA LENGA TEDUN... 271 | | 0xE1 0xAD 0x00..0x81 # 272 | | 0xE1 0xAD 0x82 #Mn BALINESE VOWEL SIGN PEPET 273 | | 0xE1 0xAD 0x83..0x84 #Mc [2] BALINESE VOWEL SIGN PEPET TEDUNG..... 274 | | 0xE1 0xAD 0xAB..0xB3 #Mn [9] BALINESE MUSICAL SYMBOL COMBINING ... 275 | | 0xE1 0xAE 0x80..0x81 #Mn [2] SUNDANESE SIGN PANYECEK..SUNDANESE... 276 | | 0xE1 0xAE 0x82 #Mc SUNDANESE SIGN PANGWISAD 277 | | 0xE1 0xAE 0xA1 #Mc SUNDANESE CONSONANT SIGN PAMINGKAL 278 | | 0xE1 0xAE 0xA2..0xA5 #Mn [4] SUNDANESE CONSONANT SIGN PANYAKRA.... 279 | | 0xE1 0xAE 0xA6..0xA7 #Mc [2] SUNDANESE VOWEL SIGN PANAELAENG..S... 280 | | 0xE1 0xAE 0xA8..0xA9 #Mn [2] SUNDANESE VOWEL SIGN PAMEPET..SUND... 281 | | 0xE1 0xAE 0xAA #Mc SUNDANESE SIGN PAMAAEH 282 | | 0xE1 0xAE 0xAB..0xAD #Mn [3] SUNDANESE SIGN VIRAMA..SUNDANESE C... 283 | | 0xE1 0xAF 0xA6 #Mn BATAK SIGN TOMPI 284 | | 0xE1 0xAF 0xA7 #Mc BATAK VOWEL SIGN E 285 | | 0xE1 0xAF 0xA8..0xA9 #Mn [2] BATAK VOWEL SIGN PAKPAK E..BATAK V... 286 | | 0xE1 0xAF 0xAA..0xAC #Mc [3] BATAK VOWEL SIGN I..BATAK VOWEL SI... 
287 | | 0xE1 0xAF 0xAD #Mn BATAK VOWEL SIGN KARO O 288 | | 0xE1 0xAF 0xAE #Mc BATAK VOWEL SIGN U 289 | | 0xE1 0xAF 0xAF..0xB1 #Mn [3] BATAK VOWEL SIGN U FOR SIMALUNGUN ... 290 | | 0xE1 0xAF 0xB2..0xB3 #Mc [2] BATAK PANGOLAT..BATAK PANONGONAN 291 | | 0xE1 0xB0 0xA4..0xAB #Mc [8] LEPCHA SUBJOINED LETTER YA..LEPCHA... 292 | | 0xE1 0xB0 0xAC..0xB3 #Mn [8] LEPCHA VOWEL SIGN E..LEPCHA CONSON... 293 | | 0xE1 0xB0 0xB4..0xB5 #Mc [2] LEPCHA CONSONANT SIGN NYIN-DO..LEP... 294 | | 0xE1 0xB0 0xB6..0xB7 #Mn [2] LEPCHA SIGN RAN..LEPCHA SIGN NUKTA 295 | | 0xE1 0xB3 0x90..0x92 #Mn [3] VEDIC TONE KARSHANA..VEDIC TONE PR... 296 | | 0xE1 0xB3 0x94..0xA0 #Mn [13] VEDIC SIGN YAJURVEDIC MIDLINE SVAR... 297 | | 0xE1 0xB3 0xA1 #Mc VEDIC TONE ATHARVAVEDIC INDEPENDEN... 298 | | 0xE1 0xB3 0xA2..0xA8 #Mn [7] VEDIC SIGN VISARGA SVARITA..VEDIC ... 299 | | 0xE1 0xB3 0xAD #Mn VEDIC SIGN TIRYAK 300 | | 0xE1 0xB3 0xB2..0xB3 #Mc [2] VEDIC SIGN ARDHAVISARGA..VEDIC SIG... 301 | | 0xE1 0xB3 0xB4 #Mn VEDIC TONE CANDRA ABOVE 302 | | 0xE1 0xB3 0xB8..0xB9 #Mn [2] VEDIC TONE RING ABOVE..VEDIC TONE ... 303 | | 0xE1 0xB7 0x80..0xB5 #Mn [54] COMBINING DOTTED GRAVE ACCENT..COM... 304 | | 0xE1 0xB7 0xBC..0xBF #Mn [4] COMBINING DOUBLE INVERTED BREVE BE... 305 | | 0xE2 0x80 0x8C..0x8D #Cf [2] ZERO WIDTH NON-JOINER..ZERO WIDTH ... 306 | | 0xE2 0x83 0x90..0x9C #Mn [13] COMBINING LEFT HARPOON ABOVE..COMB... 307 | | 0xE2 0x83 0x9D..0xA0 #Me [4] COMBINING ENCLOSING CIRCLE..COMBIN... 308 | | 0xE2 0x83 0xA1 #Mn COMBINING LEFT RIGHT ARROW ABOVE 309 | | 0xE2 0x83 0xA2..0xA4 #Me [3] COMBINING ENCLOSING SCREEN..COMBIN... 310 | | 0xE2 0x83 0xA5..0xB0 #Mn [12] COMBINING REVERSE SOLIDUS OVERLAY.... 311 | | 0xE2 0xB3 0xAF..0xB1 #Mn [3] COPTIC COMBINING NI ABOVE..COPTIC ... 312 | | 0xE2 0xB5 0xBF #Mn TIFINAGH CONSONANT JOINER 313 | | 0xE2 0xB7 0xA0..0xBF #Mn [32] COMBINING CYRILLIC LETTER BE..COMB... 314 | | 0xE3 0x80 0xAA..0xAD #Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOG... 
315 | | 0xE3 0x80 0xAE..0xAF #Mc [2] HANGUL SINGLE DOT TONE MARK..HANGU... 316 | | 0xE3 0x82 0x99..0x9A #Mn [2] COMBINING KATAKANA-HIRAGANA VOICED... 317 | | 0xEA 0x99 0xAF #Mn COMBINING CYRILLIC VZMET 318 | | 0xEA 0x99 0xB0..0xB2 #Me [3] COMBINING CYRILLIC TEN MILLIONS SI... 319 | | 0xEA 0x99 0xB4..0xBD #Mn [10] COMBINING CYRILLIC LETTER UKRAINIA... 320 | | 0xEA 0x9A 0x9E..0x9F #Mn [2] COMBINING CYRILLIC LETTER EF..COMB... 321 | | 0xEA 0x9B 0xB0..0xB1 #Mn [2] BAMUM COMBINING MARK KOQNDON..BAMU... 322 | | 0xEA 0xA0 0x82 #Mn SYLOTI NAGRI SIGN DVISVARA 323 | | 0xEA 0xA0 0x86 #Mn SYLOTI NAGRI SIGN HASANTA 324 | | 0xEA 0xA0 0x8B #Mn SYLOTI NAGRI SIGN ANUSVARA 325 | | 0xEA 0xA0 0xA3..0xA4 #Mc [2] SYLOTI NAGRI VOWEL SIGN A..SYLOTI ... 326 | | 0xEA 0xA0 0xA5..0xA6 #Mn [2] SYLOTI NAGRI VOWEL SIGN U..SYLOTI ... 327 | | 0xEA 0xA0 0xA7 #Mc SYLOTI NAGRI VOWEL SIGN OO 328 | | 0xEA 0xA2 0x80..0x81 #Mc [2] SAURASHTRA SIGN ANUSVARA..SAURASHT... 329 | | 0xEA 0xA2 0xB4..0xFF #Mc [16] SAURASHTRA CONSONANT SIGN HAARU..S... 330 | | 0xEA 0xA3 0x00..0x83 # 331 | | 0xEA 0xA3 0x84 #Mn SAURASHTRA SIGN VIRAMA 332 | | 0xEA 0xA3 0xA0..0xB1 #Mn [18] COMBINING DEVANAGARI DIGIT ZERO..C... 333 | | 0xEA 0xA4 0xA6..0xAD #Mn [8] KAYAH LI VOWEL UE..KAYAH LI TONE C... 334 | | 0xEA 0xA5 0x87..0x91 #Mn [11] REJANG VOWEL SIGN I..REJANG CONSON... 335 | | 0xEA 0xA5 0x92..0x93 #Mc [2] REJANG CONSONANT SIGN H..REJANG VI... 336 | | 0xEA 0xA6 0x80..0x82 #Mn [3] JAVANESE SIGN PANYANGGA..JAVANESE ... 337 | | 0xEA 0xA6 0x83 #Mc JAVANESE SIGN WIGNYAN 338 | | 0xEA 0xA6 0xB3 #Mn JAVANESE SIGN CECAK TELU 339 | | 0xEA 0xA6 0xB4..0xB5 #Mc [2] JAVANESE VOWEL SIGN TARUNG..JAVANE... 340 | | 0xEA 0xA6 0xB6..0xB9 #Mn [4] JAVANESE VOWEL SIGN WULU..JAVANESE... 341 | | 0xEA 0xA6 0xBA..0xBB #Mc [2] JAVANESE VOWEL SIGN TALING..JAVANE... 342 | | 0xEA 0xA6 0xBC #Mn JAVANESE VOWEL SIGN PEPET 343 | | 0xEA 0xA6 0xBD..0xFF #Mc [4] JAVANESE CONSONANT SIGN KERET..JAV... 
344 | | 0xEA 0xA7 0x00..0x80 # 345 | | 0xEA 0xA7 0xA5 #Mn MYANMAR SIGN SHAN SAW 346 | | 0xEA 0xA8 0xA9..0xAE #Mn [6] CHAM VOWEL SIGN AA..CHAM VOWEL SIG... 347 | | 0xEA 0xA8 0xAF..0xB0 #Mc [2] CHAM VOWEL SIGN O..CHAM VOWEL SIGN AI 348 | | 0xEA 0xA8 0xB1..0xB2 #Mn [2] CHAM VOWEL SIGN AU..CHAM VOWEL SIG... 349 | | 0xEA 0xA8 0xB3..0xB4 #Mc [2] CHAM CONSONANT SIGN YA..CHAM CONSO... 350 | | 0xEA 0xA8 0xB5..0xB6 #Mn [2] CHAM CONSONANT SIGN LA..CHAM CONSO... 351 | | 0xEA 0xA9 0x83 #Mn CHAM CONSONANT SIGN FINAL NG 352 | | 0xEA 0xA9 0x8C #Mn CHAM CONSONANT SIGN FINAL M 353 | | 0xEA 0xA9 0x8D #Mc CHAM CONSONANT SIGN FINAL H 354 | | 0xEA 0xA9 0xBB #Mc MYANMAR SIGN PAO KAREN TONE 355 | | 0xEA 0xA9 0xBC #Mn MYANMAR SIGN TAI LAING TONE-2 356 | | 0xEA 0xA9 0xBD #Mc MYANMAR SIGN TAI LAING TONE-5 357 | | 0xEA 0xAA 0xB0 #Mn TAI VIET MAI KANG 358 | | 0xEA 0xAA 0xB2..0xB4 #Mn [3] TAI VIET VOWEL I..TAI VIET VOWEL U 359 | | 0xEA 0xAA 0xB7..0xB8 #Mn [2] TAI VIET MAI KHIT..TAI VIET VOWEL IA 360 | | 0xEA 0xAA 0xBE..0xBF #Mn [2] TAI VIET VOWEL AM..TAI VIET TONE M... 361 | | 0xEA 0xAB 0x81 #Mn TAI VIET TONE MAI THO 362 | | 0xEA 0xAB 0xAB #Mc MEETEI MAYEK VOWEL SIGN II 363 | | 0xEA 0xAB 0xAC..0xAD #Mn [2] MEETEI MAYEK VOWEL SIGN UU..MEETEI... 364 | | 0xEA 0xAB 0xAE..0xAF #Mc [2] MEETEI MAYEK VOWEL SIGN AU..MEETEI... 365 | | 0xEA 0xAB 0xB5 #Mc MEETEI MAYEK VOWEL SIGN VISARGA 366 | | 0xEA 0xAB 0xB6 #Mn MEETEI MAYEK VIRAMA 367 | | 0xEA 0xAF 0xA3..0xA4 #Mc [2] MEETEI MAYEK VOWEL SIGN ONAP..MEET... 368 | | 0xEA 0xAF 0xA5 #Mn MEETEI MAYEK VOWEL SIGN ANAP 369 | | 0xEA 0xAF 0xA6..0xA7 #Mc [2] MEETEI MAYEK VOWEL SIGN YENAP..MEE... 370 | | 0xEA 0xAF 0xA8 #Mn MEETEI MAYEK VOWEL SIGN UNAP 371 | | 0xEA 0xAF 0xA9..0xAA #Mc [2] MEETEI MAYEK VOWEL SIGN CHEINAP..M... 372 | | 0xEA 0xAF 0xAC #Mc MEETEI MAYEK LUM IYEK 373 | | 0xEA 0xAF 0xAD #Mn MEETEI MAYEK APUN IYEK 374 | | 0xEF 0xAC 0x9E #Mn HEBREW POINT JUDEO-SPANISH VARIKA 375 | | 0xEF 0xB8 0x80..0x8F #Mn [16] VARIATION SELECTOR-1..VARIATION SE... 
376 | | 0xEF 0xB8 0xA0..0xAF #Mn [16] COMBINING LIGATURE LEFT HALF..COMB... 377 | | 0xEF 0xBE 0x9E..0x9F #Lm [2] HALFWIDTH KATAKANA VOICED SOUND MA... 378 | | 0xF0 0x90 0x87 0xBD #Mn PHAISTOS DISC SIGN COMBINING OBLIQ... 379 | | 0xF0 0x90 0x8B 0xA0 #Mn COPTIC EPACT THOUSANDS MARK 380 | | 0xF0 0x90 0x8D 0xB6..0xBA #Mn [5] COMBINING OLD PERMIC LETTER AN.... 381 | | 0xF0 0x90 0xA8 0x81..0x83 #Mn [3] KHAROSHTHI VOWEL SIGN I..KHAROS... 382 | | 0xF0 0x90 0xA8 0x85..0x86 #Mn [2] KHAROSHTHI VOWEL SIGN E..KHAROS... 383 | | 0xF0 0x90 0xA8 0x8C..0x8F #Mn [4] KHAROSHTHI VOWEL LENGTH MARK..K... 384 | | 0xF0 0x90 0xA8 0xB8..0xBA #Mn [3] KHAROSHTHI SIGN BAR ABOVE..KHAR... 385 | | 0xF0 0x90 0xA8 0xBF #Mn KHAROSHTHI VIRAMA 386 | | 0xF0 0x90 0xAB 0xA5..0xA6 #Mn [2] MANICHAEAN ABBREVIATION MARK AB... 387 | | 0xF0 0x91 0x80 0x80 #Mc BRAHMI SIGN CANDRABINDU 388 | | 0xF0 0x91 0x80 0x81 #Mn BRAHMI SIGN ANUSVARA 389 | | 0xF0 0x91 0x80 0x82 #Mc BRAHMI SIGN VISARGA 390 | | 0xF0 0x91 0x80 0xB8..0xFF #Mn [15] BRAHMI VOWEL SIGN AA..BRAHMI VI... 391 | | 0xF0 0x91 0x81 0x00..0x86 # 392 | | 0xF0 0x91 0x81 0xBF..0xFF #Mn [3] BRAHMI NUMBER JOINER..KAITHI SI... 393 | | 0xF0 0x91 0x82 0x00..0x81 # 394 | | 0xF0 0x91 0x82 0x82 #Mc KAITHI SIGN VISARGA 395 | | 0xF0 0x91 0x82 0xB0..0xB2 #Mc [3] KAITHI VOWEL SIGN AA..KAITHI VO... 396 | | 0xF0 0x91 0x82 0xB3..0xB6 #Mn [4] KAITHI VOWEL SIGN U..KAITHI VOW... 397 | | 0xF0 0x91 0x82 0xB7..0xB8 #Mc [2] KAITHI VOWEL SIGN O..KAITHI VOW... 398 | | 0xF0 0x91 0x82 0xB9..0xBA #Mn [2] KAITHI SIGN VIRAMA..KAITHI SIGN... 399 | | 0xF0 0x91 0x84 0x80..0x82 #Mn [3] CHAKMA SIGN CANDRABINDU..CHAKMA... 400 | | 0xF0 0x91 0x84 0xA7..0xAB #Mn [5] CHAKMA VOWEL SIGN A..CHAKMA VOW... 401 | | 0xF0 0x91 0x84 0xAC #Mc CHAKMA VOWEL SIGN E 402 | | 0xF0 0x91 0x84 0xAD..0xB4 #Mn [8] CHAKMA VOWEL SIGN AI..CHAKMA MA... 403 | | 0xF0 0x91 0x85 0xB3 #Mn MAHAJANI SIGN NUKTA 404 | | 0xF0 0x91 0x86 0x80..0x81 #Mn [2] SHARADA SIGN CANDRABINDU..SHARA... 
405 | | 0xF0 0x91 0x86 0x82 #Mc SHARADA SIGN VISARGA 406 | | 0xF0 0x91 0x86 0xB3..0xB5 #Mc [3] SHARADA VOWEL SIGN AA..SHARADA ... 407 | | 0xF0 0x91 0x86 0xB6..0xBE #Mn [9] SHARADA VOWEL SIGN U..SHARADA V... 408 | | 0xF0 0x91 0x86 0xBF..0xFF #Mc [2] SHARADA VOWEL SIGN AU..SHARADA ... 409 | | 0xF0 0x91 0x87 0x00..0x80 # 410 | | 0xF0 0x91 0x87 0x8A..0x8C #Mn [3] SHARADA SIGN NUKTA..SHARADA EXT... 411 | | 0xF0 0x91 0x88 0xAC..0xAE #Mc [3] KHOJKI VOWEL SIGN AA..KHOJKI VO... 412 | | 0xF0 0x91 0x88 0xAF..0xB1 #Mn [3] KHOJKI VOWEL SIGN U..KHOJKI VOW... 413 | | 0xF0 0x91 0x88 0xB2..0xB3 #Mc [2] KHOJKI VOWEL SIGN O..KHOJKI VOW... 414 | | 0xF0 0x91 0x88 0xB4 #Mn KHOJKI SIGN ANUSVARA 415 | | 0xF0 0x91 0x88 0xB5 #Mc KHOJKI SIGN VIRAMA 416 | | 0xF0 0x91 0x88 0xB6..0xB7 #Mn [2] KHOJKI SIGN NUKTA..KHOJKI SIGN ... 417 | | 0xF0 0x91 0x8B 0x9F #Mn KHUDAWADI SIGN ANUSVARA 418 | | 0xF0 0x91 0x8B 0xA0..0xA2 #Mc [3] KHUDAWADI VOWEL SIGN AA..KHUDAW... 419 | | 0xF0 0x91 0x8B 0xA3..0xAA #Mn [8] KHUDAWADI VOWEL SIGN U..KHUDAWA... 420 | | 0xF0 0x91 0x8C 0x80..0x81 #Mn [2] GRANTHA SIGN COMBINING ANUSVARA... 421 | | 0xF0 0x91 0x8C 0x82..0x83 #Mc [2] GRANTHA SIGN ANUSVARA..GRANTHA ... 422 | | 0xF0 0x91 0x8C 0xBC #Mn GRANTHA SIGN NUKTA 423 | | 0xF0 0x91 0x8C 0xBE..0xBF #Mc [2] GRANTHA VOWEL SIGN AA..GRANTHA ... 424 | | 0xF0 0x91 0x8D 0x80 #Mn GRANTHA VOWEL SIGN II 425 | | 0xF0 0x91 0x8D 0x81..0x84 #Mc [4] GRANTHA VOWEL SIGN U..GRANTHA V... 426 | | 0xF0 0x91 0x8D 0x87..0x88 #Mc [2] GRANTHA VOWEL SIGN EE..GRANTHA ... 427 | | 0xF0 0x91 0x8D 0x8B..0x8D #Mc [3] GRANTHA VOWEL SIGN OO..GRANTHA ... 428 | | 0xF0 0x91 0x8D 0x97 #Mc GRANTHA AU LENGTH MARK 429 | | 0xF0 0x91 0x8D 0xA2..0xA3 #Mc [2] GRANTHA VOWEL SIGN VOCALIC L..G... 430 | | 0xF0 0x91 0x8D 0xA6..0xAC #Mn [7] COMBINING GRANTHA DIGIT ZERO..C... 431 | | 0xF0 0x91 0x8D 0xB0..0xB4 #Mn [5] COMBINING GRANTHA LETTER A..COM... 432 | | 0xF0 0x91 0x92 0xB0..0xB2 #Mc [3] TIRHUTA VOWEL SIGN AA..TIRHUTA ... 
433 | | 0xF0 0x91 0x92 0xB3..0xB8 #Mn [6] TIRHUTA VOWEL SIGN U..TIRHUTA V... 434 | | 0xF0 0x91 0x92 0xB9 #Mc TIRHUTA VOWEL SIGN E 435 | | 0xF0 0x91 0x92 0xBA #Mn TIRHUTA VOWEL SIGN SHORT E 436 | | 0xF0 0x91 0x92 0xBB..0xBE #Mc [4] TIRHUTA VOWEL SIGN AI..TIRHUTA ... 437 | | 0xF0 0x91 0x92 0xBF..0xFF #Mn [2] TIRHUTA SIGN CANDRABINDU..TIRHU... 438 | | 0xF0 0x91 0x93 0x00..0x80 # 439 | | 0xF0 0x91 0x93 0x81 #Mc TIRHUTA SIGN VISARGA 440 | | 0xF0 0x91 0x93 0x82..0x83 #Mn [2] TIRHUTA SIGN VIRAMA..TIRHUTA SI... 441 | | 0xF0 0x91 0x96 0xAF..0xB1 #Mc [3] SIDDHAM VOWEL SIGN AA..SIDDHAM ... 442 | | 0xF0 0x91 0x96 0xB2..0xB5 #Mn [4] SIDDHAM VOWEL SIGN U..SIDDHAM V... 443 | | 0xF0 0x91 0x96 0xB8..0xBB #Mc [4] SIDDHAM VOWEL SIGN E..SIDDHAM V... 444 | | 0xF0 0x91 0x96 0xBC..0xBD #Mn [2] SIDDHAM SIGN CANDRABINDU..SIDDH... 445 | | 0xF0 0x91 0x96 0xBE #Mc SIDDHAM SIGN VISARGA 446 | | 0xF0 0x91 0x96 0xBF..0xFF #Mn [2] SIDDHAM SIGN VIRAMA..SIDDHAM SI... 447 | | 0xF0 0x91 0x97 0x00..0x80 # 448 | | 0xF0 0x91 0x97 0x9C..0x9D #Mn [2] SIDDHAM VOWEL SIGN ALTERNATE U.... 449 | | 0xF0 0x91 0x98 0xB0..0xB2 #Mc [3] MODI VOWEL SIGN AA..MODI VOWEL ... 450 | | 0xF0 0x91 0x98 0xB3..0xBA #Mn [8] MODI VOWEL SIGN U..MODI VOWEL S... 451 | | 0xF0 0x91 0x98 0xBB..0xBC #Mc [2] MODI VOWEL SIGN O..MODI VOWEL S... 452 | | 0xF0 0x91 0x98 0xBD #Mn MODI SIGN ANUSVARA 453 | | 0xF0 0x91 0x98 0xBE #Mc MODI SIGN VISARGA 454 | | 0xF0 0x91 0x98 0xBF..0xFF #Mn [2] MODI SIGN VIRAMA..MODI SIGN ARD... 455 | | 0xF0 0x91 0x99 0x00..0x80 # 456 | | 0xF0 0x91 0x9A 0xAB #Mn TAKRI SIGN ANUSVARA 457 | | 0xF0 0x91 0x9A 0xAC #Mc TAKRI SIGN VISARGA 458 | | 0xF0 0x91 0x9A 0xAD #Mn TAKRI VOWEL SIGN AA 459 | | 0xF0 0x91 0x9A 0xAE..0xAF #Mc [2] TAKRI VOWEL SIGN I..TAKRI VOWEL... 460 | | 0xF0 0x91 0x9A 0xB0..0xB5 #Mn [6] TAKRI VOWEL SIGN U..TAKRI VOWEL... 461 | | 0xF0 0x91 0x9A 0xB6 #Mc TAKRI SIGN VIRAMA 462 | | 0xF0 0x91 0x9A 0xB7 #Mn TAKRI SIGN NUKTA 463 | | 0xF0 0x91 0x9C 0x9D..0x9F #Mn [3] AHOM CONSONANT SIGN MEDIAL LA..... 
464 | | 0xF0 0x91 0x9C 0xA0..0xA1 #Mc [2] AHOM VOWEL SIGN A..AHOM VOWEL S... 465 | | 0xF0 0x91 0x9C 0xA2..0xA5 #Mn [4] AHOM VOWEL SIGN I..AHOM VOWEL S... 466 | | 0xF0 0x91 0x9C 0xA6 #Mc AHOM VOWEL SIGN E 467 | | 0xF0 0x91 0x9C 0xA7..0xAB #Mn [5] AHOM VOWEL SIGN AW..AHOM SIGN K... 468 | | 0xF0 0x96 0xAB 0xB0..0xB4 #Mn [5] BASSA VAH COMBINING HIGH TONE..... 469 | | 0xF0 0x96 0xAC 0xB0..0xB6 #Mn [7] PAHAWH HMONG MARK CIM TUB..PAHA... 470 | | 0xF0 0x96 0xBD 0x91..0xBE #Mc [46] MIAO SIGN ASPIRATION..MIAO VOWE... 471 | | 0xF0 0x96 0xBE 0x8F..0x92 #Mn [4] MIAO TONE RIGHT..MIAO TONE BELOW 472 | | 0xF0 0x9B 0xB2 0x9D..0x9E #Mn [2] DUPLOYAN THICK LETTER SELECTOR.... 473 | | 0xF0 0x9D 0x85 0xA5..0xA6 #Mc [2] MUSICAL SYMBOL COMBINING STEM..... 474 | | 0xF0 0x9D 0x85 0xA7..0xA9 #Mn [3] MUSICAL SYMBOL COMBINING TREMOL... 475 | | 0xF0 0x9D 0x85 0xAD..0xB2 #Mc [6] MUSICAL SYMBOL COMBINING AUGMEN... 476 | | 0xF0 0x9D 0x85 0xBB..0xFF #Mn [8] MUSICAL SYMBOL COMBINING ACCENT... 477 | | 0xF0 0x9D 0x86 0x00..0x82 # 478 | | 0xF0 0x9D 0x86 0x85..0x8B #Mn [7] MUSICAL SYMBOL COMBINING DOIT..... 479 | | 0xF0 0x9D 0x86 0xAA..0xAD #Mn [4] MUSICAL SYMBOL COMBINING DOWN B... 480 | | 0xF0 0x9D 0x89 0x82..0x84 #Mn [3] COMBINING GREEK MUSICAL TRISEME... 481 | | 0xF0 0x9D 0xA8 0x80..0xB6 #Mn [55] SIGNWRITING HEAD RIM..SIGNWRITI... 482 | | 0xF0 0x9D 0xA8 0xBB..0xFF #Mn [50] SIGNWRITING MOUTH CLOSED NEUTRA... 483 | | 0xF0 0x9D 0xA9 0x00..0xAC # 484 | | 0xF0 0x9D 0xA9 0xB5 #Mn SIGNWRITING UPPER BODY TILTING FRO... 485 | | 0xF0 0x9D 0xAA 0x84 #Mn SIGNWRITING LOCATION HEAD NECK 486 | | 0xF0 0x9D 0xAA 0x9B..0x9F #Mn [5] SIGNWRITING FILL MODIFIER-2..SI... 487 | | 0xF0 0x9D 0xAA 0xA1..0xAF #Mn [15] SIGNWRITING ROTATION MODIFIER-2... 488 | | 0xF0 0x9E 0xA3 0x90..0x96 #Mn [7] MENDE KIKAKUI COMBINING NUMBER ... 489 | | 0xF3 0xA0 0x84 0x80..0xFF #Mn [240] VARIATION SELECTOR-17..VA... 
490 | | 0xF3 0xA0 0x85..0x86 0x00..0xFF # 491 | | 0xF3 0xA0 0x87 0x00..0xAF # 492 | ; 493 | 494 | Format = 495 | 0xC2 0xAD #Cf SOFT HYPHEN 496 | | 0xD8 0x80..0x85 #Cf [6] ARABIC NUMBER SIGN..ARABIC NUMBER ... 497 | | 0xD8 0x9C #Cf ARABIC LETTER MARK 498 | | 0xDB 0x9D #Cf ARABIC END OF AYAH 499 | | 0xDC 0x8F #Cf SYRIAC ABBREVIATION MARK 500 | | 0xE1 0xA0 0x8E #Cf MONGOLIAN VOWEL SEPARATOR 501 | | 0xE2 0x80 0x8E..0x8F #Cf [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT ... 502 | | 0xE2 0x80 0xAA..0xAE #Cf [5] LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-... 503 | | 0xE2 0x81 0xA0..0xA4 #Cf [5] WORD JOINER..INVISIBLE PLUS 504 | | 0xE2 0x81 0xA6..0xAF #Cf [10] LEFT-TO-RIGHT ISOLATE..NOMINAL DIG... 505 | | 0xEF 0xBB 0xBF #Cf ZERO WIDTH NO-BREAK SPACE 506 | | 0xEF 0xBF 0xB9..0xBB #Cf [3] INTERLINEAR ANNOTATION ANCHOR..INT... 507 | | 0xF0 0x91 0x82 0xBD #Cf KAITHI NUMBER SIGN 508 | | 0xF0 0x9B 0xB2 0xA0..0xA3 #Cf [4] SHORTHAND FORMAT LETTER OVERLAP... 509 | | 0xF0 0x9D 0x85 0xB3..0xBA #Cf [8] MUSICAL SYMBOL BEGIN BEAM..MUSI... 510 | | 0xF3 0xA0 0x80 0x81 #Cf LANGUAGE TAG 511 | | 0xF3 0xA0 0x80 0xA0..0xFF #Cf [96] TAG SPACE..CANCEL TAG 512 | | 0xF3 0xA0 0x81 0x00..0xBF # 513 | ; 514 | 515 | Katakana = 516 | 0xE3 0x80 0xB1..0xB5 #Lm [5] VERTICAL KANA REPEAT MARK..VERTICA... 517 | | 0xE3 0x82 0x9B..0x9C #Sk [2] KATAKANA-HIRAGANA VOICED SOUND MAR... 518 | | 0xE3 0x82 0xA0 #Pd KATAKANA-HIRAGANA DOUBLE HYPHEN 519 | | 0xE3 0x82 0xA1..0xFF #Lo [90] KATAKANA LETTER SMALL A..KATAKANA ... 520 | | 0xE3 0x83 0x00..0xBA # 521 | | 0xE3 0x83 0xBC..0xBE #Lm [3] KATAKANA-HIRAGANA PROLONGED SOUND ... 522 | | 0xE3 0x83 0xBF #Lo KATAKANA DIGRAPH KOTO 523 | | 0xE3 0x87 0xB0..0xBF #Lo [16] KATAKANA LETTER SMALL KU..KATAKANA... 524 | | 0xE3 0x8B 0x90..0xBE #So [47] CIRCLED KATAKANA A..CIRCLED KATAKA... 525 | | 0xE3 0x8C 0x80..0xFF #So [88] SQUARE APAATO..SQUARE WATTO 526 | | 0xE3 0x8D 0x00..0x97 # 527 | | 0xEF 0xBD 0xA6..0xAF #Lo [10] HALFWIDTH KATAKANA LETTER WO..HALF... 
528 | | 0xEF 0xBD 0xB0 #Lm HALFWIDTH KATAKANA-HIRAGANA PROLON... 529 | | 0xEF 0xBD 0xB1..0xFF #Lo [45] HALFWIDTH KATAKANA LETTER A..HALFW... 530 | | 0xEF 0xBE 0x00..0x9D # 531 | | 0xF0 0x9B 0x80 0x80 #Lo KATAKANA LETTER ARCHAIC E 532 | ; 533 | 534 | ALetter = 535 | 0x41..0x5A #L& [26] LATIN CAPITAL LETTER A..LATIN CAPI... 536 | | 0x61..0x7A #L& [26] LATIN SMALL LETTER A..LATIN SMALL ... 537 | | 0xC2 0xAA #Lo FEMININE ORDINAL INDICATOR 538 | | 0xC2 0xB5 #L& MICRO SIGN 539 | | 0xC2 0xBA #Lo MASCULINE ORDINAL INDICATOR 540 | | 0xC3 0x80..0x96 #L& [23] LATIN CAPITAL LETTER A WITH GRAVE.... 541 | | 0xC3 0x98..0xB6 #L& [31] LATIN CAPITAL LETTER O WITH STROKE... 542 | | 0xC3 0xB8..0xFF #L& [195] LATIN SMALL LETTER O WITH STROKE..... 543 | | 0xC4..0xC5 0x00..0xFF # 544 | | 0xC6 0x00..0xBA # 545 | | 0xC6 0xBB #Lo LATIN LETTER TWO WITH STROKE 546 | | 0xC6 0xBC..0xBF #L& [4] LATIN CAPITAL LETTER TONE FIVE..LA... 547 | | 0xC7 0x80..0x83 #Lo [4] LATIN LETTER DENTAL CLICK..LATIN L... 548 | | 0xC7 0x84..0xFF #L& [208] LATIN CAPITAL LETTER DZ WITH CARON... 549 | | 0xC8..0xC9 0x00..0xFF # 550 | | 0xCA 0x00..0x93 # 551 | | 0xCA 0x94 #Lo LATIN LETTER GLOTTAL STOP 552 | | 0xCA 0x95..0xAF #L& [27] LATIN LETTER PHARYNGEAL VOICED FRI... 553 | | 0xCA 0xB0..0xFF #Lm [18] MODIFIER LETTER SMALL H..MODIFIER ... 554 | | 0xCB 0x00..0x81 # 555 | | 0xCB 0x86..0x91 #Lm [12] MODIFIER LETTER CIRCUMFLEX ACCENT.... 556 | | 0xCB 0xA0..0xA4 #Lm [5] MODIFIER LETTER SMALL GAMMA..MODIF... 557 | | 0xCB 0xAC #Lm MODIFIER LETTER VOICING 558 | | 0xCB 0xAE #Lm MODIFIER LETTER DOUBLE APOSTROPHE 559 | | 0xCD 0xB0..0xB3 #L& [4] GREEK CAPITAL LETTER HETA..GREEK S... 560 | | 0xCD 0xB4 #Lm GREEK NUMERAL SIGN 561 | | 0xCD 0xB6..0xB7 #L& [2] GREEK CAPITAL LETTER PAMPHYLIAN DI... 562 | | 0xCD 0xBA #Lm GREEK YPOGEGRAMMENI 563 | | 0xCD 0xBB..0xBD #L& [3] GREEK SMALL REVERSED LUNATE SIGMA ... 
564 | | 0xCD 0xBF #L& GREEK CAPITAL LETTER YOT 565 | | 0xCE 0x86 #L& GREEK CAPITAL LETTER ALPHA WITH TONOS 566 | | 0xCE 0x88..0x8A #L& [3] GREEK CAPITAL LETTER EPSILON WITH ... 567 | | 0xCE 0x8C #L& GREEK CAPITAL LETTER OMICRON WITH ... 568 | | 0xCE 0x8E..0xA1 #L& [20] GREEK CAPITAL LETTER UPSILON WITH ... 569 | | 0xCE 0xA3..0xFF #L& [83] GREEK CAPITAL LETTER SIGMA..GREEK ... 570 | | 0xCF 0x00..0xB5 # 571 | | 0xCF 0xB7..0xFF #L& [139] GREEK CAPITAL LETTER SHO..CYRILLIC... 572 | | 0xD0..0xD1 0x00..0xFF # 573 | | 0xD2 0x00..0x81 # 574 | | 0xD2 0x8A..0xFF #L& [166] CYRILLIC CAPITAL LETTER SHORT I WI... 575 | | 0xD3..0xD3 0x00..0xFF # 576 | | 0xD4 0x00..0xAF # 577 | | 0xD4 0xB1..0xFF #L& [38] ARMENIAN CAPITAL LETTER AYB..ARMEN... 578 | | 0xD5 0x00..0x96 # 579 | | 0xD5 0x99 #Lm ARMENIAN MODIFIER LETTER LEFT HALF... 580 | | 0xD5 0xA1..0xFF #L& [39] ARMENIAN SMALL LETTER AYB..ARMENIA... 581 | | 0xD6 0x00..0x87 # 582 | | 0xD7 0xB3 #Po HEBREW PUNCTUATION GERESH 583 | | 0xD8 0xA0..0xBF #Lo [32] ARABIC LETTER KASHMIRI YEH..ARABIC... 584 | | 0xD9 0x80 #Lm ARABIC TATWEEL 585 | | 0xD9 0x81..0x8A #Lo [10] ARABIC LETTER FEH..ARABIC LETTER YEH 586 | | 0xD9 0xAE..0xAF #Lo [2] ARABIC LETTER DOTLESS BEH..ARABIC ... 587 | | 0xD9 0xB1..0xFF #Lo [99] ARABIC LETTER ALEF WASLA..ARABIC L... 588 | | 0xDA..0xDA 0x00..0xFF # 589 | | 0xDB 0x00..0x93 # 590 | | 0xDB 0x95 #Lo ARABIC LETTER AE 591 | | 0xDB 0xA5..0xA6 #Lm [2] ARABIC SMALL WAW..ARABIC SMALL YEH 592 | | 0xDB 0xAE..0xAF #Lo [2] ARABIC LETTER DAL WITH INVERTED V.... 593 | | 0xDB 0xBA..0xBC #Lo [3] ARABIC LETTER SHEEN WITH DOT BELOW... 594 | | 0xDB 0xBF #Lo ARABIC LETTER HEH WITH INVERTED V 595 | | 0xDC 0x90 #Lo SYRIAC LETTER ALAPH 596 | | 0xDC 0x92..0xAF #Lo [30] SYRIAC LETTER BETH..SYRIAC LETTER ... 597 | | 0xDD 0x8D..0xFF #Lo [89] SYRIAC LETTER SOGDIAN ZHAIN..THAAN... 
598 | | 0xDE 0x00..0xA5 # 599 | | 0xDE 0xB1 #Lo THAANA LETTER NAA 600 | | 0xDF 0x8A..0xAA #Lo [33] NKO LETTER A..NKO LETTER JONA RA 601 | | 0xDF 0xB4..0xB5 #Lm [2] NKO HIGH TONE APOSTROPHE..NKO LOW ... 602 | | 0xDF 0xBA #Lm NKO LAJANYALAN 603 | | 0xE0 0xA0 0x80..0x95 #Lo [22] SAMARITAN LETTER ALAF..SAMARITAN L... 604 | | 0xE0 0xA0 0x9A #Lm SAMARITAN MODIFIER LETTER EPENTHET... 605 | | 0xE0 0xA0 0xA4 #Lm SAMARITAN MODIFIER LETTER SHORT A 606 | | 0xE0 0xA0 0xA8 #Lm SAMARITAN MODIFIER LETTER I 607 | | 0xE0 0xA1 0x80..0x98 #Lo [25] MANDAIC LETTER HALQA..MANDAIC LETT... 608 | | 0xE0 0xA2 0xA0..0xB4 #Lo [21] ARABIC LETTER BEH WITH SMALL V BEL... 609 | | 0xE0 0xA4 0x84..0xB9 #Lo [54] DEVANAGARI LETTER SHORT A..DEVANAG... 610 | | 0xE0 0xA4 0xBD #Lo DEVANAGARI SIGN AVAGRAHA 611 | | 0xE0 0xA5 0x90 #Lo DEVANAGARI OM 612 | | 0xE0 0xA5 0x98..0xA1 #Lo [10] DEVANAGARI LETTER QA..DEVANAGARI L... 613 | | 0xE0 0xA5 0xB1 #Lm DEVANAGARI SIGN HIGH SPACING DOT 614 | | 0xE0 0xA5 0xB2..0xFF #Lo [15] DEVANAGARI LETTER CANDRA A..BENGAL... 615 | | 0xE0 0xA6 0x00..0x80 # 616 | | 0xE0 0xA6 0x85..0x8C #Lo [8] BENGALI LETTER A..BENGALI LETTER V... 617 | | 0xE0 0xA6 0x8F..0x90 #Lo [2] BENGALI LETTER E..BENGALI LETTER AI 618 | | 0xE0 0xA6 0x93..0xA8 #Lo [22] BENGALI LETTER O..BENGALI LETTER NA 619 | | 0xE0 0xA6 0xAA..0xB0 #Lo [7] BENGALI LETTER PA..BENGALI LETTER RA 620 | | 0xE0 0xA6 0xB2 #Lo BENGALI LETTER LA 621 | | 0xE0 0xA6 0xB6..0xB9 #Lo [4] BENGALI LETTER SHA..BENGALI LETTER HA 622 | | 0xE0 0xA6 0xBD #Lo BENGALI SIGN AVAGRAHA 623 | | 0xE0 0xA7 0x8E #Lo BENGALI LETTER KHANDA TA 624 | | 0xE0 0xA7 0x9C..0x9D #Lo [2] BENGALI LETTER RRA..BENGALI LETTER... 625 | | 0xE0 0xA7 0x9F..0xA1 #Lo [3] BENGALI LETTER YYA..BENGALI LETTER... 626 | | 0xE0 0xA7 0xB0..0xB1 #Lo [2] BENGALI LETTER RA WITH MIDDLE DIAG... 627 | | 0xE0 0xA8 0x85..0x8A #Lo [6] GURMUKHI LETTER A..GURMUKHI LETTER UU 628 | | 0xE0 0xA8 0x8F..0x90 #Lo [2] GURMUKHI LETTER EE..GURMUKHI LETTE... 
629 | | 0xE0 0xA8 0x93..0xA8 #Lo [22] GURMUKHI LETTER OO..GURMUKHI LETTE... 630 | | 0xE0 0xA8 0xAA..0xB0 #Lo [7] GURMUKHI LETTER PA..GURMUKHI LETTE... 631 | | 0xE0 0xA8 0xB2..0xB3 #Lo [2] GURMUKHI LETTER LA..GURMUKHI LETTE... 632 | | 0xE0 0xA8 0xB5..0xB6 #Lo [2] GURMUKHI LETTER VA..GURMUKHI LETTE... 633 | | 0xE0 0xA8 0xB8..0xB9 #Lo [2] GURMUKHI LETTER SA..GURMUKHI LETTE... 634 | | 0xE0 0xA9 0x99..0x9C #Lo [4] GURMUKHI LETTER KHHA..GURMUKHI LET... 635 | | 0xE0 0xA9 0x9E #Lo GURMUKHI LETTER FA 636 | | 0xE0 0xA9 0xB2..0xB4 #Lo [3] GURMUKHI IRI..GURMUKHI EK ONKAR 637 | | 0xE0 0xAA 0x85..0x8D #Lo [9] GUJARATI LETTER A..GUJARATI VOWEL ... 638 | | 0xE0 0xAA 0x8F..0x91 #Lo [3] GUJARATI LETTER E..GUJARATI VOWEL ... 639 | | 0xE0 0xAA 0x93..0xA8 #Lo [22] GUJARATI LETTER O..GUJARATI LETTER NA 640 | | 0xE0 0xAA 0xAA..0xB0 #Lo [7] GUJARATI LETTER PA..GUJARATI LETTE... 641 | | 0xE0 0xAA 0xB2..0xB3 #Lo [2] GUJARATI LETTER LA..GUJARATI LETTE... 642 | | 0xE0 0xAA 0xB5..0xB9 #Lo [5] GUJARATI LETTER VA..GUJARATI LETTE... 643 | | 0xE0 0xAA 0xBD #Lo GUJARATI SIGN AVAGRAHA 644 | | 0xE0 0xAB 0x90 #Lo GUJARATI OM 645 | | 0xE0 0xAB 0xA0..0xA1 #Lo [2] GUJARATI LETTER VOCALIC RR..GUJARA... 646 | | 0xE0 0xAB 0xB9 #Lo GUJARATI LETTER ZHA 647 | | 0xE0 0xAC 0x85..0x8C #Lo [8] ORIYA LETTER A..ORIYA LETTER VOCAL... 648 | | 0xE0 0xAC 0x8F..0x90 #Lo [2] ORIYA LETTER E..ORIYA LETTER AI 649 | | 0xE0 0xAC 0x93..0xA8 #Lo [22] ORIYA LETTER O..ORIYA LETTER NA 650 | | 0xE0 0xAC 0xAA..0xB0 #Lo [7] ORIYA LETTER PA..ORIYA LETTER RA 651 | | 0xE0 0xAC 0xB2..0xB3 #Lo [2] ORIYA LETTER LA..ORIYA LETTER LLA 652 | | 0xE0 0xAC 0xB5..0xB9 #Lo [5] ORIYA LETTER VA..ORIYA LETTER HA 653 | | 0xE0 0xAC 0xBD #Lo ORIYA SIGN AVAGRAHA 654 | | 0xE0 0xAD 0x9C..0x9D #Lo [2] ORIYA LETTER RRA..ORIYA LETTER RHA 655 | | 0xE0 0xAD 0x9F..0xA1 #Lo [3] ORIYA LETTER YYA..ORIYA LETTER VOC... 
656 | | 0xE0 0xAD 0xB1 #Lo ORIYA LETTER WA 657 | | 0xE0 0xAE 0x83 #Lo TAMIL SIGN VISARGA 658 | | 0xE0 0xAE 0x85..0x8A #Lo [6] TAMIL LETTER A..TAMIL LETTER UU 659 | | 0xE0 0xAE 0x8E..0x90 #Lo [3] TAMIL LETTER E..TAMIL LETTER AI 660 | | 0xE0 0xAE 0x92..0x95 #Lo [4] TAMIL LETTER O..TAMIL LETTER KA 661 | | 0xE0 0xAE 0x99..0x9A #Lo [2] TAMIL LETTER NGA..TAMIL LETTER CA 662 | | 0xE0 0xAE 0x9C #Lo TAMIL LETTER JA 663 | | 0xE0 0xAE 0x9E..0x9F #Lo [2] TAMIL LETTER NYA..TAMIL LETTER TTA 664 | | 0xE0 0xAE 0xA3..0xA4 #Lo [2] TAMIL LETTER NNA..TAMIL LETTER TA 665 | | 0xE0 0xAE 0xA8..0xAA #Lo [3] TAMIL LETTER NA..TAMIL LETTER PA 666 | | 0xE0 0xAE 0xAE..0xB9 #Lo [12] TAMIL LETTER MA..TAMIL LETTER HA 667 | | 0xE0 0xAF 0x90 #Lo TAMIL OM 668 | | 0xE0 0xB0 0x85..0x8C #Lo [8] TELUGU LETTER A..TELUGU LETTER VOC... 669 | | 0xE0 0xB0 0x8E..0x90 #Lo [3] TELUGU LETTER E..TELUGU LETTER AI 670 | | 0xE0 0xB0 0x92..0xA8 #Lo [23] TELUGU LETTER O..TELUGU LETTER NA 671 | | 0xE0 0xB0 0xAA..0xB9 #Lo [16] TELUGU LETTER PA..TELUGU LETTER HA 672 | | 0xE0 0xB0 0xBD #Lo TELUGU SIGN AVAGRAHA 673 | | 0xE0 0xB1 0x98..0x9A #Lo [3] TELUGU LETTER TSA..TELUGU LETTER RRRA 674 | | 0xE0 0xB1 0xA0..0xA1 #Lo [2] TELUGU LETTER VOCALIC RR..TELUGU L... 675 | | 0xE0 0xB2 0x85..0x8C #Lo [8] KANNADA LETTER A..KANNADA LETTER V... 676 | | 0xE0 0xB2 0x8E..0x90 #Lo [3] KANNADA LETTER E..KANNADA LETTER AI 677 | | 0xE0 0xB2 0x92..0xA8 #Lo [23] KANNADA LETTER O..KANNADA LETTER NA 678 | | 0xE0 0xB2 0xAA..0xB3 #Lo [10] KANNADA LETTER PA..KANNADA LETTER LLA 679 | | 0xE0 0xB2 0xB5..0xB9 #Lo [5] KANNADA LETTER VA..KANNADA LETTER HA 680 | | 0xE0 0xB2 0xBD #Lo KANNADA SIGN AVAGRAHA 681 | | 0xE0 0xB3 0x9E #Lo KANNADA LETTER FA 682 | | 0xE0 0xB3 0xA0..0xA1 #Lo [2] KANNADA LETTER VOCALIC RR..KANNADA... 683 | | 0xE0 0xB3 0xB1..0xB2 #Lo [2] KANNADA SIGN JIHVAMULIYA..KANNADA ... 684 | | 0xE0 0xB4 0x85..0x8C #Lo [8] MALAYALAM LETTER A..MALAYALAM LETT... 685 | | 0xE0 0xB4 0x8E..0x90 #Lo [3] MALAYALAM LETTER E..MALAYALAM LETT... 
686 | | 0xE0 0xB4 0x92..0xBA #Lo [41] MALAYALAM LETTER O..MALAYALAM LETT... 687 | | 0xE0 0xB4 0xBD #Lo MALAYALAM SIGN AVAGRAHA 688 | | 0xE0 0xB5 0x8E #Lo MALAYALAM LETTER DOT REPH 689 | | 0xE0 0xB5 0x9F..0xA1 #Lo [3] MALAYALAM LETTER ARCHAIC II..MALAY... 690 | | 0xE0 0xB5 0xBA..0xBF #Lo [6] MALAYALAM LETTER CHILLU NN..MALAYA... 691 | | 0xE0 0xB6 0x85..0x96 #Lo [18] SINHALA LETTER AYANNA..SINHALA LET... 692 | | 0xE0 0xB6 0x9A..0xB1 #Lo [24] SINHALA LETTER ALPAPRAANA KAYANNA.... 693 | | 0xE0 0xB6 0xB3..0xBB #Lo [9] SINHALA LETTER SANYAKA DAYANNA..SI... 694 | | 0xE0 0xB6 0xBD #Lo SINHALA LETTER DANTAJA LAYANNA 695 | | 0xE0 0xB7 0x80..0x86 #Lo [7] SINHALA LETTER VAYANNA..SINHALA LE... 696 | | 0xE0 0xBC 0x80 #Lo TIBETAN SYLLABLE OM 697 | | 0xE0 0xBD 0x80..0x87 #Lo [8] TIBETAN LETTER KA..TIBETAN LETTER JA 698 | | 0xE0 0xBD 0x89..0xAC #Lo [36] TIBETAN LETTER NYA..TIBETAN LETTER... 699 | | 0xE0 0xBE 0x88..0x8C #Lo [5] TIBETAN SIGN LCE TSA CAN..TIBETAN ... 700 | | 0xE1 0x82 0xA0..0xFF #L& [38] GEORGIAN CAPITAL LETTER AN..GEORGI... 701 | | 0xE1 0x83 0x00..0x85 # 702 | | 0xE1 0x83 0x87 #L& GEORGIAN CAPITAL LETTER YN 703 | | 0xE1 0x83 0x8D #L& GEORGIAN CAPITAL LETTER AEN 704 | | 0xE1 0x83 0x90..0xBA #Lo [43] GEORGIAN LETTER AN..GEORGIAN LETTE... 705 | | 0xE1 0x83 0xBC #Lm MODIFIER LETTER GEORGIAN NAR 706 | | 0xE1 0x83 0xBD..0xFF #Lo [332] GEORGIAN LETTER AEN..ETHIOPIC ... 707 | | 0xE1 0x84..0x88 0x00..0xFF # 708 | | 0xE1 0x89 0x00..0x88 # 709 | | 0xE1 0x89 0x8A..0x8D #Lo [4] ETHIOPIC SYLLABLE QWI..ETHIOPIC SY... 710 | | 0xE1 0x89 0x90..0x96 #Lo [7] ETHIOPIC SYLLABLE QHA..ETHIOPIC SY... 711 | | 0xE1 0x89 0x98 #Lo ETHIOPIC SYLLABLE QHWA 712 | | 0xE1 0x89 0x9A..0x9D #Lo [4] ETHIOPIC SYLLABLE QHWI..ETHIOPIC S... 713 | | 0xE1 0x89 0xA0..0xFF #Lo [41] ETHIOPIC SYLLABLE BA..ETHIOPIC SYL... 714 | | 0xE1 0x8A 0x00..0x88 # 715 | | 0xE1 0x8A 0x8A..0x8D #Lo [4] ETHIOPIC SYLLABLE XWI..ETHIOPIC SY... 716 | | 0xE1 0x8A 0x90..0xB0 #Lo [33] ETHIOPIC SYLLABLE NA..ETHIOPIC SYL... 
717 | | 0xE1 0x8A 0xB2..0xB5 #Lo [4] ETHIOPIC SYLLABLE KWI..ETHIOPIC SY... 718 | | 0xE1 0x8A 0xB8..0xBE #Lo [7] ETHIOPIC SYLLABLE KXA..ETHIOPIC SY... 719 | | 0xE1 0x8B 0x80 #Lo ETHIOPIC SYLLABLE KXWA 720 | | 0xE1 0x8B 0x82..0x85 #Lo [4] ETHIOPIC SYLLABLE KXWI..ETHIOPIC S... 721 | | 0xE1 0x8B 0x88..0x96 #Lo [15] ETHIOPIC SYLLABLE WA..ETHIOPIC SYL... 722 | | 0xE1 0x8B 0x98..0xFF #Lo [57] ETHIOPIC SYLLABLE ZA..ETHIOPIC SYL... 723 | | 0xE1 0x8C 0x00..0x90 # 724 | | 0xE1 0x8C 0x92..0x95 #Lo [4] ETHIOPIC SYLLABLE GWI..ETHIOPIC SY... 725 | | 0xE1 0x8C 0x98..0xFF #Lo [67] ETHIOPIC SYLLABLE GGA..ETHIOPIC SY... 726 | | 0xE1 0x8D 0x00..0x9A # 727 | | 0xE1 0x8E 0x80..0x8F #Lo [16] ETHIOPIC SYLLABLE SEBATBEIT MWA..E... 728 | | 0xE1 0x8E 0xA0..0xFF #L& [86] CHEROKEE LETTER A..CHEROKEE LETTER MV 729 | | 0xE1 0x8F 0x00..0xB5 # 730 | | 0xE1 0x8F 0xB8..0xBD #L& [6] CHEROKEE SMALL LETTER YE..CHEROKEE... 731 | | 0xE1 0x90 0x81..0xFF #Lo [620] CANADIAN SYLLABICS E..CANADIAN... 732 | | 0xE1 0x91..0x98 0x00..0xFF # 733 | | 0xE1 0x99 0x00..0xAC # 734 | | 0xE1 0x99 0xAF..0xBF #Lo [17] CANADIAN SYLLABICS QAI..CANADIAN S... 735 | | 0xE1 0x9A 0x81..0x9A #Lo [26] OGHAM LETTER BEITH..OGHAM LETTER P... 736 | | 0xE1 0x9A 0xA0..0xFF #Lo [75] RUNIC LETTER FEHU FEOH FE F..RUNIC... 737 | | 0xE1 0x9B 0x00..0xAA # 738 | | 0xE1 0x9B 0xAE..0xB0 #Nl [3] RUNIC ARLAUG SYMBOL..RUNIC BELGTHO... 739 | | 0xE1 0x9B 0xB1..0xB8 #Lo [8] RUNIC LETTER K..RUNIC LETTER FRANK... 740 | | 0xE1 0x9C 0x80..0x8C #Lo [13] TAGALOG LETTER A..TAGALOG LETTER YA 741 | | 0xE1 0x9C 0x8E..0x91 #Lo [4] TAGALOG LETTER LA..TAGALOG LETTER HA 742 | | 0xE1 0x9C 0xA0..0xB1 #Lo [18] HANUNOO LETTER A..HANUNOO LETTER HA 743 | | 0xE1 0x9D 0x80..0x91 #Lo [18] BUHID LETTER A..BUHID LETTER HA 744 | | 0xE1 0x9D 0xA0..0xAC #Lo [13] TAGBANWA LETTER A..TAGBANWA LETTER YA 745 | | 0xE1 0x9D 0xAE..0xB0 #Lo [3] TAGBANWA LETTER LA..TAGBANWA LETTE... 746 | | 0xE1 0xA0 0xA0..0xFF #Lo [35] MONGOLIAN LETTER A..MONGOLIAN LETT... 
747 | | 0xE1 0xA1 0x00..0x82 # 748 | | 0xE1 0xA1 0x83 #Lm MONGOLIAN LETTER TODO LONG VOWEL SIGN 749 | | 0xE1 0xA1 0x84..0xB7 #Lo [52] MONGOLIAN LETTER TODO E..MONGOLIAN... 750 | | 0xE1 0xA2 0x80..0xA8 #Lo [41] MONGOLIAN LETTER ALI GALI ANUSVARA... 751 | | 0xE1 0xA2 0xAA #Lo MONGOLIAN LETTER MANCHU ALI GALI LHA 752 | | 0xE1 0xA2 0xB0..0xFF #Lo [70] CANADIAN SYLLABICS OY..CANADIAN SY... 753 | | 0xE1 0xA3 0x00..0xB5 # 754 | | 0xE1 0xA4 0x80..0x9E #Lo [31] LIMBU VOWEL-CARRIER LETTER..LIMBU ... 755 | | 0xE1 0xA8 0x80..0x96 #Lo [23] BUGINESE LETTER KA..BUGINESE LETTE... 756 | | 0xE1 0xAC 0x85..0xB3 #Lo [47] BALINESE LETTER AKARA..BALINESE LE... 757 | | 0xE1 0xAD 0x85..0x8B #Lo [7] BALINESE LETTER KAF SASAK..BALINES... 758 | | 0xE1 0xAE 0x83..0xA0 #Lo [30] SUNDANESE LETTER A..SUNDANESE LETT... 759 | | 0xE1 0xAE 0xAE..0xAF #Lo [2] SUNDANESE LETTER KHA..SUNDANESE LE... 760 | | 0xE1 0xAE 0xBA..0xFF #Lo [44] SUNDANESE AVAGRAHA..BATAK LETTER U 761 | | 0xE1 0xAF 0x00..0xA5 # 762 | | 0xE1 0xB0 0x80..0xA3 #Lo [36] LEPCHA LETTER KA..LEPCHA LETTER A 763 | | 0xE1 0xB1 0x8D..0x8F #Lo [3] LEPCHA LETTER TTA..LEPCHA LETTER DDA 764 | | 0xE1 0xB1 0x9A..0xB7 #Lo [30] OL CHIKI LETTER LA..OL CHIKI LETTE... 765 | | 0xE1 0xB1 0xB8..0xBD #Lm [6] OL CHIKI MU TTUDDAG..OL CHIKI AHAD 766 | | 0xE1 0xB3 0xA9..0xAC #Lo [4] VEDIC SIGN ANUSVARA ANTARGOMUKHA..... 767 | | 0xE1 0xB3 0xAE..0xB1 #Lo [4] VEDIC SIGN HEXIFORM LONG ANUSVARA.... 768 | | 0xE1 0xB3 0xB5..0xB6 #Lo [2] VEDIC SIGN JIHVAMULIYA..VEDIC SIGN... 769 | | 0xE1 0xB4 0x80..0xAB #L& [44] LATIN LETTER SMALL CAPITAL A..CYRI... 770 | | 0xE1 0xB4 0xAC..0xFF #Lm [63] MODIFIER LETTER CAPITAL A..GREEK S... 771 | | 0xE1 0xB5 0x00..0xAA # 772 | | 0xE1 0xB5 0xAB..0xB7 #L& [13] LATIN SMALL LETTER UE..LATIN SMALL... 773 | | 0xE1 0xB5 0xB8 #Lm MODIFIER LETTER CYRILLIC EN 774 | | 0xE1 0xB5 0xB9..0xFF #L& [34] LATIN SMALL LETTER INSULAR G..LATI... 775 | | 0xE1 0xB6 0x00..0x9A # 776 | | 0xE1 0xB6 0x9B..0xBF #Lm [37] MODIFIER LETTER SMALL TURNED ALPHA... 
777 | | 0xE1 0xB8 0x80..0xFF #L& [278] LATIN CAPITAL LETTER A WITH RI... 778 | | 0xE1 0xB9..0xBB 0x00..0xFF # 779 | | 0xE1 0xBC 0x00..0x95 # 780 | | 0xE1 0xBC 0x98..0x9D #L& [6] GREEK CAPITAL LETTER EPSILON WITH ... 781 | | 0xE1 0xBC 0xA0..0xFF #L& [38] GREEK SMALL LETTER ETA WITH PSILI.... 782 | | 0xE1 0xBD 0x00..0x85 # 783 | | 0xE1 0xBD 0x88..0x8D #L& [6] GREEK CAPITAL LETTER OMICRON WITH ... 784 | | 0xE1 0xBD 0x90..0x97 #L& [8] GREEK SMALL LETTER UPSILON WITH PS... 785 | | 0xE1 0xBD 0x99 #L& GREEK CAPITAL LETTER UPSILON WITH ... 786 | | 0xE1 0xBD 0x9B #L& GREEK CAPITAL LETTER UPSILON WITH ... 787 | | 0xE1 0xBD 0x9D #L& GREEK CAPITAL LETTER UPSILON WITH ... 788 | | 0xE1 0xBD 0x9F..0xBD #L& [31] GREEK CAPITAL LETTER UPSILON WITH ... 789 | | 0xE1 0xBE 0x80..0xB4 #L& [53] GREEK SMALL LETTER ALPHA WITH PSIL... 790 | | 0xE1 0xBE 0xB6..0xBC #L& [7] GREEK SMALL LETTER ALPHA WITH PERI... 791 | | 0xE1 0xBE 0xBE #L& GREEK PROSGEGRAMMENI 792 | | 0xE1 0xBF 0x82..0x84 #L& [3] GREEK SMALL LETTER ETA WITH VARIA ... 793 | | 0xE1 0xBF 0x86..0x8C #L& [7] GREEK SMALL LETTER ETA WITH PERISP... 794 | | 0xE1 0xBF 0x90..0x93 #L& [4] GREEK SMALL LETTER IOTA WITH VRACH... 795 | | 0xE1 0xBF 0x96..0x9B #L& [6] GREEK SMALL LETTER IOTA WITH PERIS... 796 | | 0xE1 0xBF 0xA0..0xAC #L& [13] GREEK SMALL LETTER UPSILON WITH VR... 797 | | 0xE1 0xBF 0xB2..0xB4 #L& [3] GREEK SMALL LETTER OMEGA WITH VARI... 798 | | 0xE1 0xBF 0xB6..0xBC #L& [7] GREEK SMALL LETTER OMEGA WITH PERI... 799 | | 0xE2 0x81 0xB1 #Lm SUPERSCRIPT LATIN SMALL LETTER I 800 | | 0xE2 0x81 0xBF #Lm SUPERSCRIPT LATIN SMALL LETTER N 801 | | 0xE2 0x82 0x90..0x9C #Lm [13] LATIN SUBSCRIPT SMALL LETTER A..LA... 802 | | 0xE2 0x84 0x82 #L& DOUBLE-STRUCK CAPITAL C 803 | | 0xE2 0x84 0x87 #L& EULER CONSTANT 804 | | 0xE2 0x84 0x8A..0x93 #L& [10] SCRIPT SMALL G..SCRIPT SMALL L 805 | | 0xE2 0x84 0x95 #L& DOUBLE-STRUCK CAPITAL N 806 | | 0xE2 0x84 0x99..0x9D #L& [5] DOUBLE-STRUCK CAPITAL P..DOUBLE-ST... 
807 | | 0xE2 0x84 0xA4 #L& DOUBLE-STRUCK CAPITAL Z 808 | | 0xE2 0x84 0xA6 #L& OHM SIGN 809 | | 0xE2 0x84 0xA8 #L& BLACK-LETTER CAPITAL Z 810 | | 0xE2 0x84 0xAA..0xAD #L& [4] KELVIN SIGN..BLACK-LETTER CAPITAL C 811 | | 0xE2 0x84 0xAF..0xB4 #L& [6] SCRIPT SMALL E..SCRIPT SMALL O 812 | | 0xE2 0x84 0xB5..0xB8 #Lo [4] ALEF SYMBOL..DALET SYMBOL 813 | | 0xE2 0x84 0xB9 #L& INFORMATION SOURCE 814 | | 0xE2 0x84 0xBC..0xBF #L& [4] DOUBLE-STRUCK SMALL PI..DOUBLE-STR... 815 | | 0xE2 0x85 0x85..0x89 #L& [5] DOUBLE-STRUCK ITALIC CAPITAL D..DO... 816 | | 0xE2 0x85 0x8E #L& TURNED SMALL F 817 | | 0xE2 0x85 0xA0..0xFF #Nl [35] ROMAN NUMERAL ONE..ROMAN NUMERAL T... 818 | | 0xE2 0x86 0x00..0x82 # 819 | | 0xE2 0x86 0x83..0x84 #L& [2] ROMAN NUMERAL REVERSED ONE HUNDRED... 820 | | 0xE2 0x86 0x85..0x88 #Nl [4] ROMAN NUMERAL SIX LATE FORM..ROMAN... 821 | | 0xE2 0x92 0xB6..0xFF #So [52] CIRCLED LATIN CAPITAL LETTER A..CI... 822 | | 0xE2 0x93 0x00..0xA9 # 823 | | 0xE2 0xB0 0x80..0xAE #L& [47] GLAGOLITIC CAPITAL LETTER AZU..GLA... 824 | | 0xE2 0xB0 0xB0..0xFF #L& [47] GLAGOLITIC SMALL LETTER AZU..GLAGO... 825 | | 0xE2 0xB1 0x00..0x9E # 826 | | 0xE2 0xB1 0xA0..0xBB #L& [28] LATIN CAPITAL LETTER L WITH DOUBLE... 827 | | 0xE2 0xB1 0xBC..0xBD #Lm [2] LATIN SUBSCRIPT SMALL LETTER J..MO... 828 | | 0xE2 0xB1 0xBE..0xFF #L& [103] LATIN CAPITAL LETTER S WITH SW... 829 | | 0xE2 0xB2..0xB2 0x00..0xFF # 830 | | 0xE2 0xB3 0x00..0xA4 # 831 | | 0xE2 0xB3 0xAB..0xAE #L& [4] COPTIC CAPITAL LETTER CRYPTOGRAMMI... 832 | | 0xE2 0xB3 0xB2..0xB3 #L& [2] COPTIC CAPITAL LETTER BOHAIRIC KHE... 833 | | 0xE2 0xB4 0x80..0xA5 #L& [38] GEORGIAN SMALL LETTER AN..GEORGIAN... 834 | | 0xE2 0xB4 0xA7 #L& GEORGIAN SMALL LETTER YN 835 | | 0xE2 0xB4 0xAD #L& GEORGIAN SMALL LETTER AEN 836 | | 0xE2 0xB4 0xB0..0xFF #Lo [56] TIFINAGH LETTER YA..TIFINAGH LETTE... 837 | | 0xE2 0xB5 0x00..0xA7 # 838 | | 0xE2 0xB5 0xAF #Lm TIFINAGH MODIFIER LETTER LABIALIZA... 839 | | 0xE2 0xB6 0x80..0x96 #Lo [23] ETHIOPIC SYLLABLE LOA..ETHIOPIC SY... 
840 | | 0xE2 0xB6 0xA0..0xA6 #Lo [7] ETHIOPIC SYLLABLE SSA..ETHIOPIC SY... 841 | | 0xE2 0xB6 0xA8..0xAE #Lo [7] ETHIOPIC SYLLABLE CCA..ETHIOPIC SY... 842 | | 0xE2 0xB6 0xB0..0xB6 #Lo [7] ETHIOPIC SYLLABLE ZZA..ETHIOPIC SY... 843 | | 0xE2 0xB6 0xB8..0xBE #Lo [7] ETHIOPIC SYLLABLE CCHA..ETHIOPIC S... 844 | | 0xE2 0xB7 0x80..0x86 #Lo [7] ETHIOPIC SYLLABLE QYA..ETHIOPIC SY... 845 | | 0xE2 0xB7 0x88..0x8E #Lo [7] ETHIOPIC SYLLABLE KYA..ETHIOPIC SY... 846 | | 0xE2 0xB7 0x90..0x96 #Lo [7] ETHIOPIC SYLLABLE XYA..ETHIOPIC SY... 847 | | 0xE2 0xB7 0x98..0x9E #Lo [7] ETHIOPIC SYLLABLE GYA..ETHIOPIC SY... 848 | | 0xE2 0xB8 0xAF #Lm VERTICAL TILDE 849 | | 0xE3 0x80 0x85 #Lm IDEOGRAPHIC ITERATION MARK 850 | | 0xE3 0x80 0xBB #Lm VERTICAL IDEOGRAPHIC ITERATION MARK 851 | | 0xE3 0x80 0xBC #Lo MASU MARK 852 | | 0xE3 0x84 0x85..0xAD #Lo [41] BOPOMOFO LETTER B..BOPOMOFO LETTER IH 853 | | 0xE3 0x84 0xB1..0xFF #Lo [94] HANGUL LETTER KIYEOK..HANGUL L... 854 | | 0xE3 0x85..0x85 0x00..0xFF # 855 | | 0xE3 0x86 0x00..0x8E # 856 | | 0xE3 0x86 0xA0..0xBA #Lo [27] BOPOMOFO LETTER BU..BOPOMOFO LETTE... 857 | | 0xEA 0x80 0x80..0x94 #Lo [21] YI SYLLABLE IT..YI SYLLABLE E 858 | | 0xEA 0x80 0x95 #Lm YI SYLLABLE WU 859 | | 0xEA 0x80 0x96..0xFF #Lo [1143] YI SYLLABLE BIT..YI SYLLABLE YYR 860 | | 0xEA 0x81..0x91 0x00..0xFF # 861 | | 0xEA 0x92 0x00..0x8C # 862 | | 0xEA 0x93 0x90..0xB7 #Lo [40] LISU LETTER BA..LISU LETTER OE 863 | | 0xEA 0x93 0xB8..0xBD #Lm [6] LISU LETTER TONE MYA TI..LISU LETT... 864 | | 0xEA 0x94 0x80..0xFF #Lo [268] VAI SYLLABLE EE..VAI SYLLABLE NG 865 | | 0xEA 0x95..0x97 0x00..0xFF # 866 | | 0xEA 0x98 0x00..0x8B # 867 | | 0xEA 0x98 0x8C #Lm VAI SYLLABLE LENGTHENER 868 | | 0xEA 0x98 0x90..0x9F #Lo [16] VAI SYLLABLE NDOLE FA..VAI SYMBOL ... 869 | | 0xEA 0x98 0xAA..0xAB #Lo [2] VAI SYLLABLE NDOLE MA..VAI SYLLABL... 870 | | 0xEA 0x99 0x80..0xAD #L& [46] CYRILLIC CAPITAL LETTER ZEMLYA..CY... 
871 | | 0xEA 0x99 0xAE #Lo CYRILLIC LETTER MULTIOCULAR O 872 | | 0xEA 0x99 0xBF #Lm CYRILLIC PAYEROK 873 | | 0xEA 0x9A 0x80..0x9B #L& [28] CYRILLIC CAPITAL LETTER DWE..CYRIL... 874 | | 0xEA 0x9A 0x9C..0x9D #Lm [2] MODIFIER LETTER CYRILLIC HARD SIGN... 875 | | 0xEA 0x9A 0xA0..0xFF #Lo [70] BAMUM LETTER A..BAMUM LETTER KI 876 | | 0xEA 0x9B 0x00..0xA5 # 877 | | 0xEA 0x9B 0xA6..0xAF #Nl [10] BAMUM LETTER MO..BAMUM LETTER KOGHOM 878 | | 0xEA 0x9C 0x97..0x9F #Lm [9] MODIFIER LETTER DOT VERTICAL BAR..... 879 | | 0xEA 0x9C 0xA2..0xFF #L& [78] LATIN CAPITAL LETTER EGYPTOLOGICAL... 880 | | 0xEA 0x9D 0x00..0xAF # 881 | | 0xEA 0x9D 0xB0 #Lm MODIFIER LETTER US 882 | | 0xEA 0x9D 0xB1..0xFF #L& [23] LATIN SMALL LETTER DUM..LATIN SMAL... 883 | | 0xEA 0x9E 0x00..0x87 # 884 | | 0xEA 0x9E 0x88 #Lm MODIFIER LETTER LOW CIRCUMFLEX ACCENT 885 | | 0xEA 0x9E 0x8B..0x8E #L& [4] LATIN CAPITAL LETTER SALTILLO..LAT... 886 | | 0xEA 0x9E 0x8F #Lo LATIN LETTER SINOLOGICAL DOT 887 | | 0xEA 0x9E 0x90..0xAD #L& [30] LATIN CAPITAL LETTER N WITH DESCEN... 888 | | 0xEA 0x9E 0xB0..0xB7 #L& [8] LATIN CAPITAL LETTER TURNED K..LAT... 889 | | 0xEA 0x9F 0xB7 #Lo LATIN EPIGRAPHIC LETTER SIDEWAYS I 890 | | 0xEA 0x9F 0xB8..0xB9 #Lm [2] MODIFIER LETTER CAPITAL H WITH STR... 891 | | 0xEA 0x9F 0xBA #L& LATIN LETTER SMALL CAPITAL TURNED M 892 | | 0xEA 0x9F 0xBB..0xFF #Lo [7] LATIN EPIGRAPHIC LETTER REVERSED F... 893 | | 0xEA 0xA0 0x00..0x81 # 894 | | 0xEA 0xA0 0x83..0x85 #Lo [3] SYLOTI NAGRI LETTER U..SYLOTI NAGR... 895 | | 0xEA 0xA0 0x87..0x8A #Lo [4] SYLOTI NAGRI LETTER KO..SYLOTI NAG... 896 | | 0xEA 0xA0 0x8C..0xA2 #Lo [23] SYLOTI NAGRI LETTER CO..SYLOTI NAG... 897 | | 0xEA 0xA1 0x80..0xB3 #Lo [52] PHAGS-PA LETTER KA..PHAGS-PA LETTE... 898 | | 0xEA 0xA2 0x82..0xB3 #Lo [50] SAURASHTRA LETTER A..SAURASHTRA LE... 899 | | 0xEA 0xA3 0xB2..0xB7 #Lo [6] DEVANAGARI SIGN SPACING CANDRABIND... 
900 | | 0xEA 0xA3 0xBB #Lo DEVANAGARI HEADSTROKE 901 | | 0xEA 0xA3 0xBD #Lo DEVANAGARI JAIN OM 902 | | 0xEA 0xA4 0x8A..0xA5 #Lo [28] KAYAH LI LETTER KA..KAYAH LI LETTE... 903 | | 0xEA 0xA4 0xB0..0xFF #Lo [23] REJANG LETTER KA..REJANG LETTER A 904 | | 0xEA 0xA5 0x00..0x86 # 905 | | 0xEA 0xA5 0xA0..0xBC #Lo [29] HANGUL CHOSEONG TIKEUT-MIEUM..HANG... 906 | | 0xEA 0xA6 0x84..0xB2 #Lo [47] JAVANESE LETTER A..JAVANESE LETTER HA 907 | | 0xEA 0xA7 0x8F #Lm JAVANESE PANGRANGKEP 908 | | 0xEA 0xA8 0x80..0xA8 #Lo [41] CHAM LETTER A..CHAM LETTER HA 909 | | 0xEA 0xA9 0x80..0x82 #Lo [3] CHAM LETTER FINAL K..CHAM LETTER F... 910 | | 0xEA 0xA9 0x84..0x8B #Lo [8] CHAM LETTER FINAL CH..CHAM LETTER ... 911 | | 0xEA 0xAB 0xA0..0xAA #Lo [11] MEETEI MAYEK LETTER E..MEETEI MAYE... 912 | | 0xEA 0xAB 0xB2 #Lo MEETEI MAYEK ANJI 913 | | 0xEA 0xAB 0xB3..0xB4 #Lm [2] MEETEI MAYEK SYLLABLE REPETITION M... 914 | | 0xEA 0xAC 0x81..0x86 #Lo [6] ETHIOPIC SYLLABLE TTHU..ETHIOPIC S... 915 | | 0xEA 0xAC 0x89..0x8E #Lo [6] ETHIOPIC SYLLABLE DDHU..ETHIOPIC S... 916 | | 0xEA 0xAC 0x91..0x96 #Lo [6] ETHIOPIC SYLLABLE DZU..ETHIOPIC SY... 917 | | 0xEA 0xAC 0xA0..0xA6 #Lo [7] ETHIOPIC SYLLABLE CCHHA..ETHIOPIC ... 918 | | 0xEA 0xAC 0xA8..0xAE #Lo [7] ETHIOPIC SYLLABLE BBA..ETHIOPIC SY... 919 | | 0xEA 0xAC 0xB0..0xFF #L& [43] LATIN SMALL LETTER BARRED ALPHA..L... 920 | | 0xEA 0xAD 0x00..0x9A # 921 | | 0xEA 0xAD 0x9C..0x9F #Lm [4] MODIFIER LETTER SMALL HENG..MODIFI... 922 | | 0xEA 0xAD 0xA0..0xA5 #L& [6] LATIN SMALL LETTER SAKHA YAT..GREE... 923 | | 0xEA 0xAD 0xB0..0xFF #L& [80] CHEROKEE SMALL LETTER A..CHEROKEE ... 924 | | 0xEA 0xAE 0x00..0xBF # 925 | | 0xEA 0xAF 0x80..0xA2 #Lo [35] MEETEI MAYEK LETTER KOK..MEETEI MA... 926 | | 0xEA 0xB0 0x80..0xFF #Lo [11172] HANGUL SYLLABLE GA..HA... 
927 | | 0xEA 0xB1..0xFF 0x00..0xFF # 928 | | 0xEB..0xEC 0x00..0xFF 0x00..0xFF # 929 | | 0xED 0x00 0x00..0xFF # 930 | | 0xED 0x01..0x9D 0x00..0xFF # 931 | | 0xED 0x9E 0x00..0xA3 # 932 | | 0xED 0x9E 0xB0..0xFF #Lo [23] HANGUL JUNGSEONG O-YEO..HANGUL JUN... 933 | | 0xED 0x9F 0x00..0x86 # 934 | | 0xED 0x9F 0x8B..0xBB #Lo [49] HANGUL JONGSEONG NIEUN-RIEUL..HANG... 935 | | 0xEF 0xAC 0x80..0x86 #L& [7] LATIN SMALL LIGATURE FF..LATIN SMA... 936 | | 0xEF 0xAC 0x93..0x97 #L& [5] ARMENIAN SMALL LIGATURE MEN NOW..A... 937 | | 0xEF 0xAD 0x90..0xFF #Lo [98] ARABIC LETTER ALEF WASLA ISOLATED ... 938 | | 0xEF 0xAE 0x00..0xB1 # 939 | | 0xEF 0xAF 0x93..0xFF #Lo [363] ARABIC LETTER NG ISOLATED FORM... 940 | | 0xEF 0xB0..0xB3 0x00..0xFF # 941 | | 0xEF 0xB4 0x00..0xBD # 942 | | 0xEF 0xB5 0x90..0xFF #Lo [64] ARABIC LIGATURE TEH WITH JEEM WITH... 943 | | 0xEF 0xB6 0x00..0x8F # 944 | | 0xEF 0xB6 0x92..0xFF #Lo [54] ARABIC LIGATURE MEEM WITH JEEM WIT... 945 | | 0xEF 0xB7 0x00..0x87 # 946 | | 0xEF 0xB7 0xB0..0xBB #Lo [12] ARABIC LIGATURE SALLA USED AS KORA... 947 | | 0xEF 0xB9 0xB0..0xB4 #Lo [5] ARABIC FATHATAN ISOLATED FORM..ARA... 948 | | 0xEF 0xB9 0xB6..0xFF #Lo [135] ARABIC FATHA ISOLATED FORM..AR... 949 | | 0xEF 0xBA..0xBA 0x00..0xFF # 950 | | 0xEF 0xBB 0x00..0xBC # 951 | | 0xEF 0xBC 0xA1..0xBA #L& [26] FULLWIDTH LATIN CAPITAL LETTER A..... 952 | | 0xEF 0xBD 0x81..0x9A #L& [26] FULLWIDTH LATIN SMALL LETTER A..FU... 953 | | 0xEF 0xBE 0xA0..0xBE #Lo [31] HALFWIDTH HANGUL FILLER..HALFWIDTH... 954 | | 0xEF 0xBF 0x82..0x87 #Lo [6] HALFWIDTH HANGUL LETTER A..HALFWID... 955 | | 0xEF 0xBF 0x8A..0x8F #Lo [6] HALFWIDTH HANGUL LETTER YEO..HALFW... 956 | | 0xEF 0xBF 0x92..0x97 #Lo [6] HALFWIDTH HANGUL LETTER YO..HALFWI... 957 | | 0xEF 0xBF 0x9A..0x9C #Lo [3] HALFWIDTH HANGUL LETTER EU..HALFWI... 958 | | 0xF0 0x90 0x80 0x80..0x8B #Lo [12] LINEAR B SYLLABLE B008 A..LINEA... 959 | | 0xF0 0x90 0x80 0x8D..0xA6 #Lo [26] LINEAR B SYLLABLE B036 JO..LINE... 
960 | | 0xF0 0x90 0x80 0xA8..0xBA #Lo [19] LINEAR B SYLLABLE B060 RA..LINE... 961 | | 0xF0 0x90 0x80 0xBC..0xBD #Lo [2] LINEAR B SYLLABLE B017 ZA..LINE... 962 | | 0xF0 0x90 0x80 0xBF..0xFF #Lo [15] LINEAR B SYLLABLE B020 ZO..LINE... 963 | | 0xF0 0x90 0x81 0x00..0x8D # 964 | | 0xF0 0x90 0x81 0x90..0x9D #Lo [14] LINEAR B SYMBOL B018..LINEAR B ... 965 | | 0xF0 0x90 0x82 0x80..0xFF #Lo [123] LINEAR B IDEOGRAM B100 MAN..LIN... 966 | | 0xF0 0x90 0x83 0x00..0xBA # 967 | | 0xF0 0x90 0x85 0x80..0xB4 #Nl [53] GREEK ACROPHONIC ATTIC ONE QUAR... 968 | | 0xF0 0x90 0x8A 0x80..0x9C #Lo [29] LYCIAN LETTER A..LYCIAN LETTER X 969 | | 0xF0 0x90 0x8A 0xA0..0xFF #Lo [49] CARIAN LETTER A..CARIAN LETTER ... 970 | | 0xF0 0x90 0x8B 0x00..0x90 # 971 | | 0xF0 0x90 0x8C 0x80..0x9F #Lo [32] OLD ITALIC LETTER A..OLD ITALIC... 972 | | 0xF0 0x90 0x8C 0xB0..0xFF #Lo [17] GOTHIC LETTER AHSA..GOTHIC LETT... 973 | | 0xF0 0x90 0x8D 0x00..0x80 # 974 | | 0xF0 0x90 0x8D 0x81 #Nl GOTHIC LETTER NINETY 975 | | 0xF0 0x90 0x8D 0x82..0x89 #Lo [8] GOTHIC LETTER RAIDA..GOTHIC LET... 976 | | 0xF0 0x90 0x8D 0x8A #Nl GOTHIC LETTER NINE HUNDRED 977 | | 0xF0 0x90 0x8D 0x90..0xB5 #Lo [38] OLD PERMIC LETTER AN..OLD PERMI... 978 | | 0xF0 0x90 0x8E 0x80..0x9D #Lo [30] UGARITIC LETTER ALPA..UGARITIC ... 979 | | 0xF0 0x90 0x8E 0xA0..0xFF #Lo [36] OLD PERSIAN SIGN A..OLD PERSIAN... 980 | | 0xF0 0x90 0x8F 0x00..0x83 # 981 | | 0xF0 0x90 0x8F 0x88..0x8F #Lo [8] OLD PERSIAN SIGN AURAMAZDAA..OL... 982 | | 0xF0 0x90 0x8F 0x91..0x95 #Nl [5] OLD PERSIAN NUMBER ONE..OLD PER... 983 | | 0xF0 0x90 0x90 0x80..0xFF #L& [80] DESERET CAPITAL LETTER LONG I..... 984 | | 0xF0 0x90 0x91 0x00..0x8F # 985 | | 0xF0 0x90 0x91 0x90..0xFF #Lo [78] SHAVIAN LETTER PEEP..OSMANYA LE... 986 | | 0xF0 0x90 0x92 0x00..0x9D # 987 | | 0xF0 0x90 0x94 0x80..0xA7 #Lo [40] ELBASAN LETTER A..ELBASAN LETTE... 988 | | 0xF0 0x90 0x94 0xB0..0xFF #Lo [52] CAUCASIAN ALBANIAN LETTER ALT..... 
989 | | 0xF0 0x90 0x95 0x00..0xA3 # 990 | | 0xF0 0x90 0x98 0x80..0xFF #Lo [311] LINEAR A SIGN AB001..LINE... 991 | | 0xF0 0x90 0x99..0x9B 0x00..0xFF # 992 | | 0xF0 0x90 0x9C 0x00..0xB6 # 993 | | 0xF0 0x90 0x9D 0x80..0x95 #Lo [22] LINEAR A SIGN A701 A..LINEAR A ... 994 | | 0xF0 0x90 0x9D 0xA0..0xA7 #Lo [8] LINEAR A SIGN A800..LINEAR A SI... 995 | | 0xF0 0x90 0xA0 0x80..0x85 #Lo [6] CYPRIOT SYLLABLE A..CYPRIOT SYL... 996 | | 0xF0 0x90 0xA0 0x88 #Lo CYPRIOT SYLLABLE JO 997 | | 0xF0 0x90 0xA0 0x8A..0xB5 #Lo [44] CYPRIOT SYLLABLE KA..CYPRIOT SY... 998 | | 0xF0 0x90 0xA0 0xB7..0xB8 #Lo [2] CYPRIOT SYLLABLE XA..CYPRIOT SY... 999 | | 0xF0 0x90 0xA0 0xBC #Lo CYPRIOT SYLLABLE ZA 1000 | | 0xF0 0x90 0xA0 0xBF..0xFF #Lo [23] CYPRIOT SYLLABLE ZO..IMPERIAL A... 1001 | | 0xF0 0x90 0xA1 0x00..0x95 # 1002 | | 0xF0 0x90 0xA1 0xA0..0xB6 #Lo [23] PALMYRENE LETTER ALEPH..PALMYRE... 1003 | | 0xF0 0x90 0xA2 0x80..0x9E #Lo [31] NABATAEAN LETTER FINAL ALEPH..N... 1004 | | 0xF0 0x90 0xA3 0xA0..0xB2 #Lo [19] HATRAN LETTER ALEPH..HATRAN LET... 1005 | | 0xF0 0x90 0xA3 0xB4..0xB5 #Lo [2] HATRAN LETTER SHIN..HATRAN LETT... 1006 | | 0xF0 0x90 0xA4 0x80..0x95 #Lo [22] PHOENICIAN LETTER ALF..PHOENICI... 1007 | | 0xF0 0x90 0xA4 0xA0..0xB9 #Lo [26] LYDIAN LETTER A..LYDIAN LETTER C 1008 | | 0xF0 0x90 0xA6 0x80..0xB7 #Lo [56] MEROITIC HIEROGLYPHIC LETTER A.... 1009 | | 0xF0 0x90 0xA6 0xBE..0xBF #Lo [2] MEROITIC CURSIVE LOGOGRAM RMT..... 1010 | | 0xF0 0x90 0xA8 0x80 #Lo KHAROSHTHI LETTER A 1011 | | 0xF0 0x90 0xA8 0x90..0x93 #Lo [4] KHAROSHTHI LETTER KA..KHAROSHTH... 1012 | | 0xF0 0x90 0xA8 0x95..0x97 #Lo [3] KHAROSHTHI LETTER CA..KHAROSHTH... 1013 | | 0xF0 0x90 0xA8 0x99..0xB3 #Lo [27] KHAROSHTHI LETTER NYA..KHAROSHT... 1014 | | 0xF0 0x90 0xA9 0xA0..0xBC #Lo [29] OLD SOUTH ARABIAN LETTER HE..OL... 1015 | | 0xF0 0x90 0xAA 0x80..0x9C #Lo [29] OLD NORTH ARABIAN LETTER HEH..O... 1016 | | 0xF0 0x90 0xAB 0x80..0x87 #Lo [8] MANICHAEAN LETTER ALEPH..MANICH... 
1017 | | 0xF0 0x90 0xAB 0x89..0xA4 #Lo [28] MANICHAEAN LETTER ZAYIN..MANICH... 1018 | | 0xF0 0x90 0xAC 0x80..0xB5 #Lo [54] AVESTAN LETTER A..AVESTAN LETTE... 1019 | | 0xF0 0x90 0xAD 0x80..0x95 #Lo [22] INSCRIPTIONAL PARTHIAN LETTER A... 1020 | | 0xF0 0x90 0xAD 0xA0..0xB2 #Lo [19] INSCRIPTIONAL PAHLAVI LETTER AL... 1021 | | 0xF0 0x90 0xAE 0x80..0x91 #Lo [18] PSALTER PAHLAVI LETTER ALEPH..P... 1022 | | 0xF0 0x90 0xB0 0x80..0xFF #Lo [73] OLD TURKIC LETTER ORKHON A..OLD... 1023 | | 0xF0 0x90 0xB1 0x00..0x88 # 1024 | | 0xF0 0x90 0xB2 0x80..0xB2 #L& [51] OLD HUNGARIAN CAPITAL LETTER A.... 1025 | | 0xF0 0x90 0xB3 0x80..0xB2 #L& [51] OLD HUNGARIAN SMALL LETTER A..O... 1026 | | 0xF0 0x91 0x80 0x83..0xB7 #Lo [53] BRAHMI SIGN JIHVAMULIYA..BRAHMI... 1027 | | 0xF0 0x91 0x82 0x83..0xAF #Lo [45] KAITHI LETTER A..KAITHI LETTER HA 1028 | | 0xF0 0x91 0x83 0x90..0xA8 #Lo [25] SORA SOMPENG LETTER SAH..SORA S... 1029 | | 0xF0 0x91 0x84 0x83..0xA6 #Lo [36] CHAKMA LETTER AA..CHAKMA LETTER... 1030 | | 0xF0 0x91 0x85 0x90..0xB2 #Lo [35] MAHAJANI LETTER A..MAHAJANI LET... 1031 | | 0xF0 0x91 0x85 0xB6 #Lo MAHAJANI LIGATURE SHRI 1032 | | 0xF0 0x91 0x86 0x83..0xB2 #Lo [48] SHARADA LETTER A..SHARADA LETTE... 1033 | | 0xF0 0x91 0x87 0x81..0x84 #Lo [4] SHARADA SIGN AVAGRAHA..SHARADA OM 1034 | | 0xF0 0x91 0x87 0x9A #Lo SHARADA EKAM 1035 | | 0xF0 0x91 0x87 0x9C #Lo SHARADA HEADSTROKE 1036 | | 0xF0 0x91 0x88 0x80..0x91 #Lo [18] KHOJKI LETTER A..KHOJKI LETTER JJA 1037 | | 0xF0 0x91 0x88 0x93..0xAB #Lo [25] KHOJKI LETTER NYA..KHOJKI LETTE... 1038 | | 0xF0 0x91 0x8A 0x80..0x86 #Lo [7] MULTANI LETTER A..MULTANI LETTE... 1039 | | 0xF0 0x91 0x8A 0x88 #Lo MULTANI LETTER GHA 1040 | | 0xF0 0x91 0x8A 0x8A..0x8D #Lo [4] MULTANI LETTER CA..MULTANI LETT... 1041 | | 0xF0 0x91 0x8A 0x8F..0x9D #Lo [15] MULTANI LETTER NYA..MULTANI LET... 1042 | | 0xF0 0x91 0x8A 0x9F..0xA8 #Lo [10] MULTANI LETTER BHA..MULTANI LET... 1043 | | 0xF0 0x91 0x8A 0xB0..0xFF #Lo [47] KHUDAWADI LETTER A..KHUDAWADI L... 
1044 | | 0xF0 0x91 0x8B 0x00..0x9E # 1045 | | 0xF0 0x91 0x8C 0x85..0x8C #Lo [8] GRANTHA LETTER A..GRANTHA LETTE... 1046 | | 0xF0 0x91 0x8C 0x8F..0x90 #Lo [2] GRANTHA LETTER EE..GRANTHA LETT... 1047 | | 0xF0 0x91 0x8C 0x93..0xA8 #Lo [22] GRANTHA LETTER OO..GRANTHA LETT... 1048 | | 0xF0 0x91 0x8C 0xAA..0xB0 #Lo [7] GRANTHA LETTER PA..GRANTHA LETT... 1049 | | 0xF0 0x91 0x8C 0xB2..0xB3 #Lo [2] GRANTHA LETTER LA..GRANTHA LETT... 1050 | | 0xF0 0x91 0x8C 0xB5..0xB9 #Lo [5] GRANTHA LETTER VA..GRANTHA LETT... 1051 | | 0xF0 0x91 0x8C 0xBD #Lo GRANTHA SIGN AVAGRAHA 1052 | | 0xF0 0x91 0x8D 0x90 #Lo GRANTHA OM 1053 | | 0xF0 0x91 0x8D 0x9D..0xA1 #Lo [5] GRANTHA SIGN PLUTA..GRANTHA LET... 1054 | | 0xF0 0x91 0x92 0x80..0xAF #Lo [48] TIRHUTA ANJI..TIRHUTA LETTER HA 1055 | | 0xF0 0x91 0x93 0x84..0x85 #Lo [2] TIRHUTA SIGN AVAGRAHA..TIRHUTA ... 1056 | | 0xF0 0x91 0x93 0x87 #Lo TIRHUTA OM 1057 | | 0xF0 0x91 0x96 0x80..0xAE #Lo [47] SIDDHAM LETTER A..SIDDHAM LETTE... 1058 | | 0xF0 0x91 0x97 0x98..0x9B #Lo [4] SIDDHAM LETTER THREE-CIRCLE ALT... 1059 | | 0xF0 0x91 0x98 0x80..0xAF #Lo [48] MODI LETTER A..MODI LETTER LLA 1060 | | 0xF0 0x91 0x99 0x84 #Lo MODI SIGN HUVA 1061 | | 0xF0 0x91 0x9A 0x80..0xAA #Lo [43] TAKRI LETTER A..TAKRI LETTER RRA 1062 | | 0xF0 0x91 0xA2 0xA0..0xFF #L& [64] WARANG CITI CAPITAL LETTER NGAA... 1063 | | 0xF0 0x91 0xA3 0x00..0x9F # 1064 | | 0xF0 0x91 0xA3 0xBF #Lo WARANG CITI OM 1065 | | 0xF0 0x91 0xAB 0x80..0xB8 #Lo [57] PAU CIN HAU LETTER PA..PAU CIN ... 1066 | | 0xF0 0x92 0x80 0x80..0xFF #Lo [922] CUNEIFORM SIGN A..CUNEIFO... 1067 | | 0xF0 0x92 0x81..0x8D 0x00..0xFF # 1068 | | 0xF0 0x92 0x8E 0x00..0x99 # 1069 | | 0xF0 0x92 0x90 0x80..0xFF #Nl [111] CUNEIFORM NUMERIC SIGN TWO ASH.... 1070 | | 0xF0 0x92 0x91 0x00..0xAE # 1071 | | 0xF0 0x92 0x92 0x80..0xFF #Lo [196] CUNEIFORM SIGN AB TIMES N... 1072 | | 0xF0 0x92 0x93..0x94 0x00..0xFF # 1073 | | 0xF0 0x92 0x95 0x00..0x83 # 1074 | | 0xF0 0x93 0x80 0x80..0xFF #Lo [1071] EGYPTIAN HIEROGLYPH A001... 
1075 | | 0xF0 0x93 0x81..0x8F 0x00..0xFF # 1076 | | 0xF0 0x93 0x90 0x00..0xAE # 1077 | | 0xF0 0x94 0x90 0x80..0xFF #Lo [583] ANATOLIAN HIEROGLYPH A001... 1078 | | 0xF0 0x94 0x91..0x98 0x00..0xFF # 1079 | | 0xF0 0x94 0x99 0x00..0x86 # 1080 | | 0xF0 0x96 0xA0 0x80..0xFF #Lo [569] BAMUM LETTER PHASE-A NGKU... 1081 | | 0xF0 0x96 0xA1..0xA7 0x00..0xFF # 1082 | | 0xF0 0x96 0xA8 0x00..0xB8 # 1083 | | 0xF0 0x96 0xA9 0x80..0x9E #Lo [31] MRO LETTER TA..MRO LETTER TEK 1084 | | 0xF0 0x96 0xAB 0x90..0xAD #Lo [30] BASSA VAH LETTER ENNI..BASSA VA... 1085 | | 0xF0 0x96 0xAC 0x80..0xAF #Lo [48] PAHAWH HMONG VOWEL KEEB..PAHAWH... 1086 | | 0xF0 0x96 0xAD 0x80..0x83 #Lm [4] PAHAWH HMONG SIGN VOS SEEV..PAH... 1087 | | 0xF0 0x96 0xAD 0xA3..0xB7 #Lo [21] PAHAWH HMONG SIGN VOS LUB..PAHA... 1088 | | 0xF0 0x96 0xAD 0xBD..0xFF #Lo [19] PAHAWH HMONG CLAN SIGN TSHEEJ..... 1089 | | 0xF0 0x96 0xAE 0x00..0x8F # 1090 | | 0xF0 0x96 0xBC 0x80..0xFF #Lo [69] MIAO LETTER PA..MIAO LETTER HHA 1091 | | 0xF0 0x96 0xBD 0x00..0x84 # 1092 | | 0xF0 0x96 0xBD 0x90 #Lo MIAO LETTER NASALIZATION 1093 | | 0xF0 0x96 0xBE 0x93..0x9F #Lm [13] MIAO LETTER TONE-2..MIAO LETTER... 1094 | | 0xF0 0x9B 0xB0 0x80..0xFF #Lo [107] DUPLOYAN LETTER H..DUPLOYAN LET... 1095 | | 0xF0 0x9B 0xB1 0x00..0xAA # 1096 | | 0xF0 0x9B 0xB1 0xB0..0xBC #Lo [13] DUPLOYAN AFFIX LEFT HORIZONTAL ... 1097 | | 0xF0 0x9B 0xB2 0x80..0x88 #Lo [9] DUPLOYAN AFFIX HIGH ACUTE..DUPL... 1098 | | 0xF0 0x9B 0xB2 0x90..0x99 #Lo [10] DUPLOYAN AFFIX LOW ACUTE..DUPLO... 1099 | | 0xF0 0x9D 0x90 0x80..0xFF #L& [85] MATHEMATICAL BOLD CAPITAL A..MA... 1100 | | 0xF0 0x9D 0x91 0x00..0x94 # 1101 | | 0xF0 0x9D 0x91 0x96..0xFF #L& [71] MATHEMATICAL ITALIC SMALL I..MA... 1102 | | 0xF0 0x9D 0x92 0x00..0x9C # 1103 | | 0xF0 0x9D 0x92 0x9E..0x9F #L& [2] MATHEMATICAL SCRIPT CAPITAL C..... 1104 | | 0xF0 0x9D 0x92 0xA2 #L& MATHEMATICAL SCRIPT CAPITAL G 1105 | | 0xF0 0x9D 0x92 0xA5..0xA6 #L& [2] MATHEMATICAL SCRIPT CAPITAL J..... 
1106 | | 0xF0 0x9D 0x92 0xA9..0xAC #L& [4] MATHEMATICAL SCRIPT CAPITAL N..... 1107 | | 0xF0 0x9D 0x92 0xAE..0xB9 #L& [12] MATHEMATICAL SCRIPT CAPITAL S..... 1108 | | 0xF0 0x9D 0x92 0xBB #L& MATHEMATICAL SCRIPT SMALL F 1109 | | 0xF0 0x9D 0x92 0xBD..0xFF #L& [7] MATHEMATICAL SCRIPT SMALL H..MA... 1110 | | 0xF0 0x9D 0x93 0x00..0x83 # 1111 | | 0xF0 0x9D 0x93 0x85..0xFF #L& [65] MATHEMATICAL SCRIPT SMALL P..MA... 1112 | | 0xF0 0x9D 0x94 0x00..0x85 # 1113 | | 0xF0 0x9D 0x94 0x87..0x8A #L& [4] MATHEMATICAL FRAKTUR CAPITAL D.... 1114 | | 0xF0 0x9D 0x94 0x8D..0x94 #L& [8] MATHEMATICAL FRAKTUR CAPITAL J.... 1115 | | 0xF0 0x9D 0x94 0x96..0x9C #L& [7] MATHEMATICAL FRAKTUR CAPITAL S.... 1116 | | 0xF0 0x9D 0x94 0x9E..0xB9 #L& [28] MATHEMATICAL FRAKTUR SMALL A..M... 1117 | | 0xF0 0x9D 0x94 0xBB..0xBE #L& [4] MATHEMATICAL DOUBLE-STRUCK CAPI... 1118 | | 0xF0 0x9D 0x95 0x80..0x84 #L& [5] MATHEMATICAL DOUBLE-STRUCK CAPI... 1119 | | 0xF0 0x9D 0x95 0x86 #L& MATHEMATICAL DOUBLE-STRUCK CAPITAL O 1120 | | 0xF0 0x9D 0x95 0x8A..0x90 #L& [7] MATHEMATICAL DOUBLE-STRUCK CAPI... 1121 | | 0xF0 0x9D 0x95 0x92..0xFF #L& [340] MATHEMATICAL DOUBLE-STRUC... 1122 | | 0xF0 0x9D 0x96..0x99 0x00..0xFF # 1123 | | 0xF0 0x9D 0x9A 0x00..0xA5 # 1124 | | 0xF0 0x9D 0x9A 0xA8..0xFF #L& [25] MATHEMATICAL BOLD CAPITAL ALPHA... 1125 | | 0xF0 0x9D 0x9B 0x00..0x80 # 1126 | | 0xF0 0x9D 0x9B 0x82..0x9A #L& [25] MATHEMATICAL BOLD SMALL ALPHA..... 1127 | | 0xF0 0x9D 0x9B 0x9C..0xBA #L& [31] MATHEMATICAL BOLD EPSILON SYMBO... 1128 | | 0xF0 0x9D 0x9B 0xBC..0xFF #L& [25] MATHEMATICAL ITALIC SMALL ALPHA... 1129 | | 0xF0 0x9D 0x9C 0x00..0x94 # 1130 | | 0xF0 0x9D 0x9C 0x96..0xB4 #L& [31] MATHEMATICAL ITALIC EPSILON SYM... 1131 | | 0xF0 0x9D 0x9C 0xB6..0xFF #L& [25] MATHEMATICAL BOLD ITALIC SMALL ... 1132 | | 0xF0 0x9D 0x9D 0x00..0x8E # 1133 | | 0xF0 0x9D 0x9D 0x90..0xAE #L& [31] MATHEMATICAL BOLD ITALIC EPSILO... 1134 | | 0xF0 0x9D 0x9D 0xB0..0xFF #L& [25] MATHEMATICAL SANS-SERIF BOLD SM... 
1135 | | 0xF0 0x9D 0x9E 0x00..0x88 # 1136 | | 0xF0 0x9D 0x9E 0x8A..0xA8 #L& [31] MATHEMATICAL SANS-SERIF BOLD EP... 1137 | | 0xF0 0x9D 0x9E 0xAA..0xFF #L& [25] MATHEMATICAL SANS-SERIF BOLD IT... 1138 | | 0xF0 0x9D 0x9F 0x00..0x82 # 1139 | | 0xF0 0x9D 0x9F 0x84..0x8B #L& [8] MATHEMATICAL SANS-SERIF BOLD IT... 1140 | | 0xF0 0x9E 0xA0 0x80..0xFF #Lo [197] MENDE KIKAKUI SYLLABLE M0... 1141 | | 0xF0 0x9E 0xA1..0xA2 0x00..0xFF # 1142 | | 0xF0 0x9E 0xA3 0x00..0x84 # 1143 | | 0xF0 0x9E 0xB8 0x80..0x83 #Lo [4] ARABIC MATHEMATICAL ALEF..ARABI... 1144 | | 0xF0 0x9E 0xB8 0x85..0x9F #Lo [27] ARABIC MATHEMATICAL WAW..ARABIC... 1145 | | 0xF0 0x9E 0xB8 0xA1..0xA2 #Lo [2] ARABIC MATHEMATICAL INITIAL BEH... 1146 | | 0xF0 0x9E 0xB8 0xA4 #Lo ARABIC MATHEMATICAL INITIAL HEH 1147 | | 0xF0 0x9E 0xB8 0xA7 #Lo ARABIC MATHEMATICAL INITIAL HAH 1148 | | 0xF0 0x9E 0xB8 0xA9..0xB2 #Lo [10] ARABIC MATHEMATICAL INITIAL YEH... 1149 | | 0xF0 0x9E 0xB8 0xB4..0xB7 #Lo [4] ARABIC MATHEMATICAL INITIAL SHE... 1150 | | 0xF0 0x9E 0xB8 0xB9 #Lo ARABIC MATHEMATICAL INITIAL DAD 1151 | | 0xF0 0x9E 0xB8 0xBB #Lo ARABIC MATHEMATICAL INITIAL GHAIN 1152 | | 0xF0 0x9E 0xB9 0x82 #Lo ARABIC MATHEMATICAL TAILED JEEM 1153 | | 0xF0 0x9E 0xB9 0x87 #Lo ARABIC MATHEMATICAL TAILED HAH 1154 | | 0xF0 0x9E 0xB9 0x89 #Lo ARABIC MATHEMATICAL TAILED YEH 1155 | | 0xF0 0x9E 0xB9 0x8B #Lo ARABIC MATHEMATICAL TAILED LAM 1156 | | 0xF0 0x9E 0xB9 0x8D..0x8F #Lo [3] ARABIC MATHEMATICAL TAILED NOON... 1157 | | 0xF0 0x9E 0xB9 0x91..0x92 #Lo [2] ARABIC MATHEMATICAL TAILED SAD.... 1158 | | 0xF0 0x9E 0xB9 0x94 #Lo ARABIC MATHEMATICAL TAILED SHEEN 1159 | | 0xF0 0x9E 0xB9 0x97 #Lo ARABIC MATHEMATICAL TAILED KHAH 1160 | | 0xF0 0x9E 0xB9 0x99 #Lo ARABIC MATHEMATICAL TAILED DAD 1161 | | 0xF0 0x9E 0xB9 0x9B #Lo ARABIC MATHEMATICAL TAILED GHAIN 1162 | | 0xF0 0x9E 0xB9 0x9D #Lo ARABIC MATHEMATICAL TAILED DOTLESS... 1163 | | 0xF0 0x9E 0xB9 0x9F #Lo ARABIC MATHEMATICAL TAILED DOTLESS... 
1164 | | 0xF0 0x9E 0xB9 0xA1..0xA2 #Lo [2] ARABIC MATHEMATICAL STRETCHED B... 1165 | | 0xF0 0x9E 0xB9 0xA4 #Lo ARABIC MATHEMATICAL STRETCHED HEH 1166 | | 0xF0 0x9E 0xB9 0xA7..0xAA #Lo [4] ARABIC MATHEMATICAL STRETCHED H... 1167 | | 0xF0 0x9E 0xB9 0xAC..0xB2 #Lo [7] ARABIC MATHEMATICAL STRETCHED M... 1168 | | 0xF0 0x9E 0xB9 0xB4..0xB7 #Lo [4] ARABIC MATHEMATICAL STRETCHED S... 1169 | | 0xF0 0x9E 0xB9 0xB9..0xBC #Lo [4] ARABIC MATHEMATICAL STRETCHED D... 1170 | | 0xF0 0x9E 0xB9 0xBE #Lo ARABIC MATHEMATICAL STRETCHED DOTL... 1171 | | 0xF0 0x9E 0xBA 0x80..0x89 #Lo [10] ARABIC MATHEMATICAL LOOPED ALEF... 1172 | | 0xF0 0x9E 0xBA 0x8B..0x9B #Lo [17] ARABIC MATHEMATICAL LOOPED LAM.... 1173 | | 0xF0 0x9E 0xBA 0xA1..0xA3 #Lo [3] ARABIC MATHEMATICAL DOUBLE-STRU... 1174 | | 0xF0 0x9E 0xBA 0xA5..0xA9 #Lo [5] ARABIC MATHEMATICAL DOUBLE-STRU... 1175 | | 0xF0 0x9E 0xBA 0xAB..0xBB #Lo [17] ARABIC MATHEMATICAL DOUBLE-STRU... 1176 | | 0xF0 0x9F 0x84 0xB0..0xFF #So [26] SQUARED LATIN CAPITAL LETTER A.... 1177 | | 0xF0 0x9F 0x85 0x00..0x89 # 1178 | | 0xF0 0x9F 0x85 0x90..0xA9 #So [26] NEGATIVE CIRCLED LATIN CAPITAL ... 1179 | | 0xF0 0x9F 0x85 0xB0..0xFF #So [26] NEGATIVE SQUARED LATIN CAPITAL ... 
1180 | | 0xF0 0x9F 0x86 0x00..0x89 # 1181 | ; 1182 | 1183 | MidLetter = 1184 | 0x3A #Po COLON 1185 | | 0xC2 0xB7 #Po MIDDLE DOT 1186 | | 0xCB 0x97 #Sk MODIFIER LETTER MINUS SIGN 1187 | | 0xCE 0x87 #Po GREEK ANO TELEIA 1188 | | 0xD7 0xB4 #Po HEBREW PUNCTUATION GERSHAYIM 1189 | | 0xE2 0x80 0xA7 #Po HYPHENATION POINT 1190 | | 0xEF 0xB8 0x93 #Po PRESENTATION FORM FOR VERTICAL COLON 1191 | | 0xEF 0xB9 0x95 #Po SMALL COLON 1192 | | 0xEF 0xBC 0x9A #Po FULLWIDTH COLON 1193 | ; 1194 | 1195 | MidNum = 1196 | 0x2C #Po COMMA 1197 | | 0x3B #Po SEMICOLON 1198 | | 0xCD 0xBE #Po GREEK QUESTION MARK 1199 | | 0xD6 0x89 #Po ARMENIAN FULL STOP 1200 | | 0xD8 0x8C..0x8D #Po [2] ARABIC COMMA..ARABIC DATE SEPARATOR 1201 | | 0xD9 0xAC #Po ARABIC THOUSANDS SEPARATOR 1202 | | 0xDF 0xB8 #Po NKO COMMA 1203 | | 0xE2 0x81 0x84 #Sm FRACTION SLASH 1204 | | 0xEF 0xB8 0x90 #Po PRESENTATION FORM FOR VERTICAL COMMA 1205 | | 0xEF 0xB8 0x94 #Po PRESENTATION FORM FOR VERTICAL SEM... 1206 | | 0xEF 0xB9 0x90 #Po SMALL COMMA 1207 | | 0xEF 0xB9 0x94 #Po SMALL SEMICOLON 1208 | | 0xEF 0xBC 0x8C #Po FULLWIDTH COMMA 1209 | | 0xEF 0xBC 0x9B #Po FULLWIDTH SEMICOLON 1210 | ; 1211 | 1212 | MidNumLet = 1213 | 0x2E #Po FULL STOP 1214 | | 0xE2 0x80 0x98 #Pi LEFT SINGLE QUOTATION MARK 1215 | | 0xE2 0x80 0x99 #Pf RIGHT SINGLE QUOTATION MARK 1216 | | 0xE2 0x80 0xA4 #Po ONE DOT LEADER 1217 | | 0xEF 0xB9 0x92 #Po SMALL FULL STOP 1218 | | 0xEF 0xBC 0x87 #Po FULLWIDTH APOSTROPHE 1219 | | 0xEF 0xBC 0x8E #Po FULLWIDTH FULL STOP 1220 | ; 1221 | 1222 | Numeric = 1223 | 0x30..0x39 #Nd [10] DIGIT ZERO..DIGIT NINE 1224 | | 0xD9 0xA0..0xA9 #Nd [10] ARABIC-INDIC DIGIT ZERO..ARABIC-IN... 1225 | | 0xD9 0xAB #Po ARABIC DECIMAL SEPARATOR 1226 | | 0xDB 0xB0..0xB9 #Nd [10] EXTENDED ARABIC-INDIC DIGIT ZERO..... 1227 | | 0xDF 0x80..0x89 #Nd [10] NKO DIGIT ZERO..NKO DIGIT NINE 1228 | | 0xE0 0xA5 0xA6..0xAF #Nd [10] DEVANAGARI DIGIT ZERO..DEVANAGARI ... 1229 | | 0xE0 0xA7 0xA6..0xAF #Nd [10] BENGALI DIGIT ZERO..BENGALI DIGIT ... 
1230 | | 0xE0 0xA9 0xA6..0xAF #Nd [10] GURMUKHI DIGIT ZERO..GURMUKHI DIGI... 1231 | | 0xE0 0xAB 0xA6..0xAF #Nd [10] GUJARATI DIGIT ZERO..GUJARATI DIGI... 1232 | | 0xE0 0xAD 0xA6..0xAF #Nd [10] ORIYA DIGIT ZERO..ORIYA DIGIT NINE 1233 | | 0xE0 0xAF 0xA6..0xAF #Nd [10] TAMIL DIGIT ZERO..TAMIL DIGIT NINE 1234 | | 0xE0 0xB1 0xA6..0xAF #Nd [10] TELUGU DIGIT ZERO..TELUGU DIGIT NINE 1235 | | 0xE0 0xB3 0xA6..0xAF #Nd [10] KANNADA DIGIT ZERO..KANNADA DIGIT ... 1236 | | 0xE0 0xB5 0xA6..0xAF #Nd [10] MALAYALAM DIGIT ZERO..MALAYALAM DI... 1237 | | 0xE0 0xB7 0xA6..0xAF #Nd [10] SINHALA LITH DIGIT ZERO..SINHALA L... 1238 | | 0xE0 0xB9 0x90..0x99 #Nd [10] THAI DIGIT ZERO..THAI DIGIT NINE 1239 | | 0xE0 0xBB 0x90..0x99 #Nd [10] LAO DIGIT ZERO..LAO DIGIT NINE 1240 | | 0xE0 0xBC 0xA0..0xA9 #Nd [10] TIBETAN DIGIT ZERO..TIBETAN DIGIT ... 1241 | | 0xE1 0x81 0x80..0x89 #Nd [10] MYANMAR DIGIT ZERO..MYANMAR DIGIT ... 1242 | | 0xE1 0x82 0x90..0x99 #Nd [10] MYANMAR SHAN DIGIT ZERO..MYANMAR S... 1243 | | 0xE1 0x9F 0xA0..0xA9 #Nd [10] KHMER DIGIT ZERO..KHMER DIGIT NINE 1244 | | 0xE1 0xA0 0x90..0x99 #Nd [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DI... 1245 | | 0xE1 0xA5 0x86..0x8F #Nd [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE 1246 | | 0xE1 0xA7 0x90..0x99 #Nd [10] NEW TAI LUE DIGIT ZERO..NEW TAI LU... 1247 | | 0xE1 0xAA 0x80..0x89 #Nd [10] TAI THAM HORA DIGIT ZERO..TAI THAM... 1248 | | 0xE1 0xAA 0x90..0x99 #Nd [10] TAI THAM THAM DIGIT ZERO..TAI THAM... 1249 | | 0xE1 0xAD 0x90..0x99 #Nd [10] BALINESE DIGIT ZERO..BALINESE DIGI... 1250 | | 0xE1 0xAE 0xB0..0xB9 #Nd [10] SUNDANESE DIGIT ZERO..SUNDANESE DI... 1251 | | 0xE1 0xB1 0x80..0x89 #Nd [10] LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE 1252 | | 0xE1 0xB1 0x90..0x99 #Nd [10] OL CHIKI DIGIT ZERO..OL CHIKI DIGI... 1253 | | 0xEA 0x98 0xA0..0xA9 #Nd [10] VAI DIGIT ZERO..VAI DIGIT NINE 1254 | | 0xEA 0xA3 0x90..0x99 #Nd [10] SAURASHTRA DIGIT ZERO..SAURASHTRA ... 1255 | | 0xEA 0xA4 0x80..0x89 #Nd [10] KAYAH LI DIGIT ZERO..KAYAH LI DIGI... 
1256 | | 0xEA 0xA7 0x90..0x99 #Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGI... 1257 | | 0xEA 0xA7 0xB0..0xB9 #Nd [10] MYANMAR TAI LAING DIGIT ZERO..MYAN... 1258 | | 0xEA 0xA9 0x90..0x99 #Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE 1259 | | 0xEA 0xAF 0xB0..0xB9 #Nd [10] MEETEI MAYEK DIGIT ZERO..MEETEI MA... 1260 | | 0xF0 0x90 0x92 0xA0..0xA9 #Nd [10] OSMANYA DIGIT ZERO..OSMANYA DIG... 1261 | | 0xF0 0x91 0x81 0xA6..0xAF #Nd [10] BRAHMI DIGIT ZERO..BRAHMI DIGIT... 1262 | | 0xF0 0x91 0x83 0xB0..0xB9 #Nd [10] SORA SOMPENG DIGIT ZERO..SORA S... 1263 | | 0xF0 0x91 0x84 0xB6..0xBF #Nd [10] CHAKMA DIGIT ZERO..CHAKMA DIGIT... 1264 | | 0xF0 0x91 0x87 0x90..0x99 #Nd [10] SHARADA DIGIT ZERO..SHARADA DIG... 1265 | | 0xF0 0x91 0x8B 0xB0..0xB9 #Nd [10] KHUDAWADI DIGIT ZERO..KHUDAWADI... 1266 | | 0xF0 0x91 0x93 0x90..0x99 #Nd [10] TIRHUTA DIGIT ZERO..TIRHUTA DIG... 1267 | | 0xF0 0x91 0x99 0x90..0x99 #Nd [10] MODI DIGIT ZERO..MODI DIGIT NINE 1268 | | 0xF0 0x91 0x9B 0x80..0x89 #Nd [10] TAKRI DIGIT ZERO..TAKRI DIGIT NINE 1269 | | 0xF0 0x91 0x9C 0xB0..0xB9 #Nd [10] AHOM DIGIT ZERO..AHOM DIGIT NINE 1270 | | 0xF0 0x91 0xA3 0xA0..0xA9 #Nd [10] WARANG CITI DIGIT ZERO..WARANG ... 1271 | | 0xF0 0x96 0xA9 0xA0..0xA9 #Nd [10] MRO DIGIT ZERO..MRO DIGIT NINE 1272 | | 0xF0 0x96 0xAD 0x90..0x99 #Nd [10] PAHAWH HMONG DIGIT ZERO..PAHAWH... 1273 | | 0xF0 0x9D 0x9F 0x8E..0xBF #Nd [50] MATHEMATICAL BOLD DIGIT ZERO..M... 1274 | ; 1275 | 1276 | ExtendNumLet = 1277 | 0x5F #Pc LOW LINE 1278 | | 0xE2 0x80 0xBF..0xFF #Pc [2] UNDERTIE..CHARACTER TIE 1279 | | 0xE2 0x81 0x00..0x80 # 1280 | | 0xE2 0x81 0x94 #Pc INVERTED UNDERTIE 1281 | | 0xEF 0xB8 0xB3..0xB4 #Pc [2] PRESENTATION FORM FOR VERTICAL LOW... 1282 | | 0xEF 0xB9 0x8D..0x8F #Pc [3] DASHED LOW LINE..WAVY LOW LINE 1283 | | 0xEF 0xBC 0xBF #Pc FULLWIDTH LOW LINE 1284 | ; 1285 | 1286 | Regional_Indicator = 1287 | 0xF0 0x9F 0x87 0xA6..0xBF #So [26] REGIONAL INDICATOR SYMBOL LETTE... 
1288 | ; 1289 | 1290 | }%% 1291 | -------------------------------------------------------------------------------- /segment.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2015 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 9 | 10 | package segment 11 | 12 | import ( 13 | "errors" 14 | "io" 15 | ) 16 | 17 | // Autogenerate the following: 18 | // 1. Ragel rules from subset of Unicode script properties 19 | // 2. Ragel rules from Unicode word segmentation properties 20 | // 3. Ragel machine for word segmentation 21 | // 4. Test tables from Unicode 22 | // 23 | // Requires: 24 | // 1. Ruby (to generate ragel rules from unicode spec) 25 | // 2. Ragel (only v6.9 tested) 26 | // 3. 
sed (to rewrite build tags) 27 | // 28 | //go:generate ragel/unicode2ragel.rb -u http://www.unicode.org/Public/8.0.0/ucd/Scripts.txt -m SCRIPTS -p Hangul,Han,Hiragana -o ragel/uscript.rl 29 | //go:generate ragel/unicode2ragel.rb -u http://www.unicode.org/Public/8.0.0/ucd/auxiliary/WordBreakProperty.txt -m WB -p Double_Quote,Single_Quote,Hebrew_Letter,CR,LF,Newline,Extend,Format,Katakana,ALetter,MidLetter,MidNum,MidNumLet,Numeric,ExtendNumLet,Regional_Indicator -o ragel/uwb.rl 30 | //go:generate ragel -T1 -Z segment_words.rl -o segment_words.go 31 | //go:generate sed -i "" -e "s/BUILDTAGS/!prod/" segment_words.go 32 | //go:generate sed -i "" -e "s/RAGELFLAGS/-T1/" segment_words.go 33 | //go:generate ragel -G2 -Z segment_words.rl -o segment_words_prod.go 34 | //go:generate sed -i "" -e "s/BUILDTAGS/prod/" segment_words_prod.go 35 | //go:generate sed -i "" -e "s/RAGELFLAGS/-G2/" segment_words_prod.go 36 | //go:generate go run maketesttables.go -output tables_test.go 37 | 38 | // NewWordSegmenter returns a new Segmenter to read from r. 39 | func NewWordSegmenter(r io.Reader) *Segmenter { 40 | return NewSegmenter(r) 41 | } 42 | 43 | // NewWordSegmenterDirect returns a new Segmenter to work directly with buf. 
44 | func NewWordSegmenterDirect(buf []byte) *Segmenter { 45 | return NewSegmenterDirect(buf) 46 | } 47 | 48 | func SplitWords(data []byte, atEOF bool) (int, []byte, error) { 49 | advance, token, _, err := SegmentWords(data, atEOF) 50 | return advance, token, err 51 | } 52 | 53 | func SegmentWords(data []byte, atEOF bool) (int, []byte, int, error) { 54 | vals := make([][]byte, 0, 1) 55 | types := make([]int, 0, 1) 56 | tokens, types, advance, err := segmentWords(data, 1, atEOF, vals, types) 57 | if len(tokens) > 0 { 58 | return advance, tokens[0], types[0], err 59 | } 60 | return advance, nil, 0, err 61 | } 62 | 63 | func SegmentWordsDirect(data []byte, val [][]byte, types []int) ([][]byte, []int, int, error) { 64 | return segmentWords(data, -1, true, val, types) 65 | } 66 | 67 | // *** Core Segmenter 68 | 69 | const maxConsecutiveEmptyReads = 100 70 | 71 | // NewSegmenter returns a new Segmenter to read from r. 72 | // Defaults to segment using SegmentWords 73 | func NewSegmenter(r io.Reader) *Segmenter { 74 | return &Segmenter{ 75 | r: r, 76 | segment: SegmentWords, 77 | maxTokenSize: MaxScanTokenSize, 78 | buf: make([]byte, 4096), // Plausible starting size; needn't be large. 79 | } 80 | } 81 | 82 | // NewSegmenterDirect returns a new Segmenter to work directly with buf. 83 | // Defaults to segment using SegmentWords 84 | func NewSegmenterDirect(buf []byte) *Segmenter { 85 | return &Segmenter{ 86 | segment: SegmentWords, 87 | maxTokenSize: MaxScanTokenSize, 88 | buf: buf, 89 | start: 0, 90 | end: len(buf), 91 | err: io.EOF, 92 | } 93 | } 94 | 95 | // Segmenter provides a convenient interface for reading data such as 96 | // a file of newline-delimited lines of text. Successive calls to 97 | // the Segment method will step through the 'tokens' of a file, skipping 98 | // the bytes between the tokens. 
The specification of a token is 99 | // defined by a segment function of type SegmentFunc; the default segment 100 | // function, SegmentWords, breaks the input into words according to the 101 | // Unicode word segmentation rules and reports a type for each token. 102 | // The client may instead provide a custom segment function with 103 | // SetSegmenter. 104 | // 105 | // Segmenting stops unrecoverably at EOF, the first I/O error, or a token too 106 | // large to fit in the buffer. When a scan stops, the reader may have 107 | // advanced arbitrarily far past the last token. Programs that need more 108 | // control over error handling or large tokens, or must run sequential scans 109 | // on a reader, should use bufio.Reader instead. 110 | // 111 | type Segmenter struct { 112 | r io.Reader // The reader provided by the client. 113 | segment SegmentFunc // The function to split the tokens. 114 | maxTokenSize int // Maximum size of a token; modified by tests. 115 | token []byte // Last token returned by split. 116 | buf []byte // Buffer used as argument to split. 117 | start int // First non-processed byte in buf. 118 | end int // End of data in buf. 119 | typ int // The token type 120 | err error // Sticky error. 121 | } 122 | 123 | // SegmentFunc is the signature of the segmenting function used to tokenize the 124 | // input. The arguments are an initial substring of the remaining unprocessed 125 | // data and a flag, atEOF, that reports whether the Reader has no more data 126 | // to give. The return values are the number of bytes to advance the input, 127 | // the next token to return to the user, and that token's type, plus an error, if any.
If the 128 | // data does not yet hold a complete token, for instance if it has no newline 129 | // while scanning lines, SegmentFunc can return (0, nil, 0, nil) to signal the 130 | // Segmenter to read more data into the slice and try again with a longer slice 131 | // starting at the same point in the input. 132 | // 133 | // If the returned error is non-nil, segmenting stops and the error 134 | // is returned to the client. 135 | // 136 | // The function is never called with an empty data slice unless atEOF 137 | // is true. If atEOF is true, however, data may be non-empty and, 138 | // as always, holds unprocessed text. 139 | type SegmentFunc func(data []byte, atEOF bool) (advance int, token []byte, segmentType int, err error) 140 | 141 | // Errors returned by Segmenter. 142 | var ( 143 | ErrTooLong = errors.New("bufio.Segmenter: token too long") 144 | ErrNegativeAdvance = errors.New("bufio.Segmenter: SplitFunc returns negative advance count") 145 | ErrAdvanceTooFar = errors.New("bufio.Segmenter: SplitFunc returns advance count beyond input") 146 | ) 147 | 148 | const ( 149 | // Maximum size used to buffer a token. The actual maximum token size 150 | // may be smaller as the buffer may need to include, for instance, a newline. 151 | MaxScanTokenSize = 64 * 1024 152 | ) 153 | 154 | // Err returns the first non-EOF error that was encountered by the Segmenter. 155 | func (s *Segmenter) Err() error { 156 | if s.err == io.EOF { 157 | return nil 158 | } 159 | return s.err 160 | } 161 | 162 | func (s *Segmenter) Type() int { 163 | return s.typ 164 | } 165 | 166 | // Bytes returns the most recent token generated by a call to Segment. 167 | // The underlying array may point to data that will be overwritten 168 | // by a subsequent call to Segment. It does no allocation.
169 | func (s *Segmenter) Bytes() []byte { 170 | return s.token 171 | } 172 | 173 | // Text returns the most recent token generated by a call to Segment 174 | // as a newly allocated string holding its bytes. 175 | func (s *Segmenter) Text() string { 176 | return string(s.token) 177 | } 178 | 179 | // Segment advances the Segmenter to the next token, which will then be 180 | // available through the Bytes or Text method. It returns false when the 181 | // scan stops, either by reaching the end of the input or an error. 182 | // After Segment returns false, the Err method will return any error that 183 | // occurred during scanning, except that if it was io.EOF, Err 184 | // will return nil. 185 | func (s *Segmenter) Segment() bool { 186 | // Loop until we have a token. 187 | for { 188 | // See if we can get a token with what we already have. 189 | if s.end > s.start { 190 | advance, token, typ, err := s.segment(s.buf[s.start:s.end], s.err != nil) 191 | if err != nil { 192 | s.setErr(err) 193 | return false 194 | } 195 | s.typ = typ 196 | if !s.advance(advance) { 197 | return false 198 | } 199 | s.token = token 200 | if token != nil { 201 | return true 202 | } 203 | } 204 | // We cannot generate a token with what we are holding. 205 | // If we've already hit EOF or an I/O error, we are done. 206 | if s.err != nil { 207 | // Shut it down. 208 | s.start = 0 209 | s.end = 0 210 | return false 211 | } 212 | // Must read more data. 213 | // First, shift data to beginning of buffer if there's lots of empty space 214 | // or space is needed. 215 | if s.start > 0 && (s.end == len(s.buf) || s.start > len(s.buf)/2) { 216 | copy(s.buf, s.buf[s.start:s.end]) 217 | s.end -= s.start 218 | s.start = 0 219 | } 220 | // Is the buffer full? If so, resize. 
221 | if s.end == len(s.buf) { 222 | if len(s.buf) >= s.maxTokenSize { 223 | s.setErr(ErrTooLong) 224 | return false 225 | } 226 | newSize := len(s.buf) * 2 227 | if newSize > s.maxTokenSize { 228 | newSize = s.maxTokenSize 229 | } 230 | newBuf := make([]byte, newSize) 231 | copy(newBuf, s.buf[s.start:s.end]) 232 | s.buf = newBuf 233 | s.end -= s.start 234 | s.start = 0 235 | continue 236 | } 237 | // Finally we can read some input. Make sure we don't get stuck with 238 | // a misbehaving Reader. Officially we don't need to do this, but let's 239 | // be extra careful: Segmenter is for safe, simple jobs. 240 | for loop := 0; ; { 241 | n, err := s.r.Read(s.buf[s.end:len(s.buf)]) 242 | s.end += n 243 | if err != nil { 244 | s.setErr(err) 245 | break 246 | } 247 | if n > 0 { 248 | break 249 | } 250 | loop++ 251 | if loop > maxConsecutiveEmptyReads { 252 | s.setErr(io.ErrNoProgress) 253 | break 254 | } 255 | } 256 | } 257 | } 258 | 259 | // advance consumes n bytes of the buffer. It reports whether the advance was legal. 260 | func (s *Segmenter) advance(n int) bool { 261 | if n < 0 { 262 | s.setErr(ErrNegativeAdvance) 263 | return false 264 | } 265 | if n > s.end-s.start { 266 | s.setErr(ErrAdvanceTooFar) 267 | return false 268 | } 269 | s.start += n 270 | return true 271 | } 272 | 273 | // setErr records the first error encountered. 274 | func (s *Segmenter) setErr(err error) { 275 | if s.err == nil || s.err == io.EOF { 276 | s.err = err 277 | } 278 | } 279 | 280 | // SetSegmenter sets the segment function for the Segmenter. If called, it must be 281 | // called before Segment. 282 | func (s *Segmenter) SetSegmenter(segmenter SegmentFunc) { 283 | s.segment = segmenter 284 | } 285 | -------------------------------------------------------------------------------- /segment_fuzz.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2015 Couchbase, Inc. 
2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 9 | 10 | // +build gofuzz 11 | 12 | package segment 13 | 14 | func Fuzz(data []byte) int { 15 | 16 | vals := make([][]byte, 0, 10000) 17 | types := make([]int, 0, 10000) 18 | if _, _, _, err := SegmentWordsDirect(data, vals, types); err != nil { 19 | return 0 20 | } 21 | return 1 22 | } 23 | -------------------------------------------------------------------------------- /segment_fuzz_test.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2014 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 
9 | 10 | // +build gofuzz_generate 11 | 12 | package segment 13 | 14 | import ( 15 | "io/ioutil" 16 | "os" 17 | "strconv" 18 | "testing" 19 | ) 20 | 21 | const fuzzPrefix = "workdir/corpus" 22 | 23 | func TestGenerateWordSegmentFuzz(t *testing.T) { 24 | 25 | os.MkdirAll(fuzzPrefix, 0777) 26 | for i, test := range unicodeWordTests { 27 | ioutil.WriteFile(fuzzPrefix+"/"+strconv.Itoa(i)+".txt", test.input, 0777) 28 | } 29 | } 30 | -------------------------------------------------------------------------------- /segment_test.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2014 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 9 | 10 | package segment 11 | 12 | import ( 13 | "bufio" 14 | "bytes" 15 | "errors" 16 | "io" 17 | "strings" 18 | "testing" 19 | ) 20 | 21 | // Tests borrowed from Scanner to test Segmenter 22 | 23 | // slowReader is a reader that returns only a few bytes at a time, to test the incremental 24 | // reads in Scanner.Scan. 25 | type slowReader struct { 26 | max int 27 | buf io.Reader 28 | } 29 | 30 | func (sr *slowReader) Read(p []byte) (n int, err error) { 31 | if len(p) > sr.max { 32 | p = p[0:sr.max] 33 | } 34 | return sr.buf.Read(p) 35 | } 36 | 37 | // genLine writes to buf a predictable but non-trivial line of text of length 38 | // n, including the terminal newline and an occasional carriage return. 39 | // If addNewline is false, the \r and \n are not emitted. 
40 | func genLine(buf *bytes.Buffer, lineNum, n int, addNewline bool) { 41 | buf.Reset() 42 | doCR := lineNum%5 == 0 43 | if doCR { 44 | n-- 45 | } 46 | for i := 0; i < n-1; i++ { // Stop early for \n. 47 | c := 'a' + byte(lineNum+i) 48 | if c == '\n' || c == '\r' { // Don't confuse us. 49 | c = 'N' 50 | } 51 | buf.WriteByte(c) 52 | } 53 | if addNewline { 54 | if doCR { 55 | buf.WriteByte('\r') 56 | } 57 | buf.WriteByte('\n') 58 | } 59 | return 60 | } 61 | 62 | func wrapSplitFuncAsSegmentFuncForTesting(splitFunc bufio.SplitFunc) SegmentFunc { 63 | return func(data []byte, atEOF bool) (advance int, token []byte, typ int, err error) { 64 | typ = 0 65 | advance, token, err = splitFunc(data, atEOF) 66 | return 67 | } 68 | } 69 | 70 | // Test that the line segmenter errors out on a long line. 71 | func TestSegmentTooLong(t *testing.T) { 72 | const smallMaxTokenSize = 256 // Much smaller for more efficient testing. 73 | // Build a buffer of lots of line lengths up to but not exceeding smallMaxTokenSize. 
74 | tmp := new(bytes.Buffer) 75 | buf := new(bytes.Buffer) 76 | lineNum := 0 77 | j := 0 78 | for i := 0; i < 2*smallMaxTokenSize; i++ { 79 | genLine(tmp, lineNum, j, true) 80 | j++ 81 | buf.Write(tmp.Bytes()) 82 | lineNum++ 83 | } 84 | s := NewSegmenter(&slowReader{3, buf}) 85 | // change to line segmenter for testing 86 | s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(bufio.ScanLines)) 87 | s.MaxTokenSize(smallMaxTokenSize) 88 | j = 0 89 | for lineNum := 0; s.Segment(); lineNum++ { 90 | genLine(tmp, lineNum, j, false) 91 | if j < smallMaxTokenSize { 92 | j++ 93 | } else { 94 | j-- 95 | } 96 | line := tmp.Bytes() 97 | if !bytes.Equal(s.Bytes(), line) { 98 | t.Errorf("%d: bad line: %d %d\n%.100q\n%.100q\n", lineNum, len(s.Bytes()), len(line), s.Bytes(), line) 99 | } 100 | } 101 | err := s.Err() 102 | if err != ErrTooLong { 103 | t.Fatalf("expected ErrTooLong; got %s", err) 104 | } 105 | } 106 | 107 | var testError = errors.New("testError") 108 | 109 | // Test the correct error is returned when the split function errors out. 110 | func TestSegmentError(t *testing.T) { 111 | // Create a split function that delivers a little data, then a predictable error. 112 | numSplits := 0 113 | const okCount = 7 114 | errorSplit := func(data []byte, atEOF bool) (advance int, token []byte, err error) { 115 | if atEOF { 116 | panic("didn't get enough data") 117 | } 118 | if numSplits >= okCount { 119 | return 0, nil, testError 120 | } 121 | numSplits++ 122 | return 1, data[0:1], nil 123 | } 124 | // Read the data. 
125 | const text = "abcdefghijklmnopqrstuvwxyz" 126 | buf := strings.NewReader(text) 127 | s := NewSegmenter(&slowReader{1, buf}) 128 | // use the error-producing split function for testing 129 | s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(errorSplit)) 130 | var i int 131 | for i = 0; s.Segment(); i++ { 132 | if len(s.Bytes()) != 1 || text[i] != s.Bytes()[0] { 133 | t.Errorf("#%d: expected %q got %q", i, text[i], s.Bytes()[0]) 134 | } 135 | } 136 | // Check correct termination location and error. 137 | if i != okCount { 138 | t.Errorf("unexpected termination; expected %d tokens got %d", okCount, i) 139 | } 140 | err := s.Err() 141 | if err != testError { 142 | t.Fatalf("expected %q got %v", testError, err) 143 | } 144 | } 145 | 146 | // Test that Segment finishes if we have endless empty reads. 147 | type endlessZeros struct{} 148 | 149 | func (endlessZeros) Read(p []byte) (int, error) { 150 | return 0, nil 151 | } 152 | 153 | func TestBadReader(t *testing.T) { 154 | scanner := NewSegmenter(endlessZeros{}) 155 | for scanner.Segment() { 156 | t.Fatal("read should fail") 157 | } 158 | err := scanner.Err() 159 | if err != io.ErrNoProgress { 160 | t.Errorf("unexpected error: %v", err) 161 | } 162 | } 163 | 164 | func TestSegmentAdvanceNegativeError(t *testing.T) { 165 | errorSplit := func(data []byte, atEOF bool) (advance int, token []byte, err error) { 166 | if atEOF { 167 | panic("didn't get enough data") 168 | } 169 | return -1, data[0:1], nil 170 | } 171 | // Read the data.
172 | const text = "abcdefghijklmnopqrstuvwxyz" 173 | buf := strings.NewReader(text) 174 | s := NewSegmenter(&slowReader{1, buf}) 175 | // use the misbehaving split function for testing 176 | s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(errorSplit)) 177 | s.Segment() 178 | err := s.Err() 179 | if err != ErrNegativeAdvance { 180 | t.Fatalf("expected %q got %v", ErrNegativeAdvance, err) 181 | } 182 | } 183 | 184 | func TestSegmentAdvanceTooFarError(t *testing.T) { 185 | errorSplit := func(data []byte, atEOF bool) (advance int, token []byte, err error) { 186 | if atEOF { 187 | panic("didn't get enough data") 188 | } 189 | return len(data) + 10, data[0:1], nil 190 | } 191 | // Read the data. 192 | const text = "abcdefghijklmnopqrstuvwxyz" 193 | buf := strings.NewReader(text) 194 | s := NewSegmenter(&slowReader{1, buf}) 195 | // use the misbehaving split function for testing 196 | s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(errorSplit)) 197 | s.Segment() 198 | err := s.Err() 199 | if err != ErrAdvanceTooFar { 200 | t.Fatalf("expected %q got %v", ErrAdvanceTooFar, err) 201 | } 202 | } 203 | 204 | func TestSegmentLongTokens(t *testing.T) { 205 | // Read the data. 206 | text := bytes.Repeat([]byte("abcdefghijklmnop"), 257) 207 | buf := strings.NewReader(string(text)) 208 | s := NewSegmenter(&slowReader{1, buf}) 209 | // change to line segmenter for testing 210 | s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(bufio.ScanLines)) 211 | for s.Segment() { 212 | line := s.Bytes() 213 | if !bytes.Equal(text, line) { 214 | t.Errorf("expected %s, got %s", text, line) 215 | } 216 | } 217 | err := s.Err() 218 | if err != nil { 219 | t.Fatalf("unexpected error; got %s", err) 220 | } 221 | } 222 | 223 | func TestSegmentLongTokensDontDouble(t *testing.T) { 224 | // Read the data.
225 | text := bytes.Repeat([]byte("abcdefghijklmnop"), 257) 226 | buf := strings.NewReader(string(text)) 227 | s := NewSegmenter(&slowReader{1, buf}) 228 | // change to line segmenter for testing 229 | s.SetSegmenter(wrapSplitFuncAsSegmentFuncForTesting(bufio.ScanLines)) 230 | s.MaxTokenSize(6144) 231 | for s.Segment() { 232 | line := s.Bytes() 233 | if !bytes.Equal(text, line) { 234 | t.Errorf("expected %s, got %s", text, line) 235 | } 236 | } 237 | err := s.Err() 238 | if err != nil { 239 | t.Fatalf("unexpected error; got %s", err) 240 | } 241 | } 242 | -------------------------------------------------------------------------------- /segment_words.rl: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2015 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. See the License for the specific language governing permissions 8 | // and limitations under the License. 
9 | 10 | // +build BUILDTAGS 11 | 12 | package segment 13 | 14 | import ( 15 | "fmt" 16 | "unicode/utf8" 17 | ) 18 | 19 | var RagelFlags = "RAGELFLAGS" 20 | 21 | var ParseError = fmt.Errorf("unicode word segmentation parse error") 22 | 23 | // Word Types 24 | const ( 25 | None = iota 26 | Number 27 | Letter 28 | Kana 29 | Ideo 30 | ) 31 | 32 | %%{ 33 | machine s; 34 | write data; 35 | }%% 36 | 37 | func segmentWords(data []byte, maxTokens int, atEOF bool, val [][]byte, types []int) ([][]byte, []int, int, error) { 38 | cs, p, pe := 0, 0, len(data) 39 | cap := maxTokens 40 | if cap < 0 { 41 | cap = 1000 42 | } 43 | if val == nil { 44 | val = make([][]byte, 0, cap) 45 | } 46 | if types == nil { 47 | types = make([]int, 0, cap) 48 | } 49 | 50 | // added for scanner 51 | ts := 0 52 | te := 0 53 | act := 0 54 | eof := pe 55 | _ = ts // compiler not happy 56 | _ = te 57 | _ = act 58 | 59 | // our state 60 | startPos := 0 61 | endPos := 0 62 | totalConsumed := 0 63 | %%{ 64 | 65 | include SCRIPTS "ragel/uscript.rl"; 66 | include WB "ragel/uwb.rl"; 67 | 68 | action startToken { 69 | startPos = p 70 | } 71 | 72 | action endToken { 73 | endPos = p 74 | } 75 | 76 | action finishNumericToken { 77 | if !atEOF { 78 | return val, types, totalConsumed, nil 79 | } 80 | 81 | val = append(val, data[startPos:endPos+1]) 82 | types = append(types, Number) 83 | totalConsumed = endPos+1 84 | if maxTokens > 0 && len(val) >= maxTokens { 85 | return val, types, totalConsumed, nil 86 | } 87 | } 88 | 89 | action finishHangulToken { 90 | if endPos+1 == pe && !atEOF { 91 | return val, types, totalConsumed, nil 92 | } else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 { 93 | return val, types, totalConsumed, nil 94 | } 95 | 96 | val = append(val, data[startPos:endPos+1]) 97 | types = append(types, Letter) 98 | totalConsumed = endPos+1 99 | if maxTokens > 0 && len(val) >= maxTokens { 100 | return val, types, totalConsumed, nil 101 | } 102 | } 103 | 104 | action 
finishKatakanaToken { 105 | if endPos+1 == pe && !atEOF { 106 | return val, types, totalConsumed, nil 107 | } else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 { 108 | return val, types, totalConsumed, nil 109 | } 110 | 111 | val = append(val, data[startPos:endPos+1]) 112 | types = append(types, Ideo) 113 | totalConsumed = endPos+1 114 | if maxTokens > 0 && len(val) >= maxTokens { 115 | return val, types, totalConsumed, nil 116 | } 117 | } 118 | 119 | action finishWordToken { 120 | if !atEOF { 121 | return val, types, totalConsumed, nil 122 | } 123 | val = append(val, data[startPos:endPos+1]) 124 | types = append(types, Letter) 125 | totalConsumed = endPos+1 126 | if maxTokens > 0 && len(val) >= maxTokens { 127 | return val, types, totalConsumed, nil 128 | } 129 | } 130 | 131 | action finishHanToken { 132 | if endPos+1 == pe && !atEOF { 133 | return val, types, totalConsumed, nil 134 | } else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 { 135 | return val, types, totalConsumed, nil 136 | } 137 | 138 | val = append(val, data[startPos:endPos+1]) 139 | types = append(types, Ideo) 140 | totalConsumed = endPos+1 141 | if maxTokens > 0 && len(val) >= maxTokens { 142 | return val, types, totalConsumed, nil 143 | } 144 | } 145 | 146 | action finishHiraganaToken { 147 | if endPos+1 == pe && !atEOF { 148 | return val, types, totalConsumed, nil 149 | } else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 { 150 | return val, types, totalConsumed, nil 151 | } 152 | 153 | val = append(val, data[startPos:endPos+1]) 154 | types = append(types, Ideo) 155 | totalConsumed = endPos+1 156 | if maxTokens > 0 && len(val) >= maxTokens { 157 | return val, types, totalConsumed, nil 158 | } 159 | } 160 | 161 | action finishNoneToken { 162 | lastPos := startPos 163 | for lastPos <= endPos { 164 | _, size := utf8.DecodeRune(data[lastPos:]) 165 | lastPos += size 166 | } 167 | endPos = 
lastPos -1 168 | p = endPos 169 | 170 | if endPos+1 == pe && !atEOF { 171 | return val, types, totalConsumed, nil 172 | } else if dr, size := utf8.DecodeRune(data[endPos+1:]); dr == utf8.RuneError && size == 1 { 173 | return val, types, totalConsumed, nil 174 | } 175 | // otherwise, consume this as well 176 | val = append(val, data[startPos:endPos+1]) 177 | types = append(types, None) 178 | totalConsumed = endPos+1 179 | if maxTokens > 0 && len(val) >= maxTokens { 180 | return val, types, totalConsumed, nil 181 | } 182 | } 183 | 184 | HangulEx = Hangul ( Extend | Format )*; 185 | HebrewOrALetterEx = ( Hebrew_Letter | ALetter ) ( Extend | Format )*; 186 | NumericEx = Numeric ( Extend | Format )*; 187 | KatakanaEx = Katakana ( Extend | Format )*; 188 | MidLetterEx = ( MidLetter | MidNumLet | Single_Quote ) ( Extend | Format )*; 189 | MidNumericEx = ( MidNum | MidNumLet | Single_Quote ) ( Extend | Format )*; 190 | ExtendNumLetEx = ExtendNumLet ( Extend | Format )*; 191 | HanEx = Han ( Extend | Format )*; 192 | HiraganaEx = Hiragana ( Extend | Format )*; 193 | SingleQuoteEx = Single_Quote ( Extend | Format )*; 194 | DoubleQuoteEx = Double_Quote ( Extend | Format )*; 195 | HebrewLetterEx = Hebrew_Letter ( Extend | Format )*; 196 | RegionalIndicatorEx = Regional_Indicator ( Extend | Format )*; 197 | NLCRLF = Newline | CR | LF; 198 | OtherEx = ^(NLCRLF) ( Extend | Format )* ; 199 | 200 | # UAX#29 WB8. Numeric × Numeric 201 | # WB11. Numeric (MidNum | MidNumLet | Single_Quote) × Numeric 202 | # WB12. Numeric × (MidNum | MidNumLet | Single_Quote) Numeric 203 | # WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet 204 | # WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana) 205 | # 206 | WordNumeric = ( ( ExtendNumLetEx )* NumericEx ( ( ( ExtendNumLetEx )* | MidNumericEx ) NumericEx )* ( ExtendNumLetEx )* ) >startToken @endToken; 207 | 208 | # subset of the below for typing purposes only! 
209 | WordHangul = ( HangulEx )+ >startToken @endToken; 210 | WordKatakana = ( KatakanaEx )+ >startToken @endToken; 211 | 212 | # UAX#29 WB5. (ALetter | Hebrew_Letter) × (ALetter | Hebrew_Letter) 213 | # WB6. (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter) 214 | # WB7. (ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote) × (ALetter | Hebrew_Letter) 215 | # WB7a. Hebrew_Letter × Single_Quote 216 | # WB7b. Hebrew_Letter × Double_Quote Hebrew_Letter 217 | # WB7c. Hebrew_Letter Double_Quote × Hebrew_Letter 218 | # WB9. (ALetter | Hebrew_Letter) × Numeric 219 | # WB10. Numeric × (ALetter | Hebrew_Letter) 220 | # WB13. Katakana × Katakana 221 | # WB13a. (ALetter | Hebrew_Letter | Numeric | Katakana | ExtendNumLet) × ExtendNumLet 222 | # WB13b. ExtendNumLet × (ALetter | Hebrew_Letter | Numeric | Katakana) 223 | # 224 | # Marty -deviated here to allow for (ExtendNumLetEx x ExtendNumLetEx) part of 13a 225 | # 226 | Word = ( ( ExtendNumLetEx )* ( KatakanaEx ( ( ExtendNumLetEx )* KatakanaEx )* 227 | | ( HebrewLetterEx ( SingleQuoteEx | DoubleQuoteEx HebrewLetterEx ) 228 | | NumericEx ( ( ( ExtendNumLetEx )* | MidNumericEx ) NumericEx )* 229 | | HebrewOrALetterEx ( ( ( ExtendNumLetEx )* | MidLetterEx ) HebrewOrALetterEx )* 230 | |ExtendNumLetEx 231 | )+ 232 | ) 233 | ( 234 | ( ExtendNumLetEx )+ ( KatakanaEx ( ( ExtendNumLetEx )* KatakanaEx )* 235 | | ( HebrewLetterEx ( SingleQuoteEx | DoubleQuoteEx HebrewLetterEx ) 236 | | NumericEx ( ( ( ExtendNumLetEx )* | MidNumericEx ) NumericEx )* 237 | | HebrewOrALetterEx ( ( ( ExtendNumLetEx )* | MidLetterEx ) HebrewOrALetterEx )* 238 | )+ 239 | ) 240 | )* ExtendNumLetEx*) >startToken @endToken; 241 | 242 | # UAX#29 WB14. 
Any ÷ Any 243 | WordHan = HanEx >startToken @endToken; 244 | WordHiragana = HiraganaEx >startToken @endToken; 245 | 246 | WordExt = ( ( Extend | Format )* ) >startToken @endToken; # maybe plus not star 247 | 248 | WordCRLF = (CR LF) >startToken @endToken; 249 | 250 | WordCR = CR >startToken @endToken; 251 | 252 | WordLF = LF >startToken @endToken; 253 | 254 | WordNL = Newline >startToken @endToken; 255 | 256 | WordRegional = (RegionalIndicatorEx+) >startToken @endToken; 257 | 258 | Other = OtherEx >startToken @endToken; 259 | 260 | main := |* 261 | WordNumeric => finishNumericToken; 262 | WordHangul => finishHangulToken; 263 | WordKatakana => finishKatakanaToken; 264 | Word => finishWordToken; 265 | WordHan => finishHanToken; 266 | WordHiragana => finishHiraganaToken; 267 | WordRegional =>finishNoneToken; 268 | WordCRLF => finishNoneToken; 269 | WordCR => finishNoneToken; 270 | WordLF => finishNoneToken; 271 | WordNL => finishNoneToken; 272 | WordExt => finishNoneToken; 273 | Other => finishNoneToken; 274 | *|; 275 | 276 | write init; 277 | write exec; 278 | }%% 279 | 280 | if cs < s_first_final { 281 | return val, types, totalConsumed, ParseError 282 | } 283 | 284 | return val, types, totalConsumed, nil 285 | } 286 | -------------------------------------------------------------------------------- /segment_words_test.go: -------------------------------------------------------------------------------- 1 | // Copyright (c) 2014 Couchbase, Inc. 2 | // Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file 3 | // except in compliance with the License. You may obtain a copy of the License at 4 | // http://www.apache.org/licenses/LICENSE-2.0 5 | // Unless required by applicable law or agreed to in writing, software distributed under the 6 | // License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, 7 | // either express or implied. 
See the License for the specific language governing permissions 8 | // and limitations under the License. 9 | 10 | package segment 11 | 12 | import ( 13 | "bufio" 14 | "bytes" 15 | "reflect" 16 | "strings" 17 | "testing" 18 | ) 19 | 20 | func TestAdhocSegmentsWithType(t *testing.T) { 21 | 22 | tests := []struct { 23 | input []byte 24 | output [][]byte 25 | outputStrings []string 26 | outputTypes []int 27 | }{ 28 | { 29 | input: []byte("Now is the.\n End."), 30 | output: [][]byte{ 31 | []byte("Now"), 32 | []byte(" "), 33 | []byte(" "), 34 | []byte("is"), 35 | []byte(" "), 36 | []byte("the"), 37 | []byte("."), 38 | []byte("\n"), 39 | []byte(" "), 40 | []byte("End"), 41 | []byte("."), 42 | }, 43 | outputStrings: []string{ 44 | "Now", 45 | " ", 46 | " ", 47 | "is", 48 | " ", 49 | "the", 50 | ".", 51 | "\n", 52 | " ", 53 | "End", 54 | ".", 55 | }, 56 | outputTypes: []int{ 57 | Letter, 58 | None, 59 | None, 60 | Letter, 61 | None, 62 | Letter, 63 | None, 64 | None, 65 | None, 66 | Letter, 67 | None, 68 | }, 69 | }, 70 | { 71 | input: []byte("3.5"), 72 | output: [][]byte{ 73 | []byte("3.5"), 74 | }, 75 | outputStrings: []string{ 76 | "3.5", 77 | }, 78 | outputTypes: []int{ 79 | Number, 80 | }, 81 | }, 82 | { 83 | input: []byte("cat3.5"), 84 | output: [][]byte{ 85 | []byte("cat3.5"), 86 | }, 87 | outputStrings: []string{ 88 | "cat3.5", 89 | }, 90 | outputTypes: []int{ 91 | Letter, 92 | }, 93 | }, 94 | { 95 | input: []byte("c"), 96 | output: [][]byte{ 97 | []byte("c"), 98 | }, 99 | outputStrings: []string{ 100 | "c", 101 | }, 102 | outputTypes: []int{ 103 | Letter, 104 | }, 105 | }, 106 | { 107 | input: []byte("こんにちは世界"), 108 | output: [][]byte{ 109 | []byte("こ"), 110 | []byte("ん"), 111 | []byte("に"), 112 | []byte("ち"), 113 | []byte("は"), 114 | []byte("世"), 115 | []byte("界"), 116 | }, 117 | outputStrings: []string{ 118 | "こ", 119 | "ん", 120 | "に", 121 | "ち", 122 | "は", 123 | "世", 124 | "界", 125 | }, 126 | outputTypes: []int{ 127 | Ideo, 128 | Ideo, 129 | Ideo, 130 | Ideo, 
131 | Ideo, 132 | Ideo, 133 | Ideo, 134 | }, 135 | }, 136 | { 137 | input: []byte("你好世界"), 138 | output: [][]byte{ 139 | []byte("你"), 140 | []byte("好"), 141 | []byte("世"), 142 | []byte("界"), 143 | }, 144 | outputStrings: []string{ 145 | "你", 146 | "好", 147 | "世", 148 | "界", 149 | }, 150 | outputTypes: []int{ 151 | Ideo, 152 | Ideo, 153 | Ideo, 154 | Ideo, 155 | }, 156 | }, 157 | { 158 | input: []byte("サッカ"), 159 | output: [][]byte{ 160 | []byte("サッカ"), 161 | }, 162 | outputStrings: []string{ 163 | "サッカ", 164 | }, 165 | outputTypes: []int{ 166 | Ideo, 167 | }, 168 | }, 169 | // test for wb7b/wb7c 170 | { 171 | input: []byte(`א"א`), 172 | output: [][]byte{ 173 | []byte(`א"א`), 174 | }, 175 | outputStrings: []string{ 176 | `א"א`, 177 | }, 178 | outputTypes: []int{ 179 | Letter, 180 | }, 181 | }, 182 | } 183 | 184 | for _, test := range tests { 185 | rv := make([][]byte, 0) 186 | rvstrings := make([]string, 0) 187 | rvtypes := make([]int, 0) 188 | segmenter := NewWordSegmenter(bytes.NewReader(test.input)) 189 | // Set the split function for the scanning operation. 
190 | for segmenter.Segment() { 191 | rv = append(rv, segmenter.Bytes()) 192 | rvstrings = append(rvstrings, segmenter.Text()) 193 | rvtypes = append(rvtypes, segmenter.Type()) 194 | } 195 | if err := segmenter.Err(); err != nil { 196 | t.Fatal(err) 197 | } 198 | if !reflect.DeepEqual(rv, test.output) { 199 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.output, rv, test.input) 200 | } 201 | if !reflect.DeepEqual(rvstrings, test.outputStrings) { 202 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputStrings, rvstrings, test.input) 203 | } 204 | if !reflect.DeepEqual(rvtypes, test.outputTypes) { 205 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputTypes, rvtypes, test.input) 206 | } 207 | } 208 | 209 | // run the same tests again with the direct segmenter 210 | for _, test := range tests { 211 | rv := make([][]byte, 0) 212 | rvstrings := make([]string, 0) 213 | rvtypes := make([]int, 0) 214 | segmenter := NewWordSegmenterDirect(test.input) 215 | // Set the split function for the scanning operation.
216 | for segmenter.Segment() { 217 | rv = append(rv, segmenter.Bytes()) 218 | rvstrings = append(rvstrings, segmenter.Text()) 219 | rvtypes = append(rvtypes, segmenter.Type()) 220 | } 221 | if err := segmenter.Err(); err != nil { 222 | t.Fatal(err) 223 | } 224 | if !reflect.DeepEqual(rv, test.output) { 225 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.output, rv, test.input) 226 | } 227 | if !reflect.DeepEqual(rvstrings, test.outputStrings) { 228 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputStrings, rvstrings, test.input) 229 | } 230 | if !reflect.DeepEqual(rvtypes, test.outputTypes) { 231 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s'", test.outputTypes, rvtypes, test.input) 232 | } 233 | } 234 | 235 | } 236 | 237 | func TestUnicodeSegments(t *testing.T) { 238 | 239 | for _, test := range unicodeWordTests { 240 | rv := make([][]byte, 0) 241 | scanner := bufio.NewScanner(bytes.NewReader(test.input)) 242 | // Set the split function for the scanning operation.
243 | scanner.Split(SplitWords) 244 | for scanner.Scan() { 245 | rv = append(rv, scanner.Bytes()) 246 | } 247 | if err := scanner.Err(); err != nil { 248 | t.Fatal(err) 249 | } 250 | if !reflect.DeepEqual(rv, test.output) { 251 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: '%s' comment: %s", test.output, rv, test.input, test.comment) 252 | } 253 | } 254 | } 255 | 256 | func TestUnicodeSegmentsSlowReader(t *testing.T) { 257 | 258 | for i, test := range unicodeWordTests { 259 | rv := make([][]byte, 0) 260 | segmenter := NewWordSegmenter(&slowReader{1, bytes.NewReader(test.input)}) 261 | for segmenter.Segment() { 262 | rv = append(rv, segmenter.Bytes()) 263 | } 264 | if err := segmenter.Err(); err != nil { 265 | t.Fatal(err) 266 | } 267 | if !reflect.DeepEqual(rv, test.output) { 268 | t.Fatalf("expected:\n%#v\ngot:\n%#v\nfor: %d '%s' comment: %s", test.output, rv, i, test.input, test.comment) 269 | } 270 | } 271 | } 272 | 273 | func TestWordSegmentLongInputSlowReader(t *testing.T) { 274 | // Read the data. 
275 | text := bytes.Repeat([]byte("abcdefghijklmnop"), 26) 276 | buf := strings.NewReader(string(text) + " cat") 277 | s := NewSegmenter(&slowReader{1, buf}) 278 | s.MaxTokenSize(6144) 279 | for s.Segment() { 280 | } 281 | err := s.Err() 282 | if err != nil { 283 | t.Fatalf("unexpected error; got '%s'", err) 284 | } 285 | finalWord := s.Text() 286 | if s.Text() != "cat" { 287 | t.Errorf("expected 'cat' got '%s'", finalWord) 288 | } 289 | } 290 | 291 | func BenchmarkSplitWords(b *testing.B) { 292 | for i := 0; i < b.N; i++ { 293 | vals := make([][]byte, 0) 294 | scanner := bufio.NewScanner(bytes.NewReader(bleveWikiArticle)) 295 | scanner.Split(SplitWords) 296 | for scanner.Scan() { 297 | vals = append(vals, scanner.Bytes()) 298 | } 299 | if err := scanner.Err(); err != nil { 300 | b.Fatal(err) 301 | } 302 | if len(vals) != 3465 { 303 | b.Fatalf("expected 3465 tokens, got %d", len(vals)) 304 | } 305 | } 306 | 307 | } 308 | 309 | func BenchmarkWordSegmenter(b *testing.B) { 310 | 311 | for i := 0; i < b.N; i++ { 312 | vals := make([][]byte, 0) 313 | types := make([]int, 0) 314 | segmenter := NewWordSegmenter(bytes.NewReader(bleveWikiArticle)) 315 | for segmenter.Segment() { 316 | vals = append(vals, segmenter.Bytes()) 317 | types = append(types, segmenter.Type()) 318 | } 319 | if err := segmenter.Err(); err != nil { 320 | b.Fatal(err) 321 | } 322 | if vals == nil { 323 | b.Fatalf("expected non-nil vals") 324 | } 325 | if types == nil { 326 | b.Fatalf("expected non-nil types") 327 | } 328 | } 329 | } 330 | 331 | func BenchmarkWordSegmenterDirect(b *testing.B) { 332 | 333 | for i := 0; i < b.N; i++ { 334 | vals := make([][]byte, 0) 335 | types := make([]int, 0) 336 | segmenter := NewWordSegmenterDirect(bleveWikiArticle) 337 | for segmenter.Segment() { 338 | vals = append(vals, segmenter.Bytes()) 339 | types = append(types, segmenter.Type()) 340 | } 341 | if err := segmenter.Err(); err != nil { 342 | b.Fatal(err) 343 | } 344 | if vals == nil { 345 | b.Fatalf("expected 
non-nil vals") 346 | } 347 | if types == nil { 348 | b.Fatalf("expected non-nil types") 349 | } 350 | } 351 | } 352 | 353 | func BenchmarkDirect(b *testing.B) { 354 | 355 | for i := 0; i < b.N; i++ { 356 | vals := make([][]byte, 0, 10000) 357 | types := make([]int, 0, 10000) 358 | vals, types, _, err := SegmentWordsDirect(bleveWikiArticle, vals, types) 359 | if err != nil { 360 | b.Fatal(err) 361 | } 362 | if vals == nil { 363 | b.Fatalf("expected non-nil vals") 364 | } 365 | if types == nil { 366 | b.Fatalf("expected non-nil types") 367 | } 368 | } 369 | } 370 | 371 | var bleveWikiArticle = []byte(`Boiling liquid expanding vapor explosion 372 | From Wikipedia, the free encyclopedia 373 | See also: Boiler explosion and Steam explosion 374 | 375 | Flames subsequent to a flammable liquid BLEVE from a tanker. BLEVEs do not necessarily involve fire. 376 | 377 | This article's tone or style may not reflect the encyclopedic tone used on Wikipedia. See Wikipedia's guide to writing better articles for suggestions. (July 2013) 378 | A boiling liquid expanding vapor explosion (BLEVE, /ˈblɛviː/ blev-ee) is an explosion caused by the rupture of a vessel containing a pressurized liquid above its boiling point.[1] 379 | Contents [hide] 380 | 1 Mechanism 381 | 1.1 Water example 382 | 1.2 BLEVEs without chemical reactions 383 | 2 Fires 384 | 3 Incidents 385 | 4 Safety measures 386 | 5 See also 387 | 6 References 388 | 7 External links 389 | Mechanism[edit] 390 | 391 | This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2013) 392 | There are three characteristics of liquids which are relevant to the discussion of a BLEVE: 393 | If a liquid in a sealed container is boiled, the pressure inside the container increases. As the liquid changes to a gas it expands - this expansion in a vented container would cause the gas and liquid to take up more space. 
In a sealed container the gas and liquid are not able to take up more space and so the pressure rises. Pressurized vessels containing liquids can reach an equilibrium where the liquid stops boiling and the pressure stops rising. This occurs when no more heat is being added to the system (either because it has reached ambient temperature or has had a heat source removed). 394 | The boiling temperature of a liquid is dependent on pressure - high pressures will yield high boiling temperatures, and low pressures will yield low boiling temperatures. A common simple experiment is to place a cup of water in a vacuum chamber, and then reduce the pressure in the chamber until the water boils. By reducing the pressure the water will boil even at room temperature. This works both ways - if the pressure is increased beyond normal atmospheric pressures, the boiling of hot water could be suppressed far beyond normal temperatures. The cooling system of a modern internal combustion engine is a real-world example. 395 | When a liquid boils it turns into a gas. The resulting gas takes up far more space than the liquid did. 396 | Typically, a BLEVE starts with a container of liquid which is held above its normal, atmospheric-pressure boiling temperature. Many substances normally stored as liquids, such as CO2, propane, and other similar industrial gases have boiling temperatures, at atmospheric pressure, far below room temperature. In the case of water, a BLEVE could occur if a pressurized chamber of water is heated far beyond the standard 100 °C (212 °F). That container, because the boiling water pressurizes it, is capable of holding liquid water at very high temperatures. 397 | If the pressurized vessel, containing liquid at high temperature (which may be room temperature, depending on the substance) ruptures, the pressure which prevents the liquid from boiling is lost. 
If the rupture is catastrophic, where the vessel is immediately incapable of holding any pressure at all, then there suddenly exists a large mass of liquid which is at very high temperature and very low pressure. This causes the entire volume of liquid to instantaneously boil, which in turn causes an extremely rapid expansion. Depending on temperatures, pressures and the substance involved, that expansion may be so rapid that it can be classified as an explosion, fully capable of inflicting severe damage on its surroundings. 398 | Water example[edit] 399 | Imagine, for example, a tank of pressurized liquid water held at 204.4 °C (400 °F). This tank would normally be pressurized to 1.7 MPa (250 psi) above atmospheric ("gauge") pressure. If the tank containing the water were to rupture, there would for a slight moment exist a volume of liquid water which would be 400 | at atmospheric pressure, and 401 | 204.4 °C (400 °F). 402 | At atmospheric pressure the boiling point of water is 100 °C (212 °F) - liquid water at atmospheric pressure cannot exist at temperatures higher than 100 °C (212 °F). At that moment, the water would boil and turn to vapour explosively, and the 204.4 °C (400 °F) liquid water turned to gas would take up a lot more volume than it did as liquid, causing a vapour explosion. Such explosions can happen when the superheated water of a steam engine escapes through a crack in a boiler, causing a boiler explosion. 403 | BLEVEs without chemical reactions[edit] 404 | It is important to note that a BLEVE need not be a chemical explosion—nor does there need to be a fire—however if a flammable substance is subject to a BLEVE it may also be subject to intense heating, either from an external source of heat which may have caused the vessel to rupture in the first place or from an internal source of localized heating such as skin friction. This heating can cause a flammable substance to ignite, adding a secondary explosion caused by the primary BLEVE. 
While blast effects of any BLEVE can be devastating, a flammable substance such as propane can add significantly to the danger. 405 | Bleve explosion.svg 406 | While the term BLEVE is most often used to describe the results of a container of flammable liquid rupturing due to fire, a BLEVE can occur even with a non-flammable substance such as water,[2] liquid nitrogen,[3] liquid helium or other refrigerants or cryogens, and therefore is not usually considered a type of chemical explosion. 407 | Fires[edit] 408 | BLEVEs can be caused by an external fire near the storage vessel causing heating of the contents and pressure build-up. While tanks are often designed to withstand great pressure, constant heating can cause the metal to weaken and eventually fail. If the tank is being heated in an area where there is no liquid, it may rupture faster without the liquid to absorb the heat. Gas containers are usually equipped with relief valves that vent off excess pressure, but the tank can still fail if the pressure is not released quickly enough.[1] Relief valves are sized to release pressure fast enough to prevent the pressure from increasing beyond the strength of the vessel, but not so fast as to be the cause of an explosion. An appropriately sized relief valve will allow the liquid inside to boil slowly, maintaining a constant pressure in the vessel until all the liquid has boiled and the vessel empties. 409 | If the substance involved is flammable, it is likely that the resulting cloud of the substance will ignite after the BLEVE has occurred, forming a fireball and possibly a fuel-air explosion, also termed a vapor cloud explosion (VCE). 
If the materials are toxic, a large area will be contaminated.[4] 410 | Incidents[edit] 411 | The term "BLEVE" was coined by three researchers at Factory Mutual, in the analysis of an accident there in 1957 involving a chemical reactor vessel.[5] 412 | In August 1959 the Kansas City Fire Department suffered its largest ever loss of life in the line of duty, when a 25,000 gallon (95,000 litre) gas tank exploded during a fire on Southwest Boulevard killing five firefighters. This was the first time BLEVE was used to describe a burning fuel tank.[citation needed] 413 | Later incidents included the Cheapside Street Whisky Bond Fire in Glasgow, Scotland in 1960; Feyzin, France in 1966; Crescent City, Illinois in 1970; Kingman, Arizona in 1973; a liquid nitrogen tank rupture[6] at Air Products and Chemicals and Mobay Chemical Company at New Martinsville, West Virginia on January 31, 1978 [1];Texas City, Texas in 1978; Murdock, Illinois in 1983; San Juan Ixhuatepec, Mexico City in 1984; and Toronto, Ontario in 2008. 414 | Safety measures[edit] 415 | [icon] This section requires expansion. (July 2013) 416 | Some fire mitigation measures are listed under liquefied petroleum gas. 417 | See also[edit] 418 | Boiler explosion 419 | Expansion ratio 420 | Explosive boiling or phase explosion 421 | Rapid phase transition 422 | Viareggio train derailment 423 | 2008 Toronto explosions 424 | Gas carriers 425 | Los Alfaques Disaster 426 | Lac-Mégantic derailment 427 | References[edit] 428 | ^ Jump up to: a b Kletz, Trevor (March 1990). Critical Aspects of Safety and Loss Prevention. London: Butterworth–Heinemann. pp. 43–45. ISBN 0-408-04429-2. 429 | Jump up ^ "Temperature Pressure Relief Valves on Water Heaters: test, inspect, replace, repair guide". Inspect-ny.com. Retrieved 2011-07-12. 430 | Jump up ^ Liquid nitrogen BLEVE demo 431 | Jump up ^ "Chemical Process Safety" (PDF). Retrieved 2011-07-12. 432 | Jump up ^ David F. 
Peterson, BLEVE: Facts, Risk Factors, and Fallacies, Fire Engineering magazine (2002). 433 | Jump up ^ "STATE EX REL. VAPOR CORP. v. NARICK". Supreme Court of Appeals of West Virginia. 1984-07-12. Retrieved 2014-03-16. 434 | External links[edit] 435 | Look up boiling liquid expanding vapor explosion in Wiktionary, the free dictionary. 436 | Wikimedia Commons has media related to BLEVE. 437 | BLEVE Demo on YouTube — video of a controlled BLEVE demo 438 | huge explosions on YouTube — video of propane and isobutane BLEVEs from a train derailment at Murdock, Illinois (3 September 1983) 439 | Propane BLEVE on YouTube — video of BLEVE from the Toronto propane depot fire 440 | Moscow Ring Road Accident on YouTube - Dozens of LPG tank BLEVEs after a road accident in Moscow 441 | Kingman, AZ BLEVE — An account of the 5 July 1973 explosion in Kingman, with photographs 442 | Propane Tank Explosions — Description of circumstances required to cause a propane tank BLEVE. 443 | Analysis of BLEVE Events at DOE Sites - Details physics and mathematics of BLEVEs. 444 | HID - SAFETY REPORT ASSESSMENT GUIDE: Whisky Maturation Warehouses - The liquor is aged in wooden barrels that can suffer BLEVE. 445 | Categories: ExplosivesFirefightingFireTypes of fireGas technologiesIndustrial fires and explosions`) 446 | --------------------------------------------------------------------------------