├── .gitignore
├── .travis.yml
├── LICENSE
├── Makefile
├── README.md
├── UTF-8-demo.txt
├── ascii.cpp
├── boost.cpp
├── lemire-avx2.c
├── lemire-neon.c
├── lemire-sse.c
├── lookup.c
├── main.c
├── naive.c
├── range-avx2.c
├── range-neon.c
├── range-sse.c
├── range.png
├── range2-neon.c
├── range2-sse.c
└── utf8_to_utf16
    ├── .gitignore
    ├── Makefile
    ├── iconv.c
    ├── main.c
    └── naive.c

/.gitignore:
--------------------------------------------------------------------------------
utf8
utf8-boost
ascii
*.o
*.swp

--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: c
sudo: false
arch:
  - amd64
  - arm64
os: linux
dist: bionic

script:
  - make
  - ./utf8 test

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Yibo Cai

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
CC = gcc
CXX = g++
CPPFLAGS = -g -O3 -Wall -march=native
CXXFLAGS = -std=c++11

OBJS = main.o naive.o lookup.o lemire-sse.o lemire-neon.o \
       range-sse.o range-neon.o range2-sse.o range2-neon.o \
       lemire-avx2.o range-avx2.o

utf8: ${OBJS}
	gcc $^ -o $@

utf8-boost: CFLAGS += -DBOOST
utf8-boost: ${OBJS} boost.o
	g++ $^ -o $@

.PHONY: clean
clean:
	rm -f utf8 utf8-boost ascii *.o

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
[![Build Status](https://travis-ci.com/cyb70289/utf8.svg?branch=master)](https://travis-ci.com/cyb70289/utf8)

# Fast UTF-8 validation with Range algorithm (NEON+SSE4+AVX2)

This is a new algorithm that leverages SIMD for fast UTF-8 string validation. Both **NEON** (armv8a) and **SSE4** versions are implemented. The **AVX2** implementation was contributed by [ioioioio](https://github.com/ioioioio).

Four UTF-8 validation methods are compared on both x86 and Arm platforms. Benchmark results show that the range-based algorithm is the best solution on Arm and achieves the same performance as [Lemire's approach](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/) on x86.

* Range based algorithm
  * range-neon.c: NEON version
  * range-sse.c: SSE4 version
  * range-avx2.c: AVX2 version
  * range2-neon.c, range2-sse.c: Process two blocks in one iteration
* [Lemire's SIMD implementation](https://github.com/lemire/fastvalidate-utf-8)
  * lemire-sse.c: SSE4 version
  * lemire-avx2.c: AVX2 version
  * lemire-neon.c: NEON port
* naive.c: Naive UTF-8 validation byte by byte
* lookup.c: [Lookup-table method](http://bjoern.hoehrmann.de/utf-8/decoder/dfa/)

## About the code

* Run "make" to build. Built and tested with gcc-7.3.
* Run "./utf8" to see all command line options.
* Benchmark
  * Run "./utf8 bench" to benchmark all algorithms with the [default test file](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt).
  * Run "./utf8 bench size NUM" to benchmark a specified string size.
* Run "./utf8 test" to test all algorithms with positive and negative test cases.
* To benchmark or test a specific algorithm, run something like "./utf8 bench range".

## Benchmark result (MB/s)

### Method
1. Generate a UTF-8 test buffer from the [test file](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) or from a buffer size.
1. Call validation sub-routines in a loop until 1G bytes are checked.
1. Calculate speed (MB/s) of validating UTF-8 strings.
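
The method above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the repository's `main.c`: `validate_fn`, `validate_ascii` and `bench_mbps` are hypothetical names, the validator is a trivial ASCII-only stand-in, and the byte budget is a parameter (instead of a fixed 1G) so that the loop can be shortened for a quick check.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical validator signature: returns 0 when the buffer is valid. */
typedef int (*validate_fn)(const unsigned char *data, int len);

/* Stand-in validator for this sketch only: accepts pure ASCII. */
static int validate_ascii(const unsigned char *data, int len)
{
    for (int i = 0; i < len; i++)
        if (data[i] >= 0x80)
            return -1;
    return 0;
}

/*
 * Validate the same `len`-byte buffer repeatedly until roughly
 * `total_bytes` bytes have been checked, then return the speed in MB/s.
 */
static double bench_mbps(validate_fn fn, const unsigned char *data,
                         int len, long long total_bytes)
{
    struct timeval t1, t2;
    long long loops;
    volatile int sink = 0;  /* keeps the calls from being optimised away */

    assert(len > 0);
    loops = total_bytes / len;
    assert(loops > 0);

    gettimeofday(&t1, NULL);
    for (long long i = 0; i < loops; i++)
        sink |= fn(data, len);
    gettimeofday(&t2, NULL);

    double secs = (double)(t2.tv_sec - t1.tv_sec) +
                  (double)(t2.tv_usec - t1.tv_usec) / 1e6;
    double mbytes = (double)len * (double)loops / (1024.0 * 1024.0);

    /* Very short runs may measure zero elapsed time. */
    return secs > 0.0 ? mbytes / secs : 0.0;
}
```

Passing `1024LL * 1024 * 1024` as `total_bytes` corresponds to the 1G-byte budget described in step 2.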

### NEON(armv8a)
Test case | naive | lookup | lemire | range | range2
:-------- | :---- | :----- | :----- | :---- | :-----
[UTF-demo.txt](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) | 562.25 | 412.84 | 1198.50 | 1411.72 | **1579.85**
32 bytes | 651.55 | 441.70 | 891.38 | 1003.95 | **1043.58**
33 bytes | 660.00 | 446.78 | 588.77 | 1009.31 | **1048.12**
129 bytes | 771.89 | 402.55 | 938.07 | 1283.77 | **1401.76**
1K bytes | 811.92 | 411.58 | 1188.96 | 1398.15 | **1560.23**
8K bytes | 812.25 | 412.74 | 1198.90 | 1412.18 | **1580.65**
64K bytes | 817.35 | 412.24 | 1200.20 | 1415.11 | **1583.86**
1M bytes | 815.70 | 411.93 | 1200.93 | 1415.65 | **1585.40**

### SSE4(E5-2650)
Test case | naive | lookup | lemire | range | range2
:-------- | :---- | :----- | :----- | :---- | :-----
[UTF-demo.txt](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) | 753.70 | 310.41 | 3954.74 | 3945.60 | **3986.13**
32 bytes | 1135.76 | 364.07 | **2890.52** | 2351.81 | 2173.02
33 bytes | 1161.85 | 376.29 | 1352.95 | **2239.55** | 2041.43
129 bytes | 1161.22 | 322.47 | 2742.49 | **3315.33** | 3249.35
1K bytes | 1310.95 | 310.72 | 3755.88 | 3781.23 | **3874.17**
8K bytes | 1348.32 | 307.93 | 3860.71 | 3922.81 | **3968.93**
64K bytes | 1301.34 | 308.39 | 3935.15 | 3973.50 | **3983.44**
1M bytes | 1279.78 | 309.06 | 3923.51 | 3953.00 | **3960.49**

## Range algorithm analysis

Basic idea:
* Load 16 bytes
* Leverage SIMD to calculate value range for each byte efficiently
* Validate 16 bytes at once

### UTF-8 coding format

http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf, page 94

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
:---------- | :--------- | :---------- | :--------- | :---------- |
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | ***A0***..BF| 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..***9F***| 80..BF | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | ***90***..BF| 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..***8F***| 80..BF | 80..BF |

To summarise UTF-8 encoding:
* Depending on First Byte, one legal character can be 1, 2, 3, 4 bytes
  * For First Byte within C0..DF, character length = 2
  * For First Byte within E0..EF, character length = 3
  * For First Byte within F0..F4, character length = 4
* C0, C1, F5..FF are not allowed
* Second, Third, Fourth Bytes must lie in 80..BF.
* There are four **special cases** for Second Byte, shown ***bold italic*** in the table above.

### Range table

The range table maps each range index (0 to 15) to the minimum and maximum byte values allowed. Our task is to inspect the input string, assign the correct range index to each byte, and then validate each byte against its range.
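
As an illustration, the mapping listed in the range table can be held as two 16-entry arrays indexed by range index. The sketch below uses the unsigned (NEON-style) encoding for the illegal entries 9..15, where min = FF and max = 00 so that no byte can match. The names `range_min`, `range_max` and `byte_in_range` are assumptions made for this example and are not taken from the repository sources; the SIMD versions perform the same comparison on 16 bytes at once.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum allowed byte value for each range index 0..15. */
static const uint8_t range_min[16] = {
    0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
    0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
};

/* Maximum allowed byte value for each range index 0..15. */
static const uint8_t range_max[16] = {
    0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
    0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Scalar check of one byte against its range index (hypothetical helper). */
static int byte_in_range(uint8_t byte, unsigned index)
{
    assert(index < 16);
    return byte >= range_min[index] && byte <= range_max[index];
}
```

For example, `byte_in_range(0xA0, 4)` accepts A0 as the Second Byte after E0, while `byte_in_range(0x9F, 4)` rejects 9F in the same position.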

Index | Min | Max | Byte type
:---- | :-- | :-- | :--------
0 | 00 | 7F | First Byte, ASCII
1,2,3 | 80 | BF | Second, Third, Fourth Bytes
4 | A0 | BF | Second Byte after E0
5 | 80 | 9F | Second Byte after ED
6 | 90 | BF | Second Byte after F0
7 | 80 | 8F | Second Byte after F4
8 | C2 | F4 | First Byte, non-ASCII
9..15(NEON) | FF | 00 | Illegal: unsigned char >= 255 && unsigned char <= 0
9..15(SSE) | 7F | 80 | Illegal: signed char >= 127 && signed char <= -128

### Calculate byte ranges (ignore special cases)

Ignoring the four special cases (E0, ED, F0, F4), how should we set the range index for each byte?

* Set range index to 0(00..7F) for all bytes by default
* Find non-ASCII First Byte (C0..FF), set their range index to 8(C2..F4)
* For First Byte within C0..DF, set next byte's range index to 1(80..BF)
* For First Byte within E0..EF, set next two bytes' range index to 2,1(80..BF) in sequence
* For First Byte within F0..FF, set next three bytes' range index to 3,2,1(80..BF) in sequence

To implement the above operations efficiently with SIMD:
* For 16 input bytes, use a lookup table to map C0..DF to 1, E0..EF to 2, F0..FF to 3, others to 0. Save to first_len.
* Map C0..FF to 8, we get range indices for First Byte.
* Shift first_len one byte, we get range indices for Second Byte.
* Saturating-subtract one from first_len (3->2, 2->1, 1->0, 0->0), then shift two bytes, we get range indices for Third Byte.
* Saturating-subtract two from first_len (3->1, 2->0, 1->0, 0->0), then shift three bytes, we get range indices for Fourth Byte.

Example (assume no previous data)

Input | F1 | 80 | 80 | 80 | 80 | C2 | 80 | 80 | ...
:---- | :- | :- | :- | :- | :- | :- | :- | :- | :--
*first_len* |*3* |*0* |*0* |*0* |*0* |*1* |*0* |*0* |*...*
First Byte | 8 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | ...
Second Byte | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | ...
Third Byte | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ...
Fourth Byte | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ...
Range index | 8 | 3 | 2 | 1 | 0 | 8 | 1 | 0 | ...

```c
Range_index = First_Byte | Second_Byte | Third_Byte | Fourth_Byte
```

#### Error handling

* C0, C1, F5..FF are not included in the range table and will always be detected.
* Illegal 80..BF will have range index 0(00..7F) and be detected.
* Based on the First Byte, the corresponding Second, Third and Fourth Bytes get range index 1/2/3, which ensures they lie in 80..BF.
* If non-ASCII First Bytes overlap, the above algorithm will set the range index of the latter First Byte to 9, 10 or 11, which are illegal ranges. E.g., Input = F1 80 C2 90 --> Range index = 8 3 10 1, where 10 indicates an error. See the table below.

Overlapped non-ASCII First Byte

Input | F1 | 80 | C2 | 90
:---- | :- | :- | :- | :-
*first_len* |*3* |*0* |*1* |*0*
First Byte | 8 | 0 | 8 | 0
Second Byte | 0 | 3 | 0 | 1
Third Byte | 0 | 0 | 2 | 0
Fourth Byte | 0 | 0 | 0 | 1
Range index | 8 | 3 |***10***| 1

### Adjust Second Byte range for special cases

Range index adjustment for four special cases

First Byte | Second Byte | Before adjustment | Correct index | Adjustment |
:--------- | :---------- | :---------------- | :------------ | :---------
E0 | A0..BF | 2 | 4 | **2**
ED | 80..9F | 2 | 5 | **3**
F0 | 90..BF | 3 | 6 | **3**
F4 | 80..8F | 3 | 7 | **4**

Range index adjustment can be reduced to the following problem:

***Given 16 bytes, replace E0 with 2, ED with 3, F0 with 3, F4 with 4, others with 0.***

A naive SIMD approach:
1. Compare 16 bytes with E0, get the mask for each byte (FF if equal, 00 otherwise)
1. And the mask with 2 to get adjustment for E0
1. Repeat step 1,2 for ED,F0,F4

At least **eight** operations are required for the naive approach.

Observing that the special bytes (E0, ED, F0, F4) are close to each other, we can do much better using a lookup table.

#### NEON

NEON ```tbl``` instruction is very convenient for table lookup:
* Table can be up to 16x4 bytes in size
* Return zero if index is out of range

Leveraging these features, we can solve the problem with as few as **two** operations:
* Precreate a 16x2 lookup table, where table[0]=2, table[13]=3, table[16]=3, table[20]=4, table[others]=0.
* Subtract E0 from input bytes (E0 -> 0, ED -> 13, F0 -> 16, F4 -> 20).
* Use the subtracted byte as index of the lookup table and get the range adjustment directly.
  * For indices less than 32, we get zero or the required adjustment value per input byte
  * For out of bound indices, we get zero per ```tbl``` behaviour

#### SSE

SSE ```pshufb``` instruction is not as friendly as NEON ```tbl``` in this case:
* Table can only be 16 bytes in size
* Out of bound indices are handled this way:
  * If 7-th bit of index is 0, least four bits are used as index (E.g, index 0x73 returns 3rd element)
  * If 7-th bit of index is 1, return 0 (E.g, index 0x83 returns 0)

We can still leverage these features to solve the problem in **five** operations:
* Precreate two tables:
  * table_df[1] = 2, table_df[14] = 3, table_df[others] = 0
  * table_ef[1] = 3, table_ef[5] = 4, table_ef[others] = 0
* Subtract EF from input bytes (E0 -> 241, ED -> 254, F0 -> 1, F4 -> 5) to get the temporary indices
* Get range index for E0,ED
  * Saturating-subtract 240 from temporary indices (E0 -> 1, ED -> 14, all values below 240 become 0)
  * Use subtracted indices to look up table_df, get the correct adjustment
* Get range index for F0,F4
  * Saturating-add 112(0x70) to temporary indices (F0 -> 0x71, F4 -> 0x75, all values of 16 or more become 128 or larger (7-th bit set))
  * Use added indices to look up table_ef, get the correct adjustment (index 0x71,0x75 returns 1st,5th elements, per ```pshufb``` behaviour)

#### Error handling

* For overlapped non-ASCII First Byte, range index before adjustment is 9,10,11. After adjustment (adds 2,3,4 or 0), the range index will be 9 to 15, which is still illegal in the range table. So the error will be detected.

### Handling remaining bytes

For remaining input of less than 16 bytes, we fall back to the naive byte-by-byte approach to validate them, which is actually faster than SIMD processing.
* Look back into the last 16-byte buffer to find the First Byte. At most three bytes need to be looked back. Otherwise we either happen to be at a character boundary, or there are errors we have already detected.
* Validate the string byte by byte starting from the First Byte.

## Tests

It's necessary to design test cases that cover as many corner cases as possible.

### Positive cases

1. Prepare correct characters
2. Validate correct characters
3. Validate long strings
   * Round concatenate characters starting from first character to 1024 bytes
   * Validate 1024 bytes string
   * Shift 1 byte, validate 1025 bytes string
   * Shift 2 bytes, validate 1026 bytes string
   * ...
   * Shift 16 bytes, validate 1040 bytes string
4. Repeat step 3, test buffer starting from second character
5. Repeat step 3, test buffer starting from third character
6. ...

### Negative cases

1. Prepare bad characters and bad strings
   * Bad character
   * Bad character crossing a 16-byte boundary
   * Bad character crossing the boundary between the last 16 bytes and the remaining bytes
2. Test long strings
   * Prepare correct long strings same as positive cases
   * Append bad characters
   * Shift one byte for each iteration
   * Validate each shift

## Code breakdown

The table below shows how a 16-byte input is processed step by step. See [range-neon.c](range-neon.c) for the corresponding code.

![Range based UTF-8 validation algorithm](https://raw.githubusercontent.com/cyb70289/utf8/master/range.png)

--------------------------------------------------------------------------------
/UTF-8-demo.txt:
--------------------------------------------------------------------------------
1 | 2 | UTF-8 encoded sample plain-text file 3 | ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾ 4 | 5 | Markus Kuhn [ˈmaʳkʊs kuːn] — 2002-07-25 CC BY 6 | 7 | 8 | The ASCII compatible UTF-8 encoding used in this plain-text file 9 | is defined in Unicode, ISO 10646-1, and RFC 2279. 10 | 11 | 12 | Using Unicode/UTF-8, you can write in emails and source code things such as 13 | 14 | Mathematics and sciences: 15 | 16 | ∮ E⋅da = Q, n → ∞, ∑ f(i) = ∏ g(i), ⎧⎡⎛┌─────┐⎞⎤⎫ 17 | ⎪⎢⎜│a²+b³ ⎟⎥⎪ 18 | ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β), ⎪⎢⎜│───── ⎟⎥⎪ 19 | ⎪⎢⎜⎷ c₈ ⎟⎥⎪ 20 | ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⎨⎢⎜ ⎟⎥⎬ 21 | ⎪⎢⎜ ∞ ⎟⎥⎪ 22 | ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (⟦A⟧ ⇔ ⟪B⟫), ⎪⎢⎜ ⎲ ⎟⎥⎪ 23 | ⎪⎢⎜ ⎳aⁱ-bⁱ⎟⎥⎪ 24 | 2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm ⎩⎣⎝i=1 ⎠⎦⎭ 25 | 26 | Linguistics and dictionaries: 27 | 28 | ði ıntəˈnæʃənəl fəˈnɛtık əsoʊsiˈeıʃn 29 | Y [ˈʏpsilɔn], Yen [jɛn], Yoga [ˈjoːgɑ] 30 | 31 | APL: 32 | 33 | ((V⍳V)=⍳⍴V)/V←,V ⌷←⍳→⍴∆∇⊃‾⍎⍕⌈ 34 | 35 | Nicer typography in plain text files: 36 | 37 | ╔══════════════════════════════════════════╗ 38 | ║ ║ 39 | ║ • ‘single’ and “double” quotes ║ 40 | ║ ║ 41 | ║ • Curly apostrophes: “We’ve been here” ║ 42 | ║ ║ 43 | ║ • Latin-1 apostrophe and accents: '´` ║ 44 | ║ ║ 45 | ║ • ‚deutsche‘ „Anführungszeichen“ ║ 46 | ║ ║ 47 | ║ • †, ‡, ‰, •, 3–4, —, −5/+5, ™, … ║ 48 | ║ ║ 49 | ║ • ASCII safety test: 1lI|, 0OD, 8B ║ 50 | ║ ╭─────────╮ ║
51 | ║ • the euro symbol: │ 14.95 € │ ║ 52 | ║ ╰─────────╯ ║ 53 | ╚══════════════════════════════════════════╝ 54 | 55 | Combining characters: 56 | 57 | STARGΛ̊TE SG-1, a = v̇ = r̈, a⃑ ⊥ b⃑ 58 | 59 | Greek (in Polytonic): 60 | 61 | The Greek anthem: 62 | 63 | Σὲ γνωρίζω ἀπὸ τὴν κόψη 64 | τοῦ σπαθιοῦ τὴν τρομερή, 65 | σὲ γνωρίζω ἀπὸ τὴν ὄψη 66 | ποὺ μὲ βία μετράει τὴ γῆ. 67 | 68 | ᾿Απ᾿ τὰ κόκκαλα βγαλμένη 69 | τῶν ῾Ελλήνων τὰ ἱερά 70 | καὶ σὰν πρῶτα ἀνδρειωμένη 71 | χαῖρε, ὦ χαῖρε, ᾿Ελευθεριά! 72 | 73 | From a speech of Demosthenes in the 4th century BC: 74 | 75 | Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι, 76 | ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς 77 | λόγους οὓς ἀκούω· τοὺς μὲν γὰρ λόγους περὶ τοῦ 78 | τιμωρήσασθαι Φίλιππον ὁρῶ γιγνομένους, τὰ δὲ πράγματ᾿ 79 | εἰς τοῦτο προήκοντα, ὥσθ᾿ ὅπως μὴ πεισόμεθ᾿ αὐτοὶ 80 | πρότερον κακῶς σκέψασθαι δέον. οὐδέν οὖν ἄλλο μοι δοκοῦσιν 81 | οἱ τὰ τοιαῦτα λέγοντες ἢ τὴν ὑπόθεσιν, περὶ ἧς βουλεύεσθαι, 82 | οὐχὶ τὴν οὖσαν παριστάντες ὑμῖν ἁμαρτάνειν. ἐγὼ δέ, ὅτι μέν 83 | ποτ᾿ ἐξῆν τῇ πόλει καὶ τὰ αὑτῆς ἔχειν ἀσφαλῶς καὶ Φίλιππον 84 | τιμωρήσασθαι, καὶ μάλ᾿ ἀκριβῶς οἶδα· ἐπ᾿ ἐμοῦ γάρ, οὐ πάλαι 85 | γέγονεν ταῦτ᾿ ἀμφότερα· νῦν μέντοι πέπεισμαι τοῦθ᾿ ἱκανὸν 86 | προλαβεῖν ἡμῖν εἶναι τὴν πρώτην, ὅπως τοὺς συμμάχους 87 | σώσομεν. ἐὰν γὰρ τοῦτο βεβαίως ὑπάρξῃ, τότε καὶ περὶ τοῦ 88 | τίνα τιμωρήσεταί τις καὶ ὃν τρόπον ἐξέσται σκοπεῖν· πρὶν δὲ 89 | τὴν ἀρχὴν ὀρθῶς ὑποθέσθαι, μάταιον ἡγοῦμαι περὶ τῆς 90 | τελευτῆς ὁντινοῦν ποιεῖσθαι λόγον. 91 | 92 | Δημοσθένους, Γ´ ᾿Ολυνθιακὸς 93 | 94 | Georgian: 95 | 96 | From a Unicode conference invitation: 97 | 98 | გთხოვთ ახლავე გაიაროთ რეგისტრაცია Unicode-ის მეათე საერთაშორისო 99 | კონფერენციაზე დასასწრებად, რომელიც გაიმართება 10-12 მარტს, 100 | ქ. მაინცში, გერმანიაში. 
კონფერენცია შეჰკრებს ერთად მსოფლიოს 101 | ექსპერტებს ისეთ დარგებში როგორიცაა ინტერნეტი და Unicode-ი, 102 | ინტერნაციონალიზაცია და ლოკალიზაცია, Unicode-ის გამოყენება 103 | ოპერაციულ სისტემებსა, და გამოყენებით პროგრამებში, შრიფტებში, 104 | ტექსტების დამუშავებასა და მრავალენოვან კომპიუტერულ სისტემებში. 105 | 106 | Russian: 107 | 108 | From a Unicode conference invitation: 109 | 110 | Зарегистрируйтесь сейчас на Десятую Международную Конференцию по 111 | Unicode, которая состоится 10-12 марта 1997 года в Майнце в Германии. 112 | Конференция соберет широкий круг экспертов по вопросам глобального 113 | Интернета и Unicode, локализации и интернационализации, воплощению и 114 | применению Unicode в различных операционных системах и программных 115 | приложениях, шрифтах, верстке и многоязычных компьютерных системах. 116 | 117 | Thai (UCS Level 2): 118 | 119 | Excerpt from a poetry on The Romance of The Three Kingdoms (a Chinese 120 | classic 'San Gua'): 121 | 122 | [----------------------------|------------------------] 123 | ๏ แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช พระปกเกศกองบู๊กู้ขึ้นใหม่ 124 | สิบสองกษัตริย์ก่อนหน้าแลถัดไป สององค์ไซร้โง่เขลาเบาปัญญา 125 | ทรงนับถือขันทีเป็นที่พึ่ง บ้านเมืองจึงวิปริตเป็นนักหนา 126 | โฮจิ๋นเรียกทัพทั่วหัวเมืองมา หมายจะฆ่ามดชั่วตัวสำคัญ 127 | เหมือนขับไสไล่เสือจากเคหา รับหมาป่าเข้ามาเลยอาสัญ 128 | ฝ่ายอ้องอุ้นยุแยกให้แตกกัน ใช้สาวนั้นเป็นชนวนชื่นชวนใจ 129 | พลันลิฉุยกุยกีกลับก่อเหตุ ช่างอาเพศจริงหนาฟ้าร้องไห้ 130 | ต้องรบราฆ่าฟันจนบรรลัย ฤๅหาใครค้ำชูกู้บรรลังก์ ฯ 131 | 132 | (The above is a two-column text. If combining characters are handled 133 | correctly, the lines of the second column should be aligned with the 134 | | character above.) 
135 | 136 | Ethiopian: 137 | 138 | Proverbs in the Amharic language: 139 | 140 | ሰማይ አይታረስ ንጉሥ አይከሰስ። 141 | ብላ ካለኝ እንደአባቴ በቆመጠኝ። 142 | ጌጥ ያለቤቱ ቁምጥና ነው። 143 | ደሀ በሕልሙ ቅቤ ባይጠጣ ንጣት በገደለው። 144 | የአፍ ወለምታ በቅቤ አይታሽም። 145 | አይጥ በበላ ዳዋ ተመታ። 146 | ሲተረጉሙ ይደረግሙ። 147 | ቀስ በቀስ፥ ዕንቁላል በእግሩ ይሄዳል። 148 | ድር ቢያብር አንበሳ ያስር። 149 | ሰው እንደቤቱ እንጅ እንደ ጉረቤቱ አይተዳደርም። 150 | እግዜር የከፈተውን ጉሮሮ ሳይዘጋው አይድርም። 151 | የጎረቤት ሌባ፥ ቢያዩት ይስቅ ባያዩት ያጠልቅ። 152 | ሥራ ከመፍታት ልጄን ላፋታት። 153 | ዓባይ ማደሪያ የለው፥ ግንድ ይዞ ይዞራል። 154 | የእስላም አገሩ መካ የአሞራ አገሩ ዋርካ። 155 | ተንጋሎ ቢተፉ ተመልሶ ባፉ። 156 | ወዳጅህ ማር ቢሆን ጨርስህ አትላሰው። 157 | እግርህን በፍራሽህ ልክ ዘርጋ። 158 | 159 | Runes: 160 | 161 | ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ 162 | 163 | (Old English, which transcribed into Latin reads 'He cwaeth that he 164 | bude thaem lande northweardum with tha Westsae.' and means 'He said 165 | that he lived in the northern land near the Western Sea.') 166 | 167 | Braille: 168 | 169 | ⡌⠁⠧⠑ ⠼⠁⠒ ⡍⠜⠇⠑⠹⠰⠎ ⡣⠕⠌ 170 | 171 | ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠙⠑⠁⠙⠒ ⠞⠕ ⠃⠑⠛⠔ ⠺⠊⠹⠲ ⡹⠻⠑ ⠊⠎ ⠝⠕ ⠙⠳⠃⠞ 172 | ⠱⠁⠞⠑⠧⠻ ⠁⠃⠳⠞ ⠹⠁⠞⠲ ⡹⠑ ⠗⠑⠛⠊⠌⠻ ⠕⠋ ⠙⠊⠎ ⠃⠥⠗⠊⠁⠇ ⠺⠁⠎ 173 | ⠎⠊⠛⠝⠫ ⠃⠹ ⠹⠑ ⠊⠇⠻⠛⠹⠍⠁⠝⠂ ⠹⠑ ⠊⠇⠻⠅⠂ ⠹⠑ ⠥⠝⠙⠻⠞⠁⠅⠻⠂ 174 | ⠁⠝⠙ ⠹⠑ ⠡⠊⠑⠋ ⠍⠳⠗⠝⠻⠲ ⡎⠊⠗⠕⠕⠛⠑ ⠎⠊⠛⠝⠫ ⠊⠞⠲ ⡁⠝⠙ 175 | ⡎⠊⠗⠕⠕⠛⠑⠰⠎ ⠝⠁⠍⠑ ⠺⠁⠎ ⠛⠕⠕⠙ ⠥⠏⠕⠝ ⠰⡡⠁⠝⠛⠑⠂ ⠋⠕⠗ ⠁⠝⠹⠹⠔⠛ ⠙⠑ 176 | ⠡⠕⠎⠑ ⠞⠕ ⠏⠥⠞ ⠙⠊⠎ ⠙⠁⠝⠙ ⠞⠕⠲ 177 | 178 | ⡕⠇⠙ ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠁⠎ ⠙⠑⠁⠙ ⠁⠎ ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲ 179 | 180 | ⡍⠔⠙⠖ ⡊ ⠙⠕⠝⠰⠞ ⠍⠑⠁⠝ ⠞⠕ ⠎⠁⠹ ⠹⠁⠞ ⡊ ⠅⠝⠪⠂ ⠕⠋ ⠍⠹ 181 | ⠪⠝ ⠅⠝⠪⠇⠫⠛⠑⠂ ⠱⠁⠞ ⠹⠻⠑ ⠊⠎ ⠏⠜⠞⠊⠊⠥⠇⠜⠇⠹ ⠙⠑⠁⠙ ⠁⠃⠳⠞ 182 | ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲ ⡊ ⠍⠊⠣⠞ ⠙⠁⠧⠑ ⠃⠑⠲ ⠔⠊⠇⠔⠫⠂ ⠍⠹⠎⠑⠇⠋⠂ ⠞⠕ 183 | ⠗⠑⠛⠜⠙ ⠁ ⠊⠕⠋⠋⠔⠤⠝⠁⠊⠇ ⠁⠎ ⠹⠑ ⠙⠑⠁⠙⠑⠌ ⠏⠊⠑⠊⠑ ⠕⠋ ⠊⠗⠕⠝⠍⠕⠝⠛⠻⠹ 184 | ⠔ ⠹⠑ ⠞⠗⠁⠙⠑⠲ ⡃⠥⠞ ⠹⠑ ⠺⠊⠎⠙⠕⠍ ⠕⠋ ⠳⠗ ⠁⠝⠊⠑⠌⠕⠗⠎ 185 | ⠊⠎ ⠔ ⠹⠑ ⠎⠊⠍⠊⠇⠑⠆ ⠁⠝⠙ ⠍⠹ ⠥⠝⠙⠁⠇⠇⠪⠫ ⠙⠁⠝⠙⠎ 186 | ⠩⠁⠇⠇ ⠝⠕⠞ ⠙⠊⠌⠥⠗⠃ ⠊⠞⠂ ⠕⠗ ⠹⠑ ⡊⠳⠝⠞⠗⠹⠰⠎ ⠙⠕⠝⠑ ⠋⠕⠗⠲ ⡹⠳ 187 | ⠺⠊⠇⠇ ⠹⠻⠑⠋⠕⠗⠑ ⠏⠻⠍⠊⠞ ⠍⠑ ⠞⠕ ⠗⠑⠏⠑⠁⠞⠂ ⠑⠍⠏⠙⠁⠞⠊⠊⠁⠇⠇⠹⠂ ⠹⠁⠞ 188 | ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠁⠎ ⠙⠑⠁⠙ ⠁⠎ ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲ 189 | 190 | (The first couple of paragraphs of "A Christmas Carol" by Dickens) 191 | 192 | Compact font selection example text: 193 | 194 | ABCDEFGHIJKLMNOPQRSTUVWXYZ /0123456789 195 | abcdefghijklmnopqrstuvwxyz £©µÀÆÖÞßéöÿ 196 | –—‘“”„†•…‰™œŠŸž€ 
ΑΒΓΔΩαβγδω АБВГДабвгд 197 | ∀∂∈ℝ∧∪≡∞ ↑↗↨↻⇣ ┐┼╔╘░►☺♀ fi�⑀₂ἠḂӥẄɐː⍎אԱა 198 | 199 | Greetings in various languages: 200 | 201 | Hello world, Καλημέρα κόσμε, コンニチハ 202 | 203 | Box drawing alignment tests: █ 204 | ▉ 205 | ╔══╦══╗ ┌──┬──┐ ╭──┬──╮ ╭──┬──╮ ┏━━┳━━┓ ┎┒┏┑ ╷ ╻ ┏┯┓ ┌┰┐ ▊ ╱╲╱╲╳╳╳ 206 | ║┌─╨─┐║ │╔═╧═╗│ │╒═╪═╕│ │╓─╁─╖│ ┃┌─╂─┐┃ ┗╃╄┙ ╶┼╴╺╋╸┠┼┨ ┝╋┥ ▋ ╲╱╲╱╳╳╳ 207 | ║│╲ ╱│║ │║ ║│ ││ │ ││ │║ ┃ ║│ ┃│ ╿ │┃ ┍╅╆┓ ╵ ╹ ┗┷┛ └┸┘ ▌ ╱╲╱╲╳╳╳ 208 | ╠╡ ╳ ╞╣ ├╢ ╟┤ ├┼─┼─┼┤ ├╫─╂─╫┤ ┣┿╾┼╼┿┫ ┕┛┖┚ ┌┄┄┐ ╎ ┏┅┅┓ ┋ ▍ ╲╱╲╱╳╳╳ 209 | ║│╱ ╲│║ │║ ║│ ││ │ ││ │║ ┃ ║│ ┃│ ╽ │┃ ░░▒▒▓▓██ ┊ ┆ ╎ ╏ ┇ ┋ ▎ 210 | ║└─╥─┘║ │╚═╤═╝│ │╘═╪═╛│ │╙─╀─╜│ ┃└─╂─┘┃ ░░▒▒▓▓██ ┊ ┆ ╎ ╏ ┇ ┋ ▏ 211 | ╚══╩══╝ └──┴──┘ ╰──┴──╯ ╰──┴──╯ ┗━━┻━━┛ ▗▄▖▛▀▜ └╌╌┘ ╎ ┗╍╍┛ ┋ ▁▂▃▄▅▆▇█ 212 | ▝▀▘▙▄▟ 213 |
--------------------------------------------------------------------------------
/ascii.cpp:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <sys/time.h>
#include <algorithm>

#include <vector>

static inline int ascii_std(const uint8_t *data, int len)
{
    return !std::any_of(data, data+len, [] (int8_t b) { return b < 0; });
}

static inline int ascii_u64(const uint8_t *data, int len)
{
    uint8_t orall = 0;

    if (len >= 16) {

        uint64_t or1 = 0, or2 = 0;
        const uint8_t *data2 = data+8;

        do {
            or1 |= *(const uint64_t *)data;
            or2 |= *(const uint64_t *)data2;
            data += 16;
            data2 += 16;
            len -= 16;
        } while (len >= 16);

        /*
         * Idea from Benny Halevy
         * - 7-th bit set   ==> orall = !(non-zero) - 1 = 0 - 1 = 0xFF
         * - 7-th bit clear ==> orall = !0 - 1          = 1 - 1 = 0x00
         */
        orall = !((or1 | or2) & 0x8080808080808080ULL) - 1;
    }

    while (len--)
        orall |= *data++;

    return orall < 0x80;
}

#if defined(__x86_64__)
#include <x86intrin.h>

static inline int ascii_simd(const uint8_t *data, int len)
{
    if (len >= 32) {
        const uint8_t *data2 = data+16;

        __m128i or1 = _mm_set1_epi8(0), or2 = or1;

        while (len >= 32) {
            __m128i input1 = _mm_loadu_si128((const __m128i *)data);
            __m128i input2 = _mm_loadu_si128((const __m128i *)data2);

            or1 = _mm_or_si128(or1, input1);
            or2 = _mm_or_si128(or2, input2);

            data += 32;
            data2 += 32;
            len -= 32;
        }

        or1 = _mm_or_si128(or1, or2);
        if (_mm_movemask_epi8(_mm_cmplt_epi8(or1, _mm_set1_epi8(0))))
            return 0;
    }

    return ascii_u64(data, len);
}

#elif defined(__aarch64__)
#include <arm_neon.h>

static inline int ascii_simd(const uint8_t *data, int len)
{
    if (len >= 32) {
        const uint8_t *data2 = data+16;

        uint8x16_t or1 = vdupq_n_u8(0), or2 = or1;

        while (len >= 32) {
            const uint8x16_t input1 = vld1q_u8(data);
            const uint8x16_t input2 = vld1q_u8(data2);

            or1 = vorrq_u8(or1, input1);
            or2 = vorrq_u8(or2, input2);

            data += 32;
            data2 += 32;
            len -= 32;
        }

        or1 = vorrq_u8(or1, or2);
        if (vmaxvq_u8(or1) >= 0x80)
            return 0;
    }

    return ascii_u64(data, len);
}

#endif

struct ftab {
    const char *name;
    int (*func)(const uint8_t *data, int len);
};

static const std::vector<ftab> _f = {
    {
        .name = "std",
        .func = ascii_std,
    }, {
        .name = "u64",
        .func = ascii_u64,
    }, {
        .name = "simd",
        .func = ascii_simd,
    },
};

static void load_test_buf(uint8_t *data, int len)
{
    uint8_t v = 0;

    for (int i = 0; i < len; ++i) {
        data[i] = v++;
        v &= 0x7F;
    }
}

static void bench(const struct ftab &f, const uint8_t *data, int len)
{
    const int loops = 1024*1024*1024/len;
    int ret = 1;
    double time_aligned, time_unaligned, size;
    struct timeval tv1, tv2;

    fprintf(stderr, "bench %s (%d bytes)... ", f.name, len);

    /* aligned */
    gettimeofday(&tv1, 0);
    for (int i = 0; i < loops; ++i)
        ret &= f.func(data, len);
    gettimeofday(&tv2, 0);
    time_aligned = tv2.tv_usec - tv1.tv_usec;
    time_aligned = time_aligned / 1000000 + tv2.tv_sec - tv1.tv_sec;

    /* unaligned */
    gettimeofday(&tv1, 0);
    for (int i = 0; i < loops; ++i)
        ret &= f.func(data+1, len);
    gettimeofday(&tv2, 0);
    time_unaligned = tv2.tv_usec - tv1.tv_usec;
    time_unaligned = time_unaligned / 1000000 + tv2.tv_sec - tv1.tv_sec;

    printf("%s ", ret?"pass":"FAIL");

    size = ((double)len * loops) / (1024*1024);
    printf("%.0f/%.0f MB/s\n", size / time_aligned, size / time_unaligned);
}

static void test(const struct ftab &f, uint8_t *data, int len)
{
    int error = 0;

    fprintf(stderr, "test %s (%d bytes)... ", f.name, len);

    /* positive */
    error |= !f.func(data, len);

    /* negative */
    if (len < 100*1024) {
        for (int i = 0; i < len; ++i) {
            data[i] += 0x80;
            error |= f.func(data, len);
            data[i] -= 0x80;
        }
    }

    printf("%s\n", error ? "FAIL" : "pass");
}

/* ./ascii [test|bench] [alg] */
int main(int argc, const char *argv[])
{
    int do_test = 1, do_bench = 1;
    const char *alg = NULL;

    if (argc > 1) {
        do_bench &= !!strcmp(argv[1], "test");
        do_test &= !!strcmp(argv[1], "bench");
    }

    if (do_bench && argc > 2)
        alg = argv[2];

    const std::vector<int> size = { 9, 16+1, 32-1, 128+1, 1024+15,
                                    16*1024+1, 64*1024+15, 1024*1024 };

    int max_size = *std::max_element(size.begin(), size.end());
    uint8_t *_data = new uint8_t[max_size+1];
    assert(((uintptr_t)_data & 7) == 0);
    uint8_t *data = _data+1;    /* Unalign buffer address */

    _data[0] = 0;
    load_test_buf(data, max_size);

    if (do_test) {
        printf("==================== Test ====================\n");
        for (int sz : size) {
            for (auto &f : _f) {
                test(f, data, sz);
            }
        }
    }

    if (do_bench) {
        printf("==================== Bench ====================\n");
        for (int sz : size) {
            for (auto &f : _f) {
                if (!alg || strcmp(alg, f.name) == 0)
                    bench(f, _data, sz);
            }
            printf("-----------------------------------------------\n");
        }
    }

    /* Allocated with new[], so it must be released with delete[] */
    delete[] _data;
    return 0;
}

--------------------------------------------------------------------------------
/boost.cpp:
--------------------------------------------------------------------------------
#include <boost/locale.hpp>

using namespace std;

/* Return 0 on success, -1 on error */
extern "C" int utf8_boost(const unsigned char *data, int len)
{
    try {
        boost::locale::conv::utf_to_utf<char>(data, data+len,
                                              boost::locale::conv::stop);
    } catch (const boost::locale::conv::conversion_error& ex) {
        return -1;
    }

    return 0;
}

--------------------------------------------------------------------------------
/lemire-avx2.c:
-------------------------------------------------------------------------------- 1 | // Adapted from https://github.com/lemire/fastvalidate-utf-8 2 | 3 | #ifdef __AVX2__ 4 | 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | 11 | /* 12 | * legal utf-8 byte sequence 13 | * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94 14 | * 15 | * Code Points 1st 2s 3s 4s 16 | * U+0000..U+007F 00..7F 17 | * U+0080..U+07FF C2..DF 80..BF 18 | * U+0800..U+0FFF E0 A0..BF 80..BF 19 | * U+1000..U+CFFF E1..EC 80..BF 80..BF 20 | * U+D000..U+D7FF ED 80..9F 80..BF 21 | * U+E000..U+FFFF EE..EF 80..BF 80..BF 22 | * U+10000..U+3FFFF F0 90..BF 80..BF 80..BF 23 | * U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF 24 | * U+100000..U+10FFFF F4 80..8F 80..BF 80..BF 25 | * 26 | */ 27 | 28 | #if 0 29 | static void print256(const char *s, const __m256i v256) 30 | { 31 | const unsigned char *v8 = (const unsigned char *)&v256; 32 | if (s) 33 | printf("%s:\t", s); 34 | for (int i = 0; i < 32; i++) 35 | printf("%02x ", v8[i]); 36 | printf("\n"); 37 | } 38 | #endif 39 | 40 | static inline __m256i push_last_byte_of_a_to_b(__m256i a, __m256i b) { 41 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 15); 42 | } 43 | 44 | static inline __m256i push_last_2bytes_of_a_to_b(__m256i a, __m256i b) { 45 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 14); 46 | } 47 | 48 | // all byte values must be no larger than 0xF4 49 | static inline void avxcheckSmallerThan0xF4(__m256i current_bytes, 50 | __m256i *has_error) { 51 | // unsigned, saturates to 0 below max 52 | *has_error = _mm256_or_si256( 53 | *has_error, _mm256_subs_epu8(current_bytes, _mm256_set1_epi8(0xF4))); 54 | } 55 | 56 | static inline __m256i avxcontinuationLengths(__m256i high_nibbles) { 57 | return _mm256_shuffle_epi8( 58 | _mm256_setr_epi8(1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII) 59 | 0, 0, 0, 0, // 10xx (continuation) 60 | 2, 2, // 110x 61 | 3, // 1110 62 | 4, // 1111, next should 
                          // be 0 (not checked here)
                       1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
                       0, 0, 0, 0,             // 10xx (continuation)
                       2, 2,                   // 110x
                       3,                      // 1110
                       4 // 1111, next should be 0 (not checked here)
                       ),
      high_nibbles);
}

static inline __m256i avxcarryContinuations(__m256i initial_lengths,
                                            __m256i previous_carries) {

  __m256i right1 = _mm256_subs_epu8(
      push_last_byte_of_a_to_b(previous_carries, initial_lengths),
      _mm256_set1_epi8(1));
  __m256i sum = _mm256_add_epi8(initial_lengths, right1);

  __m256i right2 = _mm256_subs_epu8(
      push_last_2bytes_of_a_to_b(previous_carries, sum), _mm256_set1_epi8(2));
  return _mm256_add_epi8(sum, right2);
}

static inline void avxcheckContinuations(__m256i initial_lengths,
                                         __m256i carries, __m256i *has_error) {

  // overlap || underlap
  // carry > length && length > 0 || !(carry > length) && !(length > 0)
  // (carries > length) == (lengths > 0)
  __m256i overunder = _mm256_cmpeq_epi8(
      _mm256_cmpgt_epi8(carries, initial_lengths),
      _mm256_cmpgt_epi8(initial_lengths, _mm256_setzero_si256()));

  *has_error = _mm256_or_si256(*has_error, overunder);
}

// when 0xED is found, next byte must be no larger than 0x9F
// when 0xF4 is found, next byte must be no larger than 0x8F
// next byte must be continuation, ie sign bit is set, so signed < is ok
static inline void avxcheckFirstContinuationMax(__m256i current_bytes,
                                                __m256i off1_current_bytes,
                                                __m256i *has_error) {
  __m256i maskED =
      _mm256_cmpeq_epi8(off1_current_bytes, _mm256_set1_epi8(0xED));
  __m256i maskF4 =
      _mm256_cmpeq_epi8(off1_current_bytes, _mm256_set1_epi8(0xF4));

  __m256i badfollowED = _mm256_and_si256(
      _mm256_cmpgt_epi8(current_bytes, _mm256_set1_epi8(0x9F)), maskED);
  __m256i badfollowF4 = _mm256_and_si256(
      _mm256_cmpgt_epi8(current_bytes, _mm256_set1_epi8(0x8F)), maskF4);

  *has_error =
      _mm256_or_si256(*has_error, _mm256_or_si256(badfollowED, badfollowF4));
}

// map off1_hibits => error condition
// hibits     off1    cur
// C       => < C2 && true
// E       => < E1 && < A0
// F       => < F1 && < 90
// else      false && false
static inline void avxcheckOverlong(__m256i current_bytes,
                                    __m256i off1_current_bytes, __m256i hibits,
                                    __m256i previous_hibits,
                                    __m256i *has_error) {
  __m256i off1_hibits = push_last_byte_of_a_to_b(previous_hibits, hibits);
  __m256i initial_mins = _mm256_shuffle_epi8(
      _mm256_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128,
                       -128, -128, -128, // 10xx => false
                       0xC2, -128,       // 110x
                       0xE1,             // 1110
                       0xF1, -128, -128, -128, -128, -128, -128, -128, -128,
                       -128, -128, -128, -128, // 10xx => false
                       0xC2, -128,             // 110x
                       0xE1,                   // 1110
                       0xF1),
      off1_hibits);

  __m256i initial_under = _mm256_cmpgt_epi8(initial_mins, off1_current_bytes);

  __m256i second_mins = _mm256_shuffle_epi8(
      _mm256_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128,
                       -128, -128, -128, // 10xx => false
                       127, 127,         // 110x => true
                       0xA0,             // 1110
                       0x90, -128, -128, -128, -128, -128, -128, -128, -128,
                       -128, -128, -128, -128, // 10xx => false
                       127, 127,               // 110x => true
                       0xA0,                   // 1110
                       0x90),
      off1_hibits);
  __m256i second_under = _mm256_cmpgt_epi8(second_mins, current_bytes);
  *has_error = _mm256_or_si256(*has_error,
                               _mm256_and_si256(initial_under, second_under));
}

struct avx_processed_utf_bytes {
  __m256i rawbytes;
  __m256i high_nibbles;
  __m256i carried_continuations;
};

static inline void avx_count_nibbles(__m256i bytes,
                                     struct avx_processed_utf_bytes *answer) {
  answer->rawbytes = bytes;
  answer->high_nibbles =
      _mm256_and_si256(_mm256_srli_epi16(bytes, 4), _mm256_set1_epi8(0x0F));
}

// check whether the current bytes are valid UTF-8
// at the end of the function, previous gets updated
static struct avx_processed_utf_bytes
avxcheckUTF8Bytes(__m256i current_bytes,
                  struct avx_processed_utf_bytes *previous,
                  __m256i *has_error) {
  struct avx_processed_utf_bytes pb;
  avx_count_nibbles(current_bytes, &pb);

  avxcheckSmallerThan0xF4(current_bytes, has_error);

  __m256i initial_lengths = avxcontinuationLengths(pb.high_nibbles);

  pb.carried_continuations =
      avxcarryContinuations(initial_lengths, previous->carried_continuations);

  avxcheckContinuations(initial_lengths, pb.carried_continuations, has_error);

  __m256i off1_current_bytes =
      push_last_byte_of_a_to_b(previous->rawbytes, pb.rawbytes);
  avxcheckFirstContinuationMax(current_bytes, off1_current_bytes, has_error);

  avxcheckOverlong(current_bytes, off1_current_bytes, pb.high_nibbles,
                   previous->high_nibbles, has_error);
  return pb;
}

/* Return 0 on success, -1 on error */
int utf8_lemire_avx2(const unsigned char *src, int len) {
  size_t i = 0;
  __m256i has_error = _mm256_setzero_si256();
  struct avx_processed_utf_bytes previous = {
      .rawbytes = _mm256_setzero_si256(),
      .high_nibbles = _mm256_setzero_si256(),
      .carried_continuations = _mm256_setzero_si256()};
  if (len >= 32) {
    for (; i <= len - 32; i += 32) {
      __m256i current_bytes = _mm256_loadu_si256((const __m256i *)(src + i));
      previous = avxcheckUTF8Bytes(current_bytes, &previous, &has_error);
    }
  }

  // last part
  if (i < len) {
    char buffer[32];
    memset(buffer, 0, 32);
    memcpy(buffer, src + i, len - i);
    __m256i current_bytes = _mm256_loadu_si256((const __m256i *)(buffer));
    previous =
        avxcheckUTF8Bytes(current_bytes, &previous, &has_error);
  } else {
    has_error = _mm256_or_si256(
        _mm256_cmpgt_epi8(previous.carried_continuations,
                          _mm256_setr_epi8(9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
                                           9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
                                           9, 9, 9, 9, 9, 9, 9, 1)),
        has_error);
  }

  return _mm256_testz_si256(has_error, has_error) ? 0 : -1;
}

#endif

--------------------------------------------------------------------------------
/lemire-neon.c:
--------------------------------------------------------------------------------
// Adapted from https://github.com/lemire/fastvalidate-utf-8

#ifdef __aarch64__

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * legal utf-8 byte sequence
 * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
 *
 *  Code Points        1st       2s       3s       4s
 * U+0000..U+007F     00..7F
 * U+0080..U+07FF     C2..DF   80..BF
 * U+0800..U+0FFF     E0       A0..BF   80..BF
 * U+1000..U+CFFF     E1..EC   80..BF   80..BF
 * U+D000..U+D7FF     ED       80..9F   80..BF
 * U+E000..U+FFFF     EE..EF   80..BF   80..BF
 * U+10000..U+3FFFF   F0       90..BF   80..BF   80..BF
 * U+40000..U+FFFFF   F1..F3   80..BF   80..BF   80..BF
 * U+100000..U+10FFFF F4       80..8F   80..BF   80..BF
 *
 */

#if 0
static void print128(const char *s, const int8x16_t *v128)
{
  int8_t v8[16];
  vst1q_s8(v8, *v128);

  if (s)
    printf("%s:\t", s);
  for (int i = 0; i < 16; ++i)
    printf("%02x ", (unsigned char)v8[i]);
  printf("\n");
}
#endif

// all byte values must be no larger than 0xF4
static inline void checkSmallerThan0xF4(int8x16_t current_bytes,
                                        int8x16_t *has_error) {
  // unsigned, saturates to 0 below max
  *has_error = vorrq_s8(
      *has_error,
      vreinterpretq_s8_u8(vqsubq_u8(vreinterpretq_u8_s8(current_bytes),
                                    vdupq_n_u8(0xF4))));
}

static const int8_t _nibbles[] =
{
  1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
  0, 0, 0, 0,             // 10xx (continuation)
  2, 2,                   // 110x
  3,                      // 1110
  4, // 1111, next should be 0 (not checked here)
};

static inline int8x16_t continuationLengths(int8x16_t high_nibbles) {
  return vqtbl1q_s8(vld1q_s8(_nibbles), vreinterpretq_u8_s8(high_nibbles));
}

static inline int8x16_t carryContinuations(int8x16_t initial_lengths,
                                           int8x16_t previous_carries) {

  int8x16_t right1 = vreinterpretq_s8_u8(vqsubq_u8(
      vreinterpretq_u8_s8(vextq_s8(previous_carries, initial_lengths, 16 - 1)),
      vdupq_n_u8(1)));
  int8x16_t sum = vaddq_s8(initial_lengths, right1);

  int8x16_t right2 = vreinterpretq_s8_u8(vqsubq_u8(
      vreinterpretq_u8_s8(vextq_s8(previous_carries, sum, 16 - 2)),
      vdupq_n_u8(2)));
  return vaddq_s8(sum, right2);
}

static inline void checkContinuations(int8x16_t initial_lengths, int8x16_t carries,
                                      int8x16_t *has_error) {

  // overlap || underlap
  // carry > length && length > 0 || !(carry > length) && !(length > 0)
  // (carries > length) == (lengths > 0)
  uint8x16_t overunder =
      vceqq_u8(vcgtq_s8(carries, initial_lengths),
               vcgtq_s8(initial_lengths, vdupq_n_s8(0)));

  *has_error = vorrq_s8(*has_error, vreinterpretq_s8_u8(overunder));
}

// when 0xED is found, next byte must be no larger than 0x9F
// when 0xF4 is found, next byte must be no larger than 0x8F
// next byte must be continuation, ie sign bit is set, so signed < is ok
static inline void checkFirstContinuationMax(int8x16_t current_bytes,
                                             int8x16_t off1_current_bytes,
                                             int8x16_t *has_error) {
  uint8x16_t maskED = vceqq_s8(off1_current_bytes, vdupq_n_s8(0xED));
  uint8x16_t maskF4 = vceqq_s8(off1_current_bytes, vdupq_n_s8(0xF4));

  uint8x16_t badfollowED =
      vandq_u8(vcgtq_s8(current_bytes, vdupq_n_s8(0x9F)), maskED);
  uint8x16_t badfollowF4 =
      vandq_u8(vcgtq_s8(current_bytes, vdupq_n_s8(0x8F)), maskF4);

  *has_error = vorrq_s8(
      *has_error, vreinterpretq_s8_u8(vorrq_u8(badfollowED, badfollowF4)));
}

static const int8_t _initial_mins[] = {
  -128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
  -128, -128, // 10xx => false
  0xC2, -128, // 110x
  0xE1,       // 1110
  0xF1,
};

static const int8_t _second_mins[] = {
  -128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
  -128, -128, // 10xx => false
  127, 127,   // 110x => true
  0xA0,       // 1110
  0x90,
};

// map off1_hibits => error condition
// hibits     off1    cur
// C       => < C2 && true
// E       => < E1 && < A0
// F       => < F1 && < 90
// else      false && false
static inline void checkOverlong(int8x16_t current_bytes,
                                 int8x16_t off1_current_bytes, int8x16_t hibits,
                                 int8x16_t previous_hibits, int8x16_t *has_error) {
  int8x16_t off1_hibits = vextq_s8(previous_hibits, hibits, 16 - 1);
  int8x16_t initial_mins =
      vqtbl1q_s8(vld1q_s8(_initial_mins), vreinterpretq_u8_s8(off1_hibits));

  uint8x16_t initial_under = vcgtq_s8(initial_mins, off1_current_bytes);

  int8x16_t second_mins =
      vqtbl1q_s8(vld1q_s8(_second_mins), vreinterpretq_u8_s8(off1_hibits));
  uint8x16_t second_under = vcgtq_s8(second_mins, current_bytes);
  *has_error =
      vorrq_s8(*has_error, vreinterpretq_s8_u8(vandq_u8(initial_under, second_under)));
}

struct processed_utf_bytes {
  int8x16_t rawbytes;
  int8x16_t high_nibbles;
  int8x16_t carried_continuations;
};

static inline void count_nibbles(int8x16_t bytes,
                                 struct processed_utf_bytes *answer) {
  answer->rawbytes = bytes;
  answer->high_nibbles =
      vreinterpretq_s8_u8(vshrq_n_u8(vreinterpretq_u8_s8(bytes), 4));
}

// check whether the current bytes are valid UTF-8
// at the end
// of the function, previous gets updated
static inline struct processed_utf_bytes
checkUTF8Bytes(int8x16_t current_bytes, struct processed_utf_bytes *previous,
               int8x16_t *has_error) {
  struct processed_utf_bytes pb;
  count_nibbles(current_bytes, &pb);

  checkSmallerThan0xF4(current_bytes, has_error);

  int8x16_t initial_lengths = continuationLengths(pb.high_nibbles);

  pb.carried_continuations =
      carryContinuations(initial_lengths, previous->carried_continuations);

  checkContinuations(initial_lengths, pb.carried_continuations, has_error);

  int8x16_t off1_current_bytes =
      vextq_s8(previous->rawbytes, pb.rawbytes, 16 - 1);
  checkFirstContinuationMax(current_bytes, off1_current_bytes, has_error);

  checkOverlong(current_bytes, off1_current_bytes, pb.high_nibbles,
                previous->high_nibbles, has_error);
  return pb;
}

static const int8_t _verror[] = {9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1};

/* Return 0 on success, -1 on error */
int utf8_lemire(const unsigned char *src, int len) {
  size_t i = 0;
  int8x16_t has_error = vdupq_n_s8(0);
  struct processed_utf_bytes previous = {.rawbytes = vdupq_n_s8(0),
                                         .high_nibbles = vdupq_n_s8(0),
                                         .carried_continuations =
                                             vdupq_n_s8(0)};
  if (len >= 16) {
    for (; i <= len - 16; i += 16) {
      int8x16_t current_bytes = vld1q_s8((int8_t*)(src + i));
      previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
    }
  }

  // last part
  if (i < len) {
    char buffer[16];
    memset(buffer, 0, 16);
    memcpy(buffer, src + i, len - i);
    int8x16_t current_bytes = vld1q_s8((int8_t *)buffer);
    previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
  } else {
    has_error =
        vorrq_s8(vreinterpretq_s8_u8(vcgtq_s8(previous.carried_continuations,
                                              vld1q_s8(_verror))),
                 has_error);
  }

  return vmaxvq_u8(vreinterpretq_u8_s8(has_error)) == 0 ? 0 : -1;
}

#endif

--------------------------------------------------------------------------------
/lemire-sse.c:
--------------------------------------------------------------------------------
// Adapted from https://github.com/lemire/fastvalidate-utf-8

#ifdef __x86_64__

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * legal utf-8 byte sequence
 * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
 *
 *  Code Points        1st       2s       3s       4s
 * U+0000..U+007F     00..7F
 * U+0080..U+07FF     C2..DF   80..BF
 * U+0800..U+0FFF     E0       A0..BF   80..BF
 * U+1000..U+CFFF     E1..EC   80..BF   80..BF
 * U+D000..U+D7FF     ED       80..9F   80..BF
 * U+E000..U+FFFF     EE..EF   80..BF   80..BF
 * U+10000..U+3FFFF   F0       90..BF   80..BF   80..BF
 * U+40000..U+FFFFF   F1..F3   80..BF   80..BF   80..BF
 * U+100000..U+10FFFF F4       80..8F   80..BF   80..BF
 *
 */

#if 0
static void print128(const char *s, const __m128i *v128)
{
  const unsigned char *v8 = (const unsigned char *)v128;
  if (s)
    printf("%s: ", s);
  for (int i = 0; i < 16; i++)
    printf("%02x ", v8[i]);
  printf("\n");
}
#endif

// all byte values must be no larger than 0xF4
static inline void checkSmallerThan0xF4(__m128i current_bytes,
                                        __m128i *has_error) {
  // unsigned, saturates to 0 below max
  *has_error = _mm_or_si128(*has_error,
                            _mm_subs_epu8(current_bytes, _mm_set1_epi8(0xF4)));
}

static inline __m128i continuationLengths(__m128i high_nibbles) {
  return _mm_shuffle_epi8(
      _mm_setr_epi8(1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
                    0, 0, 0, 0,             // 10xx (continuation)
                    2, 2,                   // 110x
                    3,                      // 1110
                    4), // 1111, next should be 0 (not checked here)
      high_nibbles);
}

static inline __m128i carryContinuations(__m128i initial_lengths,
                                         __m128i previous_carries) {

  __m128i right1 =
      _mm_subs_epu8(_mm_alignr_epi8(initial_lengths, previous_carries, 16 - 1),
                    _mm_set1_epi8(1));
  __m128i sum = _mm_add_epi8(initial_lengths, right1);

  __m128i right2 = _mm_subs_epu8(_mm_alignr_epi8(sum, previous_carries, 16 - 2),
                                 _mm_set1_epi8(2));
  return _mm_add_epi8(sum, right2);
}

static inline void checkContinuations(__m128i initial_lengths, __m128i carries,
                                      __m128i *has_error) {

  // overlap || underlap
  // carry > length && length > 0 || !(carry > length) && !(length > 0)
  // (carries > length) == (lengths > 0)
  __m128i overunder =
      _mm_cmpeq_epi8(_mm_cmpgt_epi8(carries, initial_lengths),
                     _mm_cmpgt_epi8(initial_lengths, _mm_setzero_si128()));

  *has_error = _mm_or_si128(*has_error, overunder);
}

// when 0xED is found, next byte must be no larger than 0x9F
// when 0xF4 is found, next byte must be no larger than 0x8F
// next byte must be continuation, ie sign bit is set, so signed < is ok
static inline void checkFirstContinuationMax(__m128i current_bytes,
                                             __m128i off1_current_bytes,
                                             __m128i *has_error) {
  __m128i maskED = _mm_cmpeq_epi8(off1_current_bytes, _mm_set1_epi8(0xED));
  __m128i maskF4 = _mm_cmpeq_epi8(off1_current_bytes, _mm_set1_epi8(0xF4));

  __m128i badfollowED =
      _mm_and_si128(_mm_cmpgt_epi8(current_bytes, _mm_set1_epi8(0x9F)), maskED);
  __m128i badfollowF4 =
      _mm_and_si128(_mm_cmpgt_epi8(current_bytes, _mm_set1_epi8(0x8F)), maskF4);

  *has_error = _mm_or_si128(*has_error, _mm_or_si128(badfollowED, badfollowF4));
}

// map off1_hibits => error condition
// hibits     off1    cur
// C       => < C2 && true
// E       => < E1 && < A0
// F       => < F1 && < 90
// else      false && false
static inline void checkOverlong(__m128i current_bytes,
                                 __m128i off1_current_bytes, __m128i hibits,
                                 __m128i
                                 previous_hibits, __m128i *has_error) {
  __m128i off1_hibits = _mm_alignr_epi8(hibits, previous_hibits, 16 - 1);
  __m128i initial_mins = _mm_shuffle_epi8(
      _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
                    -128, -128, // 10xx => false
                    0xC2, -128, // 110x
                    0xE1,       // 1110
                    0xF1),
      off1_hibits);

  __m128i initial_under = _mm_cmpgt_epi8(initial_mins, off1_current_bytes);

  __m128i second_mins = _mm_shuffle_epi8(
      _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
                    -128, -128, // 10xx => false
                    127, 127,   // 110x => true
                    0xA0,       // 1110
                    0x90),
      off1_hibits);
  __m128i second_under = _mm_cmpgt_epi8(second_mins, current_bytes);
  *has_error =
      _mm_or_si128(*has_error, _mm_and_si128(initial_under, second_under));
}

struct processed_utf_bytes {
  __m128i rawbytes;
  __m128i high_nibbles;
  __m128i carried_continuations;
};

static inline void count_nibbles(__m128i bytes,
                                 struct processed_utf_bytes *answer) {
  answer->rawbytes = bytes;
  answer->high_nibbles =
      _mm_and_si128(_mm_srli_epi16(bytes, 4), _mm_set1_epi8(0x0F));
}

// check whether the current bytes are valid UTF-8
// at the end of the function, previous gets updated
static inline struct processed_utf_bytes
checkUTF8Bytes(__m128i current_bytes, struct processed_utf_bytes *previous,
               __m128i *has_error) {

  struct processed_utf_bytes pb;
  count_nibbles(current_bytes, &pb);

  checkSmallerThan0xF4(current_bytes, has_error);

  __m128i initial_lengths = continuationLengths(pb.high_nibbles);

  pb.carried_continuations =
      carryContinuations(initial_lengths, previous->carried_continuations);

  checkContinuations(initial_lengths, pb.carried_continuations, has_error);

  __m128i off1_current_bytes =
      _mm_alignr_epi8(pb.rawbytes, previous->rawbytes, 16 - 1);
  checkFirstContinuationMax(current_bytes, off1_current_bytes, has_error);

  checkOverlong(current_bytes, off1_current_bytes, pb.high_nibbles,
                previous->high_nibbles, has_error);
  return pb;
}

/* Return 0 on success, -1 on error */
int utf8_lemire(const unsigned char *src, int len) {
  size_t i = 0;
  __m128i has_error = _mm_setzero_si128();
  struct processed_utf_bytes previous = {.rawbytes = _mm_setzero_si128(),
                                         .high_nibbles = _mm_setzero_si128(),
                                         .carried_continuations =
                                             _mm_setzero_si128()};
  if (len >= 16) {
    for (; i <= len - 16; i += 16) {
      __m128i current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
      previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
    }
  }

  // last part
  if (i < len) {
    char buffer[16];
    memset(buffer, 0, 16);
    memcpy(buffer, src + i, len - i);
    __m128i current_bytes = _mm_loadu_si128((const __m128i *)(buffer));
    previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
  } else {
    has_error =
        _mm_or_si128(_mm_cmpgt_epi8(previous.carried_continuations,
                                    _mm_setr_epi8(9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
                                                  9, 9, 9, 9, 9, 1)),
                     has_error);
  }

  return _mm_testz_si128(has_error, has_error) ? 0 : -1;
}

#endif

--------------------------------------------------------------------------------
/lookup.c:
--------------------------------------------------------------------------------
#include <stdio.h>

/* http://bjoern.hoehrmann.de/utf-8/decoder/dfa */
/* Optimized version based on Rich Felker's variant. */
#define UTF8_ACCEPT 0
#define UTF8_REJECT 12

static const unsigned char utf8d[] = {
    /* The first part of the table maps bytes to character classes
     * to reduce the size of the transition table and create bitmasks.
     */
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
     1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
     7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
     8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8
};
/* Note: Splitting the table improves performance on ARM due to its simpler
 * addressing modes not being able to encode x[y + 256]. */
static const unsigned char utf8s[] = {
    /* The second part is a transition table that maps a combination
     * of a state of the automaton and a character class to a state. */
     0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
    12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
    12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
    12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
    12,36,12,12,12,12,12,12,12,12,12,12
};

/* Return 0 on success, -1 on error */
int utf8_lookup(const unsigned char *data, int len)
{
    int state = 0;

    while (len-- && state != UTF8_REJECT)
        state = utf8s[state + utf8d[*data++]];

    return state == UTF8_ACCEPT ?
        0 : -1;
}

--------------------------------------------------------------------------------
/main.c:
--------------------------------------------------------------------------------
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int utf8_naive(const unsigned char *data, int len);
int utf8_lookup(const unsigned char *data, int len);
int utf8_boost(const unsigned char *data, int len);
int utf8_lemire(const unsigned char *data, int len);
int utf8_range(const unsigned char *data, int len);
int utf8_range2(const unsigned char *data, int len);
#ifdef __AVX2__
int utf8_lemire_avx2(const unsigned char *data, int len);
int utf8_range_avx2(const unsigned char *data, int len);
#endif

static struct ftab {
    const char *name;
    int (*func)(const unsigned char *data, int len);
} ftab[] = {
    {
        .name = "naive",
        .func = utf8_naive,
    },
    {
        .name = "lookup",
        .func = utf8_lookup,
    },
    {
        .name = "lemire",
        .func = utf8_lemire,
    },
    {
        .name = "range",
        .func = utf8_range,
    },
    {
        .name = "range2",
        .func = utf8_range2,
    },
#ifdef __AVX2__
    {
        .name = "lemire_avx2",
        .func = utf8_lemire_avx2,
    },
    {
        .name = "range_avx2",
        .func = utf8_range_avx2,
    },
#endif
#ifdef BOOST
    {
        .name = "boost",
        .func = utf8_boost,
    },
#endif
};

static unsigned char *load_test_buf(int len)
{
    const char utf8[] = "\xF0\x90\xBF\x80";
    const int utf8_len = sizeof(utf8)/sizeof(utf8[0]) - 1;

    unsigned char *data = malloc(len);
    unsigned char *p = data;

    while (len >= utf8_len) {
        memcpy(p, utf8, utf8_len);
        p += utf8_len;
        len -= utf8_len;
    }

    while (len--)
        *p++ = 0x7F;

    return data;
}

static unsigned char
*load_test_file(int *len)
{
    unsigned char *data;
    int fd;
    struct stat stat;

    fd = open("./UTF-8-demo.txt", O_RDONLY);
    if (fd == -1) {
        printf("Failed to open UTF-8-demo.txt!\n");
        exit(1);
    }
    if (fstat(fd, &stat) == -1) {
        printf("Failed to get file size!\n");
        exit(1);
    }

    *len = stat.st_size;
    data = malloc(*len);
    if (read(fd, data, *len) != *len) {
        printf("Failed to read file!\n");
        exit(1);
    }

    utf8_range(data, *len);
#ifdef __AVX2__
    utf8_range_avx2(data, *len);
#endif
    close(fd);

    return data;
}

static void print_test(const unsigned char *data, int len)
{
    while (len--)
        printf("\\x%02X", *data++);

    printf("\n");
}

struct test {
    const unsigned char *data;
    int len;
};

static void prepare_test_buf(unsigned char *buf, const struct test *pos,
                             int pos_len, int pos_idx)
{
    /* Concatenate valid tokens round-robin until 1024 bytes are filled */
    int buf_idx = 0;
    while (buf_idx < 1024) {
        int buf_len = 1024 - buf_idx;

        if (buf_len >= pos[pos_idx].len) {
            memcpy(buf+buf_idx, pos[pos_idx].data, pos[pos_idx].len);
            buf_idx += pos[pos_idx].len;
        } else {
            memset(buf+buf_idx, 0, buf_len);
            buf_idx += buf_len;
        }

        if (++pos_idx == pos_len)
            pos_idx = 0;
    }
}

/* Return 0 on success, -1 on error */
static int test_manual(const struct ftab *ftab)
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpointer-sign"
    /* positive tests */
    static const struct test pos[] = {
        {"", 0},
        {"\x00", 1},
        {"\x66", 1},
        {"\x7F", 1},
        {"\x00\x7F", 2},
        {"\x7F\x00", 2},
        {"\xC2\x80", 2},
        {"\xDF\xBF", 2},
        {"\xE0\xA0\x80", 3},
        {"\xE0\xA0\xBF", 3},
        {"\xED\x9F\x80",
         3},
        {"\xEF\x80\xBF", 3},
        {"\xF0\x90\xBF\x80", 4},
        {"\xF2\x81\xBE\x99", 4},
        {"\xF4\x8F\x88\xAA", 4},
    };

    /* negative tests */
    static const struct test neg[] = {
        {"\x80", 1},
        {"\xBF", 1},
        {"\xC0\x80", 2},
        {"\xC1\x00", 2},
        {"\xC2\x7F", 2},
        {"\xDF\xC0", 2},
        {"\xE0\x9F\x80", 3},
        {"\xE0\xC2\x80", 3},
        {"\xED\xA0\x80", 3},
        {"\xED\x7F\x80", 3},
        {"\xEF\x80\x00", 3},
        {"\xF0\x8F\x80\x80", 4},
        {"\xF0\xEE\x80\x80", 4},
        {"\xF2\x90\x91\x7F", 4},
        {"\xF4\x90\x88\xAA", 4},
        {"\xF4\x00\xBF\xBF", 4},
        {"\x00\x00\x00\x00\x00\xC2\x80\x00\x00\x00\xE1\x80\x80\x00\x00\xC2" \
         "\xC2\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00",
         32},
        {"\x00\x00\x00\x00\x00\xC2\xC2\x80\x00\x00\xE1\x80\x80\x00\x00\x00",
         16},
        {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
         "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80",
         32},
        {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
         "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1",
         32},
        {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
         "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \
         "\x80", 33},
        {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
         "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \
         "\xC2\x80", 34},
        {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
         "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF0" \
         "\x80\x80\x80", 35},
    };
#pragma GCC diagnostic pop

    /* Test single token */
    for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) {
        if (ftab->func(pos[i].data, pos[i].len) != 0) {
            printf("FAILED positive test: ");
            print_test(pos[i].data, pos[i].len);
            return -1;
        }
    }
    for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) {
        if (ftab->func(neg[i].data, neg[i].len) == 0) {
            printf("FAILED negative test: ");
            print_test(neg[i].data, neg[i].len);
            return -1;
        }
    }

    /* Test shifted buffer to cover 1k length */
    /* buffer size must be greater than 1024 + 16 + max(test string length) */
    const int max_size = 1024*2;
    uint64_t buf64[max_size/8 + 2];
    /* Offset the 8-byte-aligned buffer by 1 byte */
    unsigned char *buf = ((unsigned char *)buf64) + 1;
    int buf_len;

    for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) {
        /* Positive test: shift 16 bytes, validate each shift */
        prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), i);
        buf_len = 1024;
        for (int j = 0; j < 16; ++j) {
            if (ftab->func(buf, buf_len) != 0) {
                printf("FAILED positive test: ");
                print_test(buf, buf_len);
                return -1;
            }
            for (int k = buf_len; k >= 1; --k)
                buf[k] = buf[k-1];
            buf[0] = '\x55';
            ++buf_len;
        }

        /* Negative test: truncate inside the last non-ASCII sequence */
        while (buf_len >= 1 && buf[buf_len-1] <= 0x7F)
            --buf_len;
        if (buf_len && ftab->func(buf, buf_len-1) == 0) {
            printf("FAILED negative test: ");
            print_test(buf, buf_len);
            return -1;
        }
    }

    /* Negative test */
    for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) {
        /* Append one error token, shift 16 bytes, validate each shift */
        int pos_idx = i % (sizeof(pos)/sizeof(pos[0]));
        prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), pos_idx);
        memcpy(buf+1024, neg[i].data, neg[i].len);
        buf_len = 1024 + neg[i].len;
        for (int j = 0; j < 16; ++j) {
            if (ftab->func(buf, buf_len) == 0) {
                printf("FAILED negative test: ");
                print_test(buf, buf_len);
                return -1;
            }
            for (int k = buf_len; k >= 1; --k)
                buf[k] = buf[k-1];
            buf[0] = '\x66';
            ++buf_len;
        }
    }

    return 0;
}

static int test(const unsigned char *data, int len, const struct ftab *ftab)
{
    int ret_standard = ftab->func(data, len);
    int ret_manual = test_manual(ftab);
    printf("%s\n", ftab->name);
    printf("standard test: %s\n", ret_standard ? "FAIL" : "pass");
    printf("manual test: %s\n", ret_manual ? "FAIL" : "pass");

    return ret_standard | ret_manual;
}

static int bench(const unsigned char *data, int len, const struct ftab *ftab)
{
    const int loops = 1024*1024*1024/len;
    int ret = 0;
    double time, size;
    struct timeval tv1, tv2;

    fprintf(stderr, "bench %s... ", ftab->name);
    gettimeofday(&tv1, 0);
    for (int i = 0; i < loops; ++i)
        ret |= ftab->func(data, len);
    gettimeofday(&tv2, 0);
    printf("%s\n", ret?"FAIL":"pass");

    time = tv2.tv_usec - tv1.tv_usec;
    time = time / 1000000 + tv2.tv_sec - tv1.tv_sec;
    size = ((double)len * loops) / (1024*1024);
    printf("time: %.4f s\n", time);
    printf("data: %.0f MB\n", size);
    printf("BW: %.2f MB/s\n", size / time);

    return 0;
}

static void usage(const char *bin)
{
    printf("Usage:\n");
    printf("%s test [alg] ==> test all or one algorithm\n", bin);
    printf("%s bench [alg] ==> benchmark all or one algorithm\n", bin);
    printf("%s bench size NUM ==> benchmark with specific buffer size\n", bin);
    printf("alg = ");
    for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i)
        printf("%s ", ftab[i].name);
    printf("\nNUM = buffer size in bytes, 1 ~ 67108864(64M)\n");
}

int main(int argc, char *argv[])
{
    int len = 0;
    unsigned char *data;
    const char *alg = NULL;
    int (*tb)(const unsigned char *data, int len, const struct ftab *ftab);

    tb = NULL;
    if (argc >= 2) {
if (strcmp(argv[1], "test") == 0) 345 | tb = test; 346 | else if (strcmp(argv[1], "bench") == 0) 347 | tb = bench; 348 | if (argc >= 3) { 349 | alg = argv[2]; 350 | if (strcmp(alg, "size") == 0) { 351 | if (argc < 4) { 352 | tb = NULL; 353 | } else { 354 | alg = NULL; 355 | len = atoi(argv[3]); 356 | if (len <= 0 || len > 67108864) { 357 | printf("Buffer size error!\n\n"); 358 | tb = NULL; 359 | } 360 | } 361 | } 362 | } 363 | } 364 | 365 | if (tb == NULL) { 366 | usage(argv[0]); 367 | return 1; 368 | } 369 | 370 | /* Load UTF8 test buffer */ 371 | if (len) 372 | data = load_test_buf(len); 373 | else 374 | data = load_test_file(&len); 375 | 376 | int ret = 0; 377 | if (tb == bench) 378 | printf("=============== Bench UTF8 (%d bytes) ===============\n", len); 379 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) { 380 | if (alg && strcmp(alg, ftab[i].name) != 0) 381 | continue; 382 | ret |= tb((const unsigned char *)data, len, &ftab[i]); 383 | printf("\n"); 384 | } 385 | 386 | #if 0 387 | if (tb == bench) { 388 | printf("==================== Bench ASCII ====================\n"); 389 | /* Change test buffer to ascii */ 390 | for (int i = 0; i < len; i++) 391 | data[i] &= 0x7F; 392 | 393 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) { 394 | if (alg && strcmp(alg, ftab[i].name) != 0) 395 | continue; 396 | tb((const unsigned char *)data, len, &ftab[i]); 397 | printf("\n"); 398 | } 399 | } 400 | #endif 401 | 402 | free(data); 403 | 404 | return ret; 405 | } 406 | -------------------------------------------------------------------------------- /naive.c: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | /* 4 | * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94 5 | * 6 | * Table 3-7. 
Well-Formed UTF-8 Byte Sequences 7 | * 8 | * +--------------------+------------+-------------+------------+-------------+ 9 | * | Code Points | First Byte | Second Byte | Third Byte | Fourth Byte | 10 | * +--------------------+------------+-------------+------------+-------------+ 11 | * | U+0000..U+007F | 00..7F | | | | 12 | * +--------------------+------------+-------------+------------+-------------+ 13 | * | U+0080..U+07FF | C2..DF | 80..BF | | | 14 | * +--------------------+------------+-------------+------------+-------------+ 15 | * | U+0800..U+0FFF | E0 | A0..BF | 80..BF | | 16 | * +--------------------+------------+-------------+------------+-------------+ 17 | * | U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | | 18 | * +--------------------+------------+-------------+------------+-------------+ 19 | * | U+D000..U+D7FF | ED | 80..9F | 80..BF | | 20 | * +--------------------+------------+-------------+------------+-------------+ 21 | * | U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | | 22 | * +--------------------+------------+-------------+------------+-------------+ 23 | * | U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF | 24 | * +--------------------+------------+-------------+------------+-------------+ 25 | * | U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF | 26 | * +--------------------+------------+-------------+------------+-------------+ 27 | * | U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF | 28 | * +--------------------+------------+-------------+------------+-------------+ 29 | */ 30 | 31 | /* Return 0 - success, >0 - index(1 based) of first error char */ 32 | int utf8_naive(const unsigned char *data, int len) 33 | { 34 | int err_pos = 1; 35 | 36 | while (len) { 37 | int bytes; 38 | const unsigned char byte1 = data[0]; 39 | 40 | /* 00..7F */ 41 | if (byte1 <= 0x7F) { 42 | bytes = 1; 43 | /* C2..DF, 80..BF */ 44 | } else if (len >= 2 && byte1 >= 0xC2 && byte1 <= 0xDF && 45 | (signed char)data[1] <= (signed char)0xBF) { 46 | bytes = 2; 47 | 
} else if (len >= 3) { 48 | const unsigned char byte2 = data[1]; 49 | 50 | /* Is byte2, byte3 between 0x80 ~ 0xBF */ 51 | const int byte2_ok = (signed char)byte2 <= (signed char)0xBF; 52 | const int byte3_ok = (signed char)data[2] <= (signed char)0xBF; 53 | 54 | if (byte2_ok && byte3_ok && 55 | /* E0, A0..BF, 80..BF */ 56 | ((byte1 == 0xE0 && byte2 >= 0xA0) || 57 | /* E1..EC, 80..BF, 80..BF */ 58 | (byte1 >= 0xE1 && byte1 <= 0xEC) || 59 | /* ED, 80..9F, 80..BF */ 60 | (byte1 == 0xED && byte2 <= 0x9F) || 61 | /* EE..EF, 80..BF, 80..BF */ 62 | (byte1 >= 0xEE && byte1 <= 0xEF))) { 63 | bytes = 3; 64 | } else if (len >= 4) { 65 | /* Is byte4 between 0x80 ~ 0xBF */ 66 | const int byte4_ok = (signed char)data[3] <= (signed char)0xBF; 67 | 68 | if (byte2_ok && byte3_ok && byte4_ok && 69 | /* F0, 90..BF, 80..BF, 80..BF */ 70 | ((byte1 == 0xF0 && byte2 >= 0x90) || 71 | /* F1..F3, 80..BF, 80..BF, 80..BF */ 72 | (byte1 >= 0xF1 && byte1 <= 0xF3) || 73 | /* F4, 80..8F, 80..BF, 80..BF */ 74 | (byte1 == 0xF4 && byte2 <= 0x8F))) { 75 | bytes = 4; 76 | } else { 77 | return err_pos; 78 | } 79 | } else { 80 | return err_pos; 81 | } 82 | } else { 83 | return err_pos; 84 | } 85 | 86 | len -= bytes; 87 | err_pos += bytes; 88 | data += bytes; 89 | } 90 | 91 | return 0; 92 | } 93 | -------------------------------------------------------------------------------- /range-avx2.c: -------------------------------------------------------------------------------- 1 | #ifdef __AVX2__ 2 | 3 | #include 4 | #include 5 | #include 6 | 7 | int utf8_naive(const unsigned char *data, int len); 8 | 9 | #if 0 10 | static void print256(const char *s, const __m256i v256) 11 | { 12 | const unsigned char *v8 = (const unsigned char *)&v256; 13 | if (s) 14 | printf("%s:\t", s); 15 | for (int i = 0; i < 32; i++) 16 | printf("%02x ", v8[i]); 17 | printf("\n"); 18 | } 19 | #endif 20 | 21 | /* 22 | * Map high nibble of "First Byte" to legal character length minus 1 23 | * 0x00 ~ 0xBF --> 0 24 | * 0xC0 ~ 0xDF --> 1 25 
| * 0xE0 ~ 0xEF --> 2 26 | * 0xF0 ~ 0xFF --> 3 27 | */ 28 | static const int8_t _first_len_tbl[] = { 29 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 30 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 31 | }; 32 | 33 | /* Map "First Byte" to 8-th item of range table (0xC2 ~ 0xF4) */ 34 | static const int8_t _first_range_tbl[] = { 35 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 36 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 37 | }; 38 | 39 | /* 40 | * Range table, map range index to min and max values 41 | * Index 0 : 00 ~ 7F (First Byte, ascii) 42 | * Index 1,2,3: 80 ~ BF (Second, Third, Fourth Byte) 43 | * Index 4 : A0 ~ BF (Second Byte after E0) 44 | * Index 5 : 80 ~ 9F (Second Byte after ED) 45 | * Index 6 : 90 ~ BF (Second Byte after F0) 46 | * Index 7 : 80 ~ 8F (Second Byte after F4) 47 | * Index 8 : C2 ~ F4 (First Byte, non ascii) 48 | * Index 9~15 : illegal: i >= 127 && i <= -128 49 | */ 50 | static const int8_t _range_min_tbl[] = { 51 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80, 52 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 53 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80, 54 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 55 | }; 56 | static const int8_t _range_max_tbl[] = { 57 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F, 58 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 59 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F, 60 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 61 | }; 62 | 63 | /* 64 | * Tables for fast handling of four special First Bytes(E0,ED,F0,F4), after 65 | * which the Second Byte are not 80~BF. It contains "range index adjustment". 
66 | * +------------+---------------+------------------+----------------+ 67 | * | First Byte | original range| range adjustment | adjusted range | 68 | * +------------+---------------+------------------+----------------+ 69 | * | E0 | 2 | 2 | 4 | 70 | * +------------+---------------+------------------+----------------+ 71 | * | ED | 2 | 3 | 5 | 72 | * +------------+---------------+------------------+----------------+ 73 | * | F0 | 3 | 3 | 6 | 74 | * +------------+---------------+------------------+----------------+ 75 | * | F4 | 3 | 4 | 7 | 76 | * +------------+---------------+------------------+----------------+ 77 | */ 78 | /* index1 -> E0, index14 -> ED */ 79 | static const int8_t _df_ee_tbl[] = { 80 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 81 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 82 | }; 83 | /* index1 -> F0, index5 -> F4 */ 84 | static const int8_t _ef_fe_tbl[] = { 85 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 86 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 87 | }; 88 | 89 | #define RET_ERR_IDX 0 /* Define 1 to return index of first error char */ 90 | 91 | static inline __m256i push_last_byte_of_a_to_b(__m256i a, __m256i b) { 92 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 15); 93 | } 94 | 95 | static inline __m256i push_last_2bytes_of_a_to_b(__m256i a, __m256i b) { 96 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 14); 97 | } 98 | 99 | static inline __m256i push_last_3bytes_of_a_to_b(__m256i a, __m256i b) { 100 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 13); 101 | } 102 | 103 | /* 5x faster than naive method */ 104 | /* Return 0 - success, -1 - error, >0 - first error char(if RET_ERR_IDX = 1) */ 105 | int utf8_range_avx2(const unsigned char *data, int len) 106 | { 107 | #if RET_ERR_IDX 108 | int err_pos = 1; 109 | #endif 110 | 111 | if (len >= 32) { 112 | __m256i prev_input = _mm256_set1_epi8(0); 113 | __m256i prev_first_len = _mm256_set1_epi8(0); 114 
| 115 | /* Cached tables */ 116 | const __m256i first_len_tbl = 117 | _mm256_loadu_si256((const __m256i *)_first_len_tbl); 118 | const __m256i first_range_tbl = 119 | _mm256_loadu_si256((const __m256i *)_first_range_tbl); 120 | const __m256i range_min_tbl = 121 | _mm256_loadu_si256((const __m256i *)_range_min_tbl); 122 | const __m256i range_max_tbl = 123 | _mm256_loadu_si256((const __m256i *)_range_max_tbl); 124 | const __m256i df_ee_tbl = 125 | _mm256_loadu_si256((const __m256i *)_df_ee_tbl); 126 | const __m256i ef_fe_tbl = 127 | _mm256_loadu_si256((const __m256i *)_ef_fe_tbl); 128 | 129 | #if !RET_ERR_IDX 130 | __m256i error1 = _mm256_set1_epi8(0); 131 | __m256i error2 = _mm256_set1_epi8(0); 132 | #endif 133 | 134 | while (len >= 32) { 135 | const __m256i input = _mm256_loadu_si256((const __m256i *)data); 136 | 137 | /* high_nibbles = input >> 4 */ 138 | const __m256i high_nibbles = 139 | _mm256_and_si256(_mm256_srli_epi16(input, 4), _mm256_set1_epi8(0x0F)); 140 | 141 | /* first_len = legal character length minus 1 */ 142 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */ 143 | /* first_len = first_len_tbl[high_nibbles] */ 144 | __m256i first_len = _mm256_shuffle_epi8(first_len_tbl, high_nibbles); 145 | 146 | /* First Byte: set range index to 8 for bytes within 0xC0 ~ 0xFF */ 147 | /* range = first_range_tbl[high_nibbles] */ 148 | __m256i range = _mm256_shuffle_epi8(first_range_tbl, high_nibbles); 149 | 150 | /* Second Byte: set range index to first_len */ 151 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */ 152 | /* range |= (first_len, prev_first_len) << 1 byte */ 153 | range = _mm256_or_si256( 154 | range, push_last_byte_of_a_to_b(prev_first_len, first_len)); 155 | 156 | /* Third Byte: set range index to saturate_sub(first_len, 1) */ 157 | /* 0 for 00~7F, 0 for C0~DF, 1 for E0~EF, 2 for F0~FF */ 158 | __m256i tmp1, tmp2; 159 | 160 | /* tmp1 = (first_len, prev_first_len) << 2 bytes */ 161 | tmp1 = push_last_2bytes_of_a_to_b(prev_first_len, 
first_len); 162 | /* tmp2 = saturate_sub(tmp1, 1) */ 163 | tmp2 = _mm256_subs_epu8(tmp1, _mm256_set1_epi8(1)); 164 | 165 | /* range |= tmp2 */ 166 | range = _mm256_or_si256(range, tmp2); 167 | 168 | /* Fourth Byte: set range index to saturate_sub(first_len, 2) */ 169 | /* 0 for 00~7F, 0 for C0~DF, 0 for E0~EF, 1 for F0~FF */ 170 | /* tmp1 = (first_len, prev_first_len) << 3 bytes */ 171 | tmp1 = push_last_3bytes_of_a_to_b(prev_first_len, first_len); 172 | /* tmp2 = saturate_sub(tmp1, 2) */ 173 | tmp2 = _mm256_subs_epu8(tmp1, _mm256_set1_epi8(2)); 174 | /* range |= tmp2 */ 175 | range = _mm256_or_si256(range, tmp2); 176 | 177 | /* 178 | * Now the following range indices have been calculated 179 | * Correct cases: 180 | * - 8 for C0~FF 181 | * - 3 for 1st byte after F0~FF 182 | * - 2 for 1st byte after E0~EF or 2nd byte after F0~FF 183 | * - 1 for 1st byte after C0~DF or 2nd byte after E0~EF or 184 | * 3rd byte after F0~FF 185 | * - 0 for others 186 | * Error cases: 187 | * 9,10,11 if non ascii First Byte overlaps 188 | * E.g., F1 80 C2 90 --> 8 3 10 1, where 10 indicates error 189 | */ 190 | 191 | /* Adjust Second Byte range for special First Bytes(E0,ED,F0,F4) */ 192 | /* Overlaps lead to index 9~15, which are illegal in range table */ 193 | __m256i shift1, pos, range2; 194 | /* shift1 = (input, prev_input) << 1 byte */ 195 | shift1 = push_last_byte_of_a_to_b(prev_input, input); 196 | pos = _mm256_sub_epi8(shift1, _mm256_set1_epi8(0xEF)); 197 | /* 198 | * shift1: | EF F0 ... FE | FF 00 ... ... DE | DF E0 ... 
EE | 199 | * pos: | 0 1 15 | 16 17 239| 240 241 255| 200 | * pos-240: | 0 0 0 | 0 0 0 | 0 1 15 | 201 | * pos+112: | 112 113 127| >= 128 | >= 128 | 202 | */ 203 | tmp1 = _mm256_subs_epu8(pos, _mm256_set1_epi8(240)); 204 | range2 = _mm256_shuffle_epi8(df_ee_tbl, tmp1); 205 | tmp2 = _mm256_adds_epu8(pos, _mm256_set1_epi8(112)); 206 | range2 = _mm256_add_epi8(range2, _mm256_shuffle_epi8(ef_fe_tbl, tmp2)); 207 | 208 | range = _mm256_add_epi8(range, range2); 209 | 210 | /* Load min and max values per calculated range index */ 211 | __m256i minv = _mm256_shuffle_epi8(range_min_tbl, range); 212 | __m256i maxv = _mm256_shuffle_epi8(range_max_tbl, range); 213 | 214 | /* Check value range */ 215 | #if RET_ERR_IDX 216 | __m256i error = _mm256_cmpgt_epi8(minv, input); 217 | error = _mm256_or_si256(error, _mm256_cmpgt_epi8(input, maxv)); 218 | /* 5% performance drop from this conditional branch */ 219 | if (!_mm256_testz_si256(error, error)) 220 | break; 221 | #else 222 | error1 = _mm256_or_si256(error1, _mm256_cmpgt_epi8(minv, input)); 223 | error2 = _mm256_or_si256(error2, _mm256_cmpgt_epi8(input, maxv)); 224 | #endif 225 | 226 | prev_input = input; 227 | prev_first_len = first_len; 228 | 229 | data += 32; 230 | len -= 32; 231 | #if RET_ERR_IDX 232 | err_pos += 32; 233 | #endif 234 | } 235 | 236 | #if RET_ERR_IDX 237 | /* Error in first 32 bytes */ 238 | if (err_pos == 1) 239 | goto do_naive; 240 | #else 241 | __m256i error = _mm256_or_si256(error1, error2); 242 | if (!_mm256_testz_si256(error, error)) 243 | return -1; 244 | #endif 245 | 246 | /* Find previous token (not 80~BF) */ 247 | int32_t token4 = _mm256_extract_epi32(prev_input, 7); 248 | const int8_t *token = (const int8_t *)&token4; 249 | int lookahead = 0; 250 | if (token[3] > (int8_t)0xBF) 251 | lookahead = 1; 252 | else if (token[2] > (int8_t)0xBF) 253 | lookahead = 2; 254 | else if (token[1] > (int8_t)0xBF) 255 | lookahead = 3; 256 | 257 | data -= lookahead; 258 | len += lookahead; 259 | #if RET_ERR_IDX 260 | 
err_pos -= lookahead; 261 | #endif 262 | } 263 | 264 | /* Check remaining bytes with naive method */ 265 | #if RET_ERR_IDX 266 | int err_pos2; 267 | do_naive: 268 | err_pos2 = utf8_naive(data, len); 269 | if (err_pos2) 270 | return err_pos + err_pos2 - 1; 271 | return 0; 272 | #else 273 | return utf8_naive(data, len); 274 | #endif 275 | } 276 | 277 | #endif 278 | -------------------------------------------------------------------------------- /range-neon.c: -------------------------------------------------------------------------------- 1 | #ifdef __aarch64__ 2 | 3 | #include 4 | #include 5 | #include 6 | 7 | int utf8_naive(const unsigned char *data, int len); 8 | 9 | #if 0 10 | static void print128(const char *s, const uint8x16_t v128) 11 | { 12 | unsigned char v8[16]; 13 | vst1q_u8(v8, v128); 14 | 15 | if (s) 16 | printf("%s:\t", s); 17 | for (int i = 0; i < 16; ++i) 18 | printf("%02x ", v8[i]); 19 | printf("\n"); 20 | } 21 | #endif 22 | 23 | /* 24 | * Map high nibble of "First Byte" to legal character length minus 1 25 | * 0x00 ~ 0xBF --> 0 26 | * 0xC0 ~ 0xDF --> 1 27 | * 0xE0 ~ 0xEF --> 2 28 | * 0xF0 ~ 0xFF --> 3 29 | */ 30 | static const uint8_t _first_len_tbl[] = { 31 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 32 | }; 33 | 34 | /* Map "First Byte" to 8-th item of range table (0xC2 ~ 0xF4) */ 35 | static const uint8_t _first_range_tbl[] = { 36 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 37 | }; 38 | 39 | /* 40 | * Range table, map range index to min and max values 41 | * Index 0 : 00 ~ 7F (First Byte, ascii) 42 | * Index 1,2,3: 80 ~ BF (Second, Third, Fourth Byte) 43 | * Index 4 : A0 ~ BF (Second Byte after E0) 44 | * Index 5 : 80 ~ 9F (Second Byte after ED) 45 | * Index 6 : 90 ~ BF (Second Byte after F0) 46 | * Index 7 : 80 ~ 8F (Second Byte after F4) 47 | * Index 8 : C2 ~ F4 (First Byte, non ascii) 48 | * Index 9~15 : illegal: u >= 255 && u <= 0 49 | */ 50 | static const uint8_t _range_min_tbl[] = { 51 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 
0x80, 52 | 0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 53 | }; 54 | static const uint8_t _range_max_tbl[] = { 55 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F, 56 | 0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 57 | }; 58 | 59 | /* 60 | * This table speeds up handling of the four special First Bytes(E0,ED,F0,F4), after 61 | * which the Second Byte range is not 80~BF. It contains the "range index adjustment". 62 | * - The idea is to subtract E0 from the byte and use the result(0~31) as the index to 63 | * look up the "range index adjustment", then add the adjustment to the original 64 | * range index to get the correct range. 65 | * - Range index adjustment 66 | * +------------+---------------+------------------+----------------+ 67 | * | First Byte | original range| range adjustment | adjusted range | 68 | * +------------+---------------+------------------+----------------+ 69 | * | E0 | 2 | 2 | 4 | 70 | * +------------+---------------+------------------+----------------+ 71 | * | ED | 2 | 3 | 5 | 72 | * +------------+---------------+------------------+----------------+ 73 | * | F0 | 3 | 3 | 6 | 74 | * +------------+---------------+------------------+----------------+ 75 | * | F4 | 3 | 4 | 7 | 76 | * +------------+---------------+------------------+----------------+ 77 | * - Below is a uint8x16x2 table. vld2q_u8 de-interleaves it into two NEON registers, so it 78 | * is laid out vertically here. 1st column is for E0~EF, 2nd column for F0~FF. 
79 | */ 80 | static const uint8_t _range_adjust_tbl[] = { 81 | /* index -> 0~15 16~31 <- index */ 82 | /* E0 -> */ 2, 3, /* <- F0 */ 83 | 0, 0, 84 | 0, 0, 85 | 0, 0, 86 | 0, 4, /* <- F4 */ 87 | 0, 0, 88 | 0, 0, 89 | 0, 0, 90 | 0, 0, 91 | 0, 0, 92 | 0, 0, 93 | 0, 0, 94 | 0, 0, 95 | /* ED -> */ 3, 0, 96 | 0, 0, 97 | 0, 0, 98 | }; 99 | 100 | /* 2x ~ 4x faster than naive method */ 101 | /* Return 0 on success, -1 on error */ 102 | int utf8_range(const unsigned char *data, int len) 103 | { 104 | if (len >= 16) { 105 | uint8x16_t prev_input = vdupq_n_u8(0); 106 | uint8x16_t prev_first_len = vdupq_n_u8(0); 107 | 108 | /* Cached tables */ 109 | const uint8x16_t first_len_tbl = vld1q_u8(_first_len_tbl); 110 | const uint8x16_t first_range_tbl = vld1q_u8(_first_range_tbl); 111 | const uint8x16_t range_min_tbl = vld1q_u8(_range_min_tbl); 112 | const uint8x16_t range_max_tbl = vld1q_u8(_range_max_tbl); 113 | const uint8x16x2_t range_adjust_tbl = vld2q_u8(_range_adjust_tbl); 114 | 115 | /* Cached values */ 116 | const uint8x16_t const_1 = vdupq_n_u8(1); 117 | const uint8x16_t const_2 = vdupq_n_u8(2); 118 | const uint8x16_t const_e0 = vdupq_n_u8(0xE0); 119 | 120 | /* We use two error registers to remove a dependency. 
*/ 121 | uint8x16_t error1 = vdupq_n_u8(0); 122 | uint8x16_t error2 = vdupq_n_u8(0); 123 | 124 | while (len >= 16) { 125 | const uint8x16_t input = vld1q_u8(data); 126 | 127 | /* high_nibbles = input >> 4 */ 128 | const uint8x16_t high_nibbles = vshrq_n_u8(input, 4); 129 | 130 | /* first_len = legal character length minus 1 */ 131 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */ 132 | /* first_len = first_len_tbl[high_nibbles] */ 133 | const uint8x16_t first_len = 134 | vqtbl1q_u8(first_len_tbl, high_nibbles); 135 | 136 | /* First Byte: set range index to 8 for bytes within 0xC0 ~ 0xFF */ 137 | /* range = first_range_tbl[high_nibbles] */ 138 | uint8x16_t range = vqtbl1q_u8(first_range_tbl, high_nibbles); 139 | 140 | /* Second Byte: set range index to first_len */ 141 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */ 142 | /* range |= (first_len, prev_first_len) << 1 byte */ 143 | range = 144 | vorrq_u8(range, vextq_u8(prev_first_len, first_len, 15)); 145 | 146 | /* Third Byte: set range index to saturate_sub(first_len, 1) */ 147 | /* 0 for 00~7F, 0 for C0~DF, 1 for E0~EF, 2 for F0~FF */ 148 | uint8x16_t tmp1, tmp2; 149 | /* tmp1 = (first_len, prev_first_len) << 2 bytes */ 150 | tmp1 = vextq_u8(prev_first_len, first_len, 14); 151 | /* tmp1 = saturate_sub(tmp1, 1) */ 152 | tmp1 = vqsubq_u8(tmp1, const_1); 153 | /* range |= tmp1 */ 154 | range = vorrq_u8(range, tmp1); 155 | 156 | /* Fourth Byte: set range index to saturate_sub(first_len, 2) */ 157 | /* 0 for 00~7F, 0 for C0~DF, 0 for E0~EF, 1 for F0~FF */ 158 | /* tmp2 = (first_len, prev_first_len) << 3 bytes */ 159 | tmp2 = vextq_u8(prev_first_len, first_len, 13); 160 | /* tmp2 = saturate_sub(tmp2, 2) */ 161 | tmp2 = vqsubq_u8(tmp2, const_2); 162 | /* range |= tmp2 */ 163 | range = vorrq_u8(range, tmp2); 164 | 165 | /* 166 | * Now the following range indices have been calculated 167 | * Correct cases: 168 | * - 8 for C0~FF 169 | * - 3 for 1st byte after F0~FF 170 | * - 2 for 1st byte after E0~EF or 2nd byte 
after F0~FF 171 | * - 1 for 1st byte after C0~DF or 2nd byte after E0~EF or 172 | * 3rd byte after F0~FF 173 | * - 0 for others 174 | * Error cases: 175 | * 9,10,11 if non ascii First Byte overlaps 176 | * E.g., F1 80 C2 90 --> 8 3 10 1, where 10 indicates error 177 | */ 178 | 179 | /* Adjust Second Byte range for special First Bytes(E0,ED,F0,F4) */ 180 | /* See _range_adjust_tbl[] definition for details */ 181 | /* Overlaps lead to index 9~15, which are illegal in range table */ 182 | uint8x16_t shift1 = vextq_u8(prev_input, input, 15); 183 | uint8x16_t pos = vsubq_u8(shift1, const_e0); 184 | range = vaddq_u8(range, vqtbl2q_u8(range_adjust_tbl, pos)); 185 | 186 | /* Load min and max values per calculated range index */ 187 | uint8x16_t minv = vqtbl1q_u8(range_min_tbl, range); 188 | uint8x16_t maxv = vqtbl1q_u8(range_max_tbl, range); 189 | 190 | /* Check value range */ 191 | error1 = vorrq_u8(error1, vcltq_u8(input, minv)); 192 | error2 = vorrq_u8(error2, vcgtq_u8(input, maxv)); 193 | 194 | prev_input = input; 195 | prev_first_len = first_len; 196 | 197 | data += 16; 198 | len -= 16; 199 | } 200 | /* Merge the two error masks together */ 201 | error1 = vorrq_u8(error1, error2); 202 | 203 | /* Delay error check till loop ends */ 204 | if (vmaxvq_u8(error1)) 205 | return -1; 206 | 207 | /* Find previous token (not 80~BF) */ 208 | uint32_t token4; 209 | vst1q_lane_u32(&token4, vreinterpretq_u32_u8(prev_input), 3); 210 | 211 | const int8_t *token = (const int8_t *)&token4; 212 | int lookahead = 0; 213 | if (token[3] > (int8_t)0xBF) 214 | lookahead = 1; 215 | else if (token[2] > (int8_t)0xBF) 216 | lookahead = 2; 217 | else if (token[1] > (int8_t)0xBF) 218 | lookahead = 3; 219 | 220 | data -= lookahead; 221 | len += lookahead; 222 | } 223 | 224 | /* Check remaining bytes with naive method */ 225 | return utf8_naive(data, len); 226 | } 227 | 228 | #endif 229 | -------------------------------------------------------------------------------- /range-sse.c: 
-------------------------------------------------------------------------------- 1 | #ifdef __x86_64__ 2 | 3 | #include 4 | #include 5 | #include 6 | 7 | int utf8_naive(const unsigned char *data, int len); 8 | 9 | #if 0 10 | static void print128(const char *s, const __m128i v128) 11 | { 12 | const unsigned char *v8 = (const unsigned char *)&v128; 13 | if (s) 14 | printf("%s:\t", s); 15 | for (int i = 0; i < 16; i++) 16 | printf("%02x ", v8[i]); 17 | printf("\n"); 18 | } 19 | #endif 20 | 21 | /* 22 | * Map high nibble of "First Byte" to legal character length minus 1 23 | * 0x00 ~ 0xBF --> 0 24 | * 0xC0 ~ 0xDF --> 1 25 | * 0xE0 ~ 0xEF --> 2 26 | * 0xF0 ~ 0xFF --> 3 27 | */ 28 | static const int8_t _first_len_tbl[] = { 29 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 30 | }; 31 | 32 | /* Map "First Byte" to 8-th item of range table (0xC2 ~ 0xF4) */ 33 | static const int8_t _first_range_tbl[] = { 34 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 35 | }; 36 | 37 | /* 38 | * Range table, map range index to min and max values 39 | * Index 0 : 00 ~ 7F (First Byte, ascii) 40 | * Index 1,2,3: 80 ~ BF (Second, Third, Fourth Byte) 41 | * Index 4 : A0 ~ BF (Second Byte after E0) 42 | * Index 5 : 80 ~ 9F (Second Byte after ED) 43 | * Index 6 : 90 ~ BF (Second Byte after F0) 44 | * Index 7 : 80 ~ 8F (Second Byte after F4) 45 | * Index 8 : C2 ~ F4 (First Byte, non ascii) 46 | * Index 9~15 : illegal: i >= 127 && i <= -128 47 | */ 48 | static const int8_t _range_min_tbl[] = { 49 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80, 50 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 51 | }; 52 | static const int8_t _range_max_tbl[] = { 53 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F, 54 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 55 | }; 56 | 57 | /* 58 | * Tables for fast handling of four special First Bytes(E0,ED,F0,F4), after 59 | * which the Second Byte are not 80~BF. It contains "range index adjustment". 
60 | * +------------+---------------+------------------+----------------+ 61 | * | First Byte | original range| range adjustment | adjusted range | 62 | * +------------+---------------+------------------+----------------+ 63 | * | E0 | 2 | 2 | 4 | 64 | * +------------+---------------+------------------+----------------+ 65 | * | ED | 2 | 3 | 5 | 66 | * +------------+---------------+------------------+----------------+ 67 | * | F0 | 3 | 3 | 6 | 68 | * +------------+---------------+------------------+----------------+ 69 | * | F4 | 3 | 4 | 7 | 70 | * +------------+---------------+------------------+----------------+ 71 | */ 72 | /* index1 -> E0, index14 -> ED */ 73 | static const int8_t _df_ee_tbl[] = { 74 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 75 | }; 76 | /* index1 -> F0, index5 -> F4 */ 77 | static const int8_t _ef_fe_tbl[] = { 78 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 79 | }; 80 | 81 | #define RET_ERR_IDX 0 /* Define 1 to return index of first error char */ 82 | 83 | /* 5x faster than naive method */ 84 | /* Return 0 - success, -1 - error, >0 - first error char(if RET_ERR_IDX = 1) */ 85 | int utf8_range(const unsigned char *data, int len) 86 | { 87 | #if RET_ERR_IDX 88 | int err_pos = 1; 89 | #endif 90 | 91 | if (len >= 16) { 92 | __m128i prev_input = _mm_set1_epi8(0); 93 | __m128i prev_first_len = _mm_set1_epi8(0); 94 | 95 | /* Cached tables */ 96 | const __m128i first_len_tbl = 97 | _mm_loadu_si128((const __m128i *)_first_len_tbl); 98 | const __m128i first_range_tbl = 99 | _mm_loadu_si128((const __m128i *)_first_range_tbl); 100 | const __m128i range_min_tbl = 101 | _mm_loadu_si128((const __m128i *)_range_min_tbl); 102 | const __m128i range_max_tbl = 103 | _mm_loadu_si128((const __m128i *)_range_max_tbl); 104 | const __m128i df_ee_tbl = 105 | _mm_loadu_si128((const __m128i *)_df_ee_tbl); 106 | const __m128i ef_fe_tbl = 107 | _mm_loadu_si128((const __m128i *)_ef_fe_tbl); 108 | 109 | __m128i error = _mm_set1_epi8(0); 110 | 111 | while (len >= 
16) { 112 | const __m128i input = _mm_loadu_si128((const __m128i *)data); 113 | 114 | /* high_nibbles = input >> 4 */ 115 | const __m128i high_nibbles = 116 | _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0F)); 117 | 118 | /* first_len = legal character length minus 1 */ 119 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */ 120 | /* first_len = first_len_tbl[high_nibbles] */ 121 | __m128i first_len = _mm_shuffle_epi8(first_len_tbl, high_nibbles); 122 | 123 | /* First Byte: set range index to 8 for bytes within 0xC0 ~ 0xFF */ 124 | /* range = first_range_tbl[high_nibbles] */ 125 | __m128i range = _mm_shuffle_epi8(first_range_tbl, high_nibbles); 126 | 127 | /* Second Byte: set range index to first_len */ 128 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */ 129 | /* range |= (first_len, prev_first_len) << 1 byte */ 130 | range = _mm_or_si128( 131 | range, _mm_alignr_epi8(first_len, prev_first_len, 15)); 132 | 133 | /* Third Byte: set range index to saturate_sub(first_len, 1) */ 134 | /* 0 for 00~7F, 0 for C0~DF, 1 for E0~EF, 2 for F0~FF */ 135 | __m128i tmp; 136 | /* tmp = (first_len, prev_first_len) << 2 bytes */ 137 | tmp = _mm_alignr_epi8(first_len, prev_first_len, 14); 138 | /* tmp = saturate_sub(tmp, 1) */ 139 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(1)); 140 | /* range |= tmp */ 141 | range = _mm_or_si128(range, tmp); 142 | 143 | /* Fourth Byte: set range index to saturate_sub(first_len, 2) */ 144 | /* 0 for 00~7F, 0 for C0~DF, 0 for E0~EF, 1 for F0~FF */ 145 | /* tmp = (first_len, prev_first_len) << 3 bytes */ 146 | tmp = _mm_alignr_epi8(first_len, prev_first_len, 13); 147 | /* tmp = saturate_sub(tmp, 2) */ 148 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(2)); 149 | /* range |= tmp */ 150 | range = _mm_or_si128(range, tmp); 151 | 152 | /* 153 | * Now the following range indices have been calculated 154 | * Correct cases: 155 | * - 8 for C0~FF 156 | * - 3 for 1st byte after F0~FF 157 | * - 2 for 1st byte after E0~EF or 2nd byte after F0~FF 158 | * 
- 1 for 1st byte after C0~DF or 2nd byte after E0~EF or 159 | * 3rd byte after F0~FF 160 | * - 0 for others 161 | * Error cases: 162 | * 9,10,11 if non ascii First Byte overlaps 163 | * E.g., F1 80 C2 90 --> 8 3 10 1, where 10 indicates error 164 | */ 165 | 166 | /* Adjust Second Byte range for special First Bytes(E0,ED,F0,F4) */ 167 | /* Overlaps lead to index 9~15, which are illegal in range table */ 168 | __m128i shift1, pos, range2; 169 | /* shift1 = (input, prev_input) << 1 byte */ 170 | shift1 = _mm_alignr_epi8(input, prev_input, 15); 171 | pos = _mm_sub_epi8(shift1, _mm_set1_epi8(0xEF)); 172 | /* 173 | * shift1: | EF F0 ... FE | FF 00 ... ... DE | DF E0 ... EE | 174 | * pos: | 0 1 15 | 16 17 239| 240 241 255| 175 | * pos-240: | 0 0 0 | 0 0 0 | 0 1 15 | 176 | * pos+112: | 112 113 127| >= 128 | >= 128 | 177 | */ 178 | tmp = _mm_subs_epu8(pos, _mm_set1_epi8(0xF0)); 179 | range2 = _mm_shuffle_epi8(df_ee_tbl, tmp); 180 | tmp = _mm_adds_epu8(pos, _mm_set1_epi8(0x70)); 181 | range2 = _mm_add_epi8(range2, _mm_shuffle_epi8(ef_fe_tbl, tmp)); 182 | 183 | range = _mm_add_epi8(range, range2); 184 | 185 | /* Load min and max values per calculated range index */ 186 | __m128i minv = _mm_shuffle_epi8(range_min_tbl, range); 187 | __m128i maxv = _mm_shuffle_epi8(range_max_tbl, range); 188 | 189 | /* Check value range */ 190 | #if RET_ERR_IDX 191 | error = _mm_cmplt_epi8(input, minv); 192 | error = _mm_or_si128(error, _mm_cmpgt_epi8(input, maxv)); 193 | /* 5% performance drop from this conditional branch */ 194 | if (!_mm_testz_si128(error, error)) 195 | break; 196 | #else 197 | /* error |= (input < minv) | (input > maxv) */ 198 | tmp = _mm_or_si128( 199 | _mm_cmplt_epi8(input, minv), 200 | _mm_cmpgt_epi8(input, maxv) 201 | ); 202 | error = _mm_or_si128(error, tmp); 203 | #endif 204 | 205 | prev_input = input; 206 | prev_first_len = first_len; 207 | 208 | data += 16; 209 | len -= 16; 210 | #if RET_ERR_IDX 211 | err_pos += 16; 212 | #endif 213 | } 214 | 215 | #if RET_ERR_IDX 
216 | /* Error in first 16 bytes */ 217 | if (err_pos == 1) 218 | goto do_naive; 219 | #else 220 | if (!_mm_testz_si128(error, error)) 221 | return -1; 222 | #endif 223 | 224 | /* Find previous token (not 80~BF) */ 225 | int32_t token4 = _mm_extract_epi32(prev_input, 3); 226 | const int8_t *token = (const int8_t *)&token4; 227 | int lookahead = 0; 228 | if (token[3] > (int8_t)0xBF) 229 | lookahead = 1; 230 | else if (token[2] > (int8_t)0xBF) 231 | lookahead = 2; 232 | else if (token[1] > (int8_t)0xBF) 233 | lookahead = 3; 234 | 235 | data -= lookahead; 236 | len += lookahead; 237 | #if RET_ERR_IDX 238 | err_pos -= lookahead; 239 | #endif 240 | } 241 | 242 | /* Check remaining bytes with naive method */ 243 | #if RET_ERR_IDX 244 | int err_pos2; 245 | do_naive: 246 | err_pos2 = utf8_naive(data, len); 247 | if (err_pos2) 248 | return err_pos + err_pos2 - 1; 249 | return 0; 250 | #else 251 | return utf8_naive(data, len); 252 | #endif 253 | } 254 | 255 | #endif 256 | -------------------------------------------------------------------------------- /range.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cyb70289/utf8/d7e2737acc1a7416a5ea13bf9e0d693453e775be/range.png -------------------------------------------------------------------------------- /range2-neon.c: -------------------------------------------------------------------------------- 1 | /* 2 | * Process 2x16 bytes in each iteration. 3 | * Comments removed for brevity. See range-neon.c for details. 
4 | */ 5 | #ifdef __aarch64__ 6 | 7 | #include <stdio.h> 8 | #include <stdint.h> 9 | #include <arm_neon.h> 10 | 11 | int utf8_naive(const unsigned char *data, int len); 12 | 13 | static const uint8_t _first_len_tbl[] = { 14 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 15 | }; 16 | 17 | static const uint8_t _first_range_tbl[] = { 18 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 19 | }; 20 | 21 | static const uint8_t _range_min_tbl[] = { 22 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80, 23 | 0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 24 | }; 25 | static const uint8_t _range_max_tbl[] = { 26 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F, 27 | 0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 28 | }; 29 | 30 | static const uint8_t _range_adjust_tbl[] = { 31 | 2, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 32 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 33 | }; 34 | 35 | /* Return 0 on success, -1 on error */ 36 | int utf8_range2(const unsigned char *data, int len) 37 | { 38 | if (len >= 32) { 39 | uint8x16_t prev_input = vdupq_n_u8(0); 40 | uint8x16_t prev_first_len = vdupq_n_u8(0); 41 | 42 | const uint8x16_t first_len_tbl = vld1q_u8(_first_len_tbl); 43 | const uint8x16_t first_range_tbl = vld1q_u8(_first_range_tbl); 44 | const uint8x16_t range_min_tbl = vld1q_u8(_range_min_tbl); 45 | const uint8x16_t range_max_tbl = vld1q_u8(_range_max_tbl); 46 | const uint8x16x2_t range_adjust_tbl = vld2q_u8(_range_adjust_tbl); 47 | 48 | const uint8x16_t const_1 = vdupq_n_u8(1); 49 | const uint8x16_t const_2 = vdupq_n_u8(2); 50 | const uint8x16_t const_e0 = vdupq_n_u8(0xE0); 51 | 52 | uint8x16_t error1 = vdupq_n_u8(0); 53 | uint8x16_t error2 = vdupq_n_u8(0); 54 | uint8x16_t error3 = vdupq_n_u8(0); 55 | uint8x16_t error4 = vdupq_n_u8(0); 56 | 57 | while (len >= 32) { 58 | /******************* two blocks interleaved **********************/ 59 | 60 | #if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ < 8) 61 | /* gcc doesn't support vld1q_u8_x2 until version 8 */ 62 | const uint8x16_t
input_a = vld1q_u8(data); 63 | const uint8x16_t input_b = vld1q_u8(data + 16); 64 | #else 65 | /* Forces a double load on Clang */ 66 | const uint8x16x2_t input_pair = vld1q_u8_x2(data); 67 | const uint8x16_t input_a = input_pair.val[0]; 68 | const uint8x16_t input_b = input_pair.val[1]; 69 | #endif 70 | 71 | const uint8x16_t high_nibbles_a = vshrq_n_u8(input_a, 4); 72 | const uint8x16_t high_nibbles_b = vshrq_n_u8(input_b, 4); 73 | 74 | const uint8x16_t first_len_a = 75 | vqtbl1q_u8(first_len_tbl, high_nibbles_a); 76 | const uint8x16_t first_len_b = 77 | vqtbl1q_u8(first_len_tbl, high_nibbles_b); 78 | 79 | uint8x16_t range_a = vqtbl1q_u8(first_range_tbl, high_nibbles_a); 80 | uint8x16_t range_b = vqtbl1q_u8(first_range_tbl, high_nibbles_b); 81 | 82 | range_a = 83 | vorrq_u8(range_a, vextq_u8(prev_first_len, first_len_a, 15)); 84 | range_b = 85 | vorrq_u8(range_b, vextq_u8(first_len_a, first_len_b, 15)); 86 | 87 | uint8x16_t tmp1_a, tmp2_a, tmp1_b, tmp2_b; 88 | tmp1_a = vextq_u8(prev_first_len, first_len_a, 14); 89 | tmp1_a = vqsubq_u8(tmp1_a, const_1); 90 | range_a = vorrq_u8(range_a, tmp1_a); 91 | 92 | tmp1_b = vextq_u8(first_len_a, first_len_b, 14); 93 | tmp1_b = vqsubq_u8(tmp1_b, const_1); 94 | range_b = vorrq_u8(range_b, tmp1_b); 95 | 96 | tmp2_a = vextq_u8(prev_first_len, first_len_a, 13); 97 | tmp2_a = vqsubq_u8(tmp2_a, const_2); 98 | range_a = vorrq_u8(range_a, tmp2_a); 99 | 100 | tmp2_b = vextq_u8(first_len_a, first_len_b, 13); 101 | tmp2_b = vqsubq_u8(tmp2_b, const_2); 102 | range_b = vorrq_u8(range_b, tmp2_b); 103 | 104 | uint8x16_t shift1_a = vextq_u8(prev_input, input_a, 15); 105 | uint8x16_t pos_a = vsubq_u8(shift1_a, const_e0); 106 | range_a = vaddq_u8(range_a, vqtbl2q_u8(range_adjust_tbl, pos_a)); 107 | 108 | uint8x16_t shift1_b = vextq_u8(input_a, input_b, 15); 109 | uint8x16_t pos_b = vsubq_u8(shift1_b, const_e0); 110 | range_b = vaddq_u8(range_b, vqtbl2q_u8(range_adjust_tbl, pos_b)); 111 | 112 | uint8x16_t minv_a = vqtbl1q_u8(range_min_tbl, 
range_a); 113 | uint8x16_t maxv_a = vqtbl1q_u8(range_max_tbl, range_a); 114 | 115 | uint8x16_t minv_b = vqtbl1q_u8(range_min_tbl, range_b); 116 | uint8x16_t maxv_b = vqtbl1q_u8(range_max_tbl, range_b); 117 | 118 | error1 = vorrq_u8(error1, vcltq_u8(input_a, minv_a)); 119 | error2 = vorrq_u8(error2, vcgtq_u8(input_a, maxv_a)); 120 | 121 | error3 = vorrq_u8(error3, vcltq_u8(input_b, minv_b)); 122 | error4 = vorrq_u8(error4, vcgtq_u8(input_b, maxv_b)); 123 | 124 | /************************ next iteration *************************/ 125 | prev_input = input_b; 126 | prev_first_len = first_len_b; 127 | 128 | data += 32; 129 | len -= 32; 130 | } 131 | error1 = vorrq_u8(error1, error2); 132 | error1 = vorrq_u8(error1, error3); 133 | error1 = vorrq_u8(error1, error4); 134 | 135 | if (vmaxvq_u8(error1)) 136 | return -1; 137 | 138 | uint32_t token4; 139 | vst1q_lane_u32(&token4, vreinterpretq_u32_u8(prev_input), 3); 140 | 141 | const int8_t *token = (const int8_t *)&token4; 142 | int lookahead = 0; 143 | if (token[3] > (int8_t)0xBF) 144 | lookahead = 1; 145 | else if (token[2] > (int8_t)0xBF) 146 | lookahead = 2; 147 | else if (token[1] > (int8_t)0xBF) 148 | lookahead = 3; 149 | 150 | data -= lookahead; 151 | len += lookahead; 152 | } 153 | 154 | return utf8_naive(data, len); 155 | } 156 | 157 | #endif 158 | -------------------------------------------------------------------------------- /range2-sse.c: -------------------------------------------------------------------------------- 1 | /* 2 | * Process 2x16 bytes in each iteration. 3 | * Comments removed for brevity. See range-sse.c for details. 
4 | */ 5 | #ifdef __x86_64__ 6 | 7 | #include <stdio.h> 8 | #include <stdint.h> 9 | #include <x86intrin.h> 10 | 11 | int utf8_naive(const unsigned char *data, int len); 12 | 13 | static const int8_t _first_len_tbl[] = { 14 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 15 | }; 16 | 17 | static const int8_t _first_range_tbl[] = { 18 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 19 | }; 20 | 21 | static const int8_t _range_min_tbl[] = { 22 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80, 23 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 24 | }; 25 | static const int8_t _range_max_tbl[] = { 26 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F, 27 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 28 | }; 29 | 30 | static const int8_t _df_ee_tbl[] = { 31 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 32 | }; 33 | static const int8_t _ef_fe_tbl[] = { 34 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 35 | }; 36 | 37 | /* Return 0 on success, -1 on error */ 38 | int utf8_range2(const unsigned char *data, int len) 39 | { 40 | if (len >= 32) { 41 | __m128i prev_input = _mm_set1_epi8(0); 42 | __m128i prev_first_len = _mm_set1_epi8(0); 43 | 44 | const __m128i first_len_tbl = 45 | _mm_loadu_si128((const __m128i *)_first_len_tbl); 46 | const __m128i first_range_tbl = 47 | _mm_loadu_si128((const __m128i *)_first_range_tbl); 48 | const __m128i range_min_tbl = 49 | _mm_loadu_si128((const __m128i *)_range_min_tbl); 50 | const __m128i range_max_tbl = 51 | _mm_loadu_si128((const __m128i *)_range_max_tbl); 52 | const __m128i df_ee_tbl = 53 | _mm_loadu_si128((const __m128i *)_df_ee_tbl); 54 | const __m128i ef_fe_tbl = 55 | _mm_loadu_si128((const __m128i *)_ef_fe_tbl); 56 | 57 | __m128i error = _mm_set1_epi8(0); 58 | 59 | while (len >= 32) { 60 | /***************************** block 1 ****************************/ 61 | const __m128i input_a = _mm_loadu_si128((const __m128i *)data); 62 | 63 | __m128i high_nibbles = 64 | _mm_and_si128(_mm_srli_epi16(input_a, 4), _mm_set1_epi8(0x0F)); 65 | 66 | __m128i
first_len_a = _mm_shuffle_epi8(first_len_tbl, high_nibbles); 67 | 68 | __m128i range_a = _mm_shuffle_epi8(first_range_tbl, high_nibbles); 69 | 70 | range_a = _mm_or_si128( 71 | range_a, _mm_alignr_epi8(first_len_a, prev_first_len, 15)); 72 | 73 | __m128i tmp; 74 | tmp = _mm_alignr_epi8(first_len_a, prev_first_len, 14); 75 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(1)); 76 | range_a = _mm_or_si128(range_a, tmp); 77 | 78 | tmp = _mm_alignr_epi8(first_len_a, prev_first_len, 13); 79 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(2)); 80 | range_a = _mm_or_si128(range_a, tmp); 81 | 82 | __m128i shift1, pos, range2; 83 | shift1 = _mm_alignr_epi8(input_a, prev_input, 15); 84 | pos = _mm_sub_epi8(shift1, _mm_set1_epi8(0xEF)); 85 | tmp = _mm_subs_epu8(pos, _mm_set1_epi8(0xF0)); 86 | range2 = _mm_shuffle_epi8(df_ee_tbl, tmp); 87 | tmp = _mm_adds_epu8(pos, _mm_set1_epi8(0x70)); 88 | range2 = _mm_add_epi8(range2, _mm_shuffle_epi8(ef_fe_tbl, tmp)); 89 | 90 | range_a = _mm_add_epi8(range_a, range2); 91 | 92 | __m128i minv = _mm_shuffle_epi8(range_min_tbl, range_a); 93 | __m128i maxv = _mm_shuffle_epi8(range_max_tbl, range_a); 94 | 95 | tmp = _mm_or_si128( 96 | _mm_cmplt_epi8(input_a, minv), 97 | _mm_cmpgt_epi8(input_a, maxv) 98 | ); 99 | error = _mm_or_si128(error, tmp); 100 | 101 | /***************************** block 2 ****************************/ 102 | const __m128i input_b = _mm_loadu_si128((const __m128i *)(data+16)); 103 | 104 | high_nibbles = 105 | _mm_and_si128(_mm_srli_epi16(input_b, 4), _mm_set1_epi8(0x0F)); 106 | 107 | __m128i first_len_b = _mm_shuffle_epi8(first_len_tbl, high_nibbles); 108 | 109 | __m128i range_b = _mm_shuffle_epi8(first_range_tbl, high_nibbles); 110 | 111 | range_b = _mm_or_si128( 112 | range_b, _mm_alignr_epi8(first_len_b, first_len_a, 15)); 113 | 114 | 115 | tmp = _mm_alignr_epi8(first_len_b, first_len_a, 14); 116 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(1)); 117 | range_b = _mm_or_si128(range_b, tmp); 118 | 119 | tmp = _mm_alignr_epi8(first_len_b, 
first_len_a, 13); 120 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(2)); 121 | range_b = _mm_or_si128(range_b, tmp); 122 | 123 | shift1 = _mm_alignr_epi8(input_b, input_a, 15); 124 | pos = _mm_sub_epi8(shift1, _mm_set1_epi8(0xEF)); 125 | tmp = _mm_subs_epu8(pos, _mm_set1_epi8(0xF0)); 126 | range2 = _mm_shuffle_epi8(df_ee_tbl, tmp); 127 | tmp = _mm_adds_epu8(pos, _mm_set1_epi8(0x70)); 128 | range2 = _mm_add_epi8(range2, _mm_shuffle_epi8(ef_fe_tbl, tmp)); 129 | 130 | range_b = _mm_add_epi8(range_b, range2); 131 | 132 | minv = _mm_shuffle_epi8(range_min_tbl, range_b); 133 | maxv = _mm_shuffle_epi8(range_max_tbl, range_b); 134 | 135 | 136 | tmp = _mm_or_si128( 137 | _mm_cmplt_epi8(input_b, minv), 138 | _mm_cmpgt_epi8(input_b, maxv) 139 | ); 140 | error = _mm_or_si128(error, tmp); 141 | 142 | /************************ next iteration **************************/ 143 | prev_input = input_b; 144 | prev_first_len = first_len_b; 145 | 146 | data += 32; 147 | len -= 32; 148 | } 149 | 150 | if (!_mm_testz_si128(error, error)) 151 | return -1; 152 | 153 | int32_t token4 = _mm_extract_epi32(prev_input, 3); 154 | const int8_t *token = (const int8_t *)&token4; 155 | int lookahead = 0; 156 | if (token[3] > (int8_t)0xBF) 157 | lookahead = 1; 158 | else if (token[2] > (int8_t)0xBF) 159 | lookahead = 2; 160 | else if (token[1] > (int8_t)0xBF) 161 | lookahead = 3; 162 | 163 | data -= lookahead; 164 | len += lookahead; 165 | } 166 | 167 | return utf8_naive(data, len); 168 | } 169 | 170 | #endif 171 | -------------------------------------------------------------------------------- /utf8_to_utf16/.gitignore: -------------------------------------------------------------------------------- 1 | utf8to16 2 | -------------------------------------------------------------------------------- /utf8_to_utf16/Makefile: -------------------------------------------------------------------------------- 1 | CC = gcc 2 | CPPFLAGS = -g -O3 -Wall -march=native 3 | 4 | OBJS = main.o iconv.o naive.o 5 | 6 | 
utf8to16: ${OBJS} 7 | gcc $^ -o $@ 8 | 9 | .PHONY: clean 10 | clean: 11 | rm -f utf8to16 *.o 12 | -------------------------------------------------------------------------------- /utf8_to_utf16/iconv.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | #include <stdlib.h> 3 | #include <errno.h> 4 | #include <iconv.h> 5 | 6 | static iconv_t s_cd; 7 | 8 | /* Call iconv_open only once so the benchmark will be faster? */ 9 | static void __attribute__ ((constructor)) init_iconv(void) 10 | { 11 | #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ 12 | s_cd = iconv_open("UTF-16LE", "UTF-8"); 13 | #else 14 | s_cd = iconv_open("UTF-16BE", "UTF-8"); 15 | #endif 16 | if (s_cd == (iconv_t)-1) { 17 | perror("iconv_open"); 18 | exit(1); 19 | } 20 | } 21 | 22 | /* 23 | * Parameters: 24 | * - buf8, len8: input utf-8 string 25 | * - buf16: buffer to store decoded utf-16 string 26 | * - *len16: on entry - utf-16 buffer length in bytes 27 | * on exit - length in bytes of valid decoded utf-16 string 28 | * Returns: 29 | * - 0: success 30 | * - >0: error position of input utf-8 string 31 | * - -1: utf-16 buffer overflow 32 | * LE/BE depends on host 33 | */ 34 | int utf8_to16_iconv(const unsigned char *buf8, size_t len8, 35 | unsigned short *buf16, size_t *len16) 36 | { 37 | size_t ret, len16_save = *len16; 38 | const unsigned char *buf8_0 = buf8; 39 | 40 | ret = iconv(s_cd, (char **)&buf8, &len8, (char **)&buf16, len16); 41 | 42 | *len16 = len16_save - *len16; 43 | 44 | if (ret != (size_t)-1) 45 | return 0; 46 | 47 | if (errno == E2BIG) 48 | return -1; /* Output buffer full */ 49 | 50 | return buf8 - buf8_0 + 1; /* EILSEQ, EINVAL, error position */ 51 | } 52 | -------------------------------------------------------------------------------- /utf8_to_utf16/main.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | #include <stdlib.h> 3 | #include <string.h> 4 | #include <stdint.h> 5 | #include <unistd.h> 6 | #include <fcntl.h> 7 | #include <sys/types.h> 8 | #include <sys/stat.h> 9 | #include <sys/time.h> 10 | 11 | int
utf8_to16_iconv(const unsigned char *buf8, size_t len8, 12 | unsigned short *buf16, size_t *len16); 13 | int utf8_to16_naive(const unsigned char *buf8, size_t len8, 14 | unsigned short *buf16, size_t *len16); 15 | 16 | static struct ftab { 17 | const char *name; 18 | int (*func)(const unsigned char *buf8, size_t len8, 19 | unsigned short *buf16, size_t *len16); 20 | } ftab[] = { 21 | { 22 | .name = "iconv", 23 | .func = utf8_to16_iconv, 24 | }, { 25 | .name = "naive", 26 | .func = utf8_to16_naive, 27 | }, 28 | }; 29 | 30 | static unsigned char *load_test_buf(int len) 31 | { 32 | const char utf8[] = "\xF0\x90\xBF\x80"; 33 | const int utf8_len = sizeof(utf8)/sizeof(utf8[0]) - 1; 34 | 35 | unsigned char *data = malloc(len); 36 | unsigned char *p = data; 37 | 38 | while (len >= utf8_len) { 39 | memcpy(p, utf8, utf8_len); 40 | p += utf8_len; 41 | len -= utf8_len; 42 | } 43 | 44 | while (len--) 45 | *p++ = 0x7F; 46 | 47 | return data; 48 | } 49 | 50 | static unsigned char *load_test_file(int *len) 51 | { 52 | unsigned char *data; 53 | int fd; 54 | struct stat stat; 55 | 56 | fd = open("../UTF-8-demo.txt", O_RDONLY); 57 | if (fd == -1) { 58 | printf("Failed to open ../UTF-8-demo.txt!\n"); 59 | exit(1); 60 | } 61 | if (fstat(fd, &stat) == -1) { 62 | printf("Failed to get file size!\n"); 63 | exit(1); 64 | } 65 | 66 | *len = stat.st_size; 67 | data = malloc(*len); 68 | if (read(fd, data, *len) != *len) { 69 | printf("Failed to read file!\n"); 70 | exit(1); 71 | } 72 | 73 | close(fd); 74 | 75 | return data; 76 | } 77 | 78 | static void print_test(const unsigned char *data, int len) 79 | { 80 | printf(" [len=%d] \"", len); 81 | while (len--) 82 | printf("\\x%02X", *data++); 83 | 84 | printf("\"\n"); 85 | } 86 | 87 | struct test { 88 | const unsigned char *data; 89 | int len; 90 | }; 91 | 92 | static void prepare_test_buf(unsigned char *buf, const struct test *pos, 93 | int pos_len, int pos_idx) 94 | { 95 | /* Round concatenate correct tokens to 1024 bytes */ 96 | int buf_idx 
= 0; 97 | while (buf_idx < 1024) { 98 | int buf_len = 1024 - buf_idx; 99 | 100 | if (buf_len >= pos[pos_idx].len) { 101 | memcpy(buf+buf_idx, pos[pos_idx].data, pos[pos_idx].len); 102 | buf_idx += pos[pos_idx].len; 103 | } else { 104 | memset(buf+buf_idx, 0, buf_len); 105 | buf_idx += buf_len; 106 | } 107 | 108 | if (++pos_idx == pos_len) 109 | pos_idx = 0; 110 | } 111 | } 112 | 113 | /* Return 0 on success, -1 on error */ 114 | static int test_manual(const struct ftab *ftab, unsigned short *buf16, 115 | unsigned short *_buf16) 116 | { 117 | #define LEN16 4096 118 | 119 | #pragma GCC diagnostic push 120 | #pragma GCC diagnostic ignored "-Wpointer-sign" 121 | /* positive tests */ 122 | static const struct test pos[] = { 123 | {"", 0}, 124 | {"\x00", 1}, 125 | {"\x66", 1}, 126 | {"\x7F", 1}, 127 | {"\x00\x7F", 2}, 128 | {"\x7F\x00", 2}, 129 | {"\xC2\x80", 2}, 130 | {"\xDF\xBF", 2}, 131 | {"\xE0\xA0\x80", 3}, 132 | {"\xE0\xA0\xBF", 3}, 133 | {"\xED\x9F\x80", 3}, 134 | {"\xEF\x80\xBF", 3}, 135 | {"\xF0\x90\xBF\x80", 4}, 136 | {"\xF2\x81\xBE\x99", 4}, 137 | {"\xF4\x8F\x88\xAA", 4}, 138 | }; 139 | 140 | /* negative tests */ 141 | static const struct test neg[] = { 142 | {"\x80", 1}, 143 | {"\xBF", 1}, 144 | {"\xC0\x80", 2}, 145 | {"\xC1\x00", 2}, 146 | {"\xC2\x7F", 2}, 147 | {"\xDF\xC0", 2}, 148 | {"\xE0\x9F\x80", 3}, 149 | {"\xE0\xC2\x80", 3}, 150 | {"\xED\xA0\x80", 3}, 151 | {"\xED\x7F\x80", 3}, 152 | {"\xEF\x80\x00", 3}, 153 | {"\xF0\x8F\x80\x80", 4}, 154 | {"\xF0\xEE\x80\x80", 4}, 155 | {"\xF2\x90\x91\x7F", 4}, 156 | {"\xF4\x90\x88\xAA", 4}, 157 | {"\xF4\x00\xBF\xBF", 4}, 158 | {"\x00\x00\x00\x00\x00\xC2\x80\x00\x00\x00\xE1\x80\x80\x00\x00\xC2" \ 159 | "\xC2\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00", 160 | 32}, 161 | {"\x00\x00\x00\x00\x00\xC2\xC2\x80\x00\x00\xE1\x80\x80\x00\x00\x00", 162 | 16}, 163 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \ 164 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80", 165 | 
32}, 166 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \ 167 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1", 168 | 32}, 169 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \ 170 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \ 171 | "\x80", 33}, 172 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \ 173 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \ 174 | "\xC2\x80", 34}, 175 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \ 176 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF0" \ 177 | "\x80\x80\x80", 35}, 178 | }; 179 | #pragma GCC diagnostic pop 180 | 181 | size_t len16 = LEN16, _len16 = LEN16; 182 | int ret, _ret; 183 | 184 | /* Test single token */ 185 | for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) { 186 | ret = ftab->func(pos[i].data, pos[i].len, buf16, &len16); 187 | _ret = utf8_to16_iconv(pos[i].data, pos[i].len, _buf16, &_len16); 188 | if (ret != _ret || len16 != _len16 || memcmp(buf16, _buf16, len16)) { 189 | printf("FAILED positive test(%d:%d, %lu:%lu): ", 190 | ret, _ret, len16, _len16); 191 | print_test(pos[i].data, pos[i].len); 192 | return -1; 193 | } 194 | len16 = _len16 = LEN16; 195 | } 196 | for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) { 197 | ret = ftab->func(neg[i].data, neg[i].len, buf16, &len16); 198 | _ret = utf8_to16_iconv(neg[i].data, neg[i].len, _buf16, &_len16); 199 | if (ret != _ret || len16 != _len16 || memcmp(buf16, _buf16, len16)) { 200 | printf("FAILED negative test(%d:%d, %lu:%lu): ", 201 | ret, _ret, len16, _len16); 202 | print_test(neg[i].data, neg[i].len); 203 | return -1; 204 | } 205 | len16 = _len16 = LEN16; 206 | } 207 | 208 | /* Test shifted buffer to cover 1k length */ 209 | /* buffer size must be greater than 1024 + 16 + max(test string length) */ 210 | const int max_size = 1024*2; 211 | uint64_t buf64[max_size/8 + 2]; 212 | /* Offset
8 bytes by 1 byte */ 213 | unsigned char *buf = ((unsigned char *)buf64) + 1; 214 | int buf_len; 215 | 216 | for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) { 217 | /* Positive test: shift 16 bytes, validate each shift */ 218 | prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), i); 219 | buf_len = 1024; 220 | for (int j = 0; j < 16; ++j) { 221 | ret = ftab->func(buf, buf_len, buf16, &len16); 222 | _ret = utf8_to16_iconv(buf, buf_len, _buf16, &_len16); 223 | if (ret != _ret || len16 != _len16 || \ 224 | memcmp(buf16, _buf16, len16)) { 225 | printf("FAILED positive test(%d:%d, %lu:%lu): ", 226 | ret, _ret, len16, _len16); 227 | print_test(buf, buf_len); 228 | return -1; 229 | } 230 | len16 = _len16 = LEN16; 231 | for (int k = buf_len; k >= 1; --k) 232 | buf[k] = buf[k-1]; 233 | buf[0] = '\x55'; 234 | ++buf_len; 235 | } 236 | 237 | /* Negative test: truncate the last non-ASCII character */ 238 | while (buf_len >= 1 && buf[buf_len-1] <= 0x7F) 239 | --buf_len; 240 | if (buf_len) { 241 | ret = ftab->func(buf, buf_len-1, buf16, &len16); 242 | _ret = utf8_to16_iconv(buf, buf_len-1, _buf16, &_len16); 243 | if (ret != _ret || len16 != _len16 || \ 244 | memcmp(buf16, _buf16, len16)) { 245 | printf("FAILED negative test(%d:%d, %lu:%lu): ", 246 | ret, _ret, len16, _len16); 247 | print_test(buf, buf_len-1); 248 | return -1; 249 | } 250 | len16 = _len16 = LEN16; 251 | } 252 | } 253 | 254 | /* Negative test */ 255 | for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) { 256 | /* Append one error token, shift 16 bytes, validate each shift */ 257 | int pos_idx = i % (sizeof(pos)/sizeof(pos[0])); 258 | prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), pos_idx); 259 | memcpy(buf+1024, neg[i].data, neg[i].len); 260 | buf_len = 1024 + neg[i].len; 261 | for (int j = 0; j < 16; ++j) { 262 | ret = ftab->func(buf, buf_len, buf16, &len16); 263 | _ret = utf8_to16_iconv(buf, buf_len, _buf16, &_len16); 264 | if (ret != _ret || len16 != _len16 || \ 265 | memcmp(buf16, _buf16, len16)) { 266 |
printf("FAILED negative test(%d:%d, %lu:%lu): ", 267 | ret, _ret, len16, _len16); 268 | print_test(buf, buf_len); 269 | return -1; 270 | } 271 | len16 = _len16 = LEN16; 272 | for (int k = buf_len; k >= 1; --k) 273 | buf[k] = buf[k-1]; 274 | buf[0] = '\x66'; 275 | ++buf_len; 276 | } 277 | } 278 | 279 | return 0; 280 | } 281 | 282 | static void test(const unsigned char *buf8, size_t len8, 283 | unsigned short *buf16, size_t len16, const struct ftab *ftab) 284 | { 285 | /* Use iconv as the reference answer */ 286 | if (strcmp(ftab->name, "iconv") == 0) 287 | return; 288 | 289 | printf("%s\n", ftab->name); 290 | 291 | /* Test file or buffer */ 292 | size_t _len16 = len16; 293 | unsigned short *_buf16 = (unsigned short *)malloc(_len16); 294 | if (utf8_to16_iconv(buf8, len8, _buf16, &_len16)) { 295 | printf("Invalid test file or buffer!\n"); 296 | exit(1); 297 | } 298 | printf("standard test: "); 299 | if (ftab->func(buf8, len8, buf16, &len16) || len16 != _len16 || \ 300 | memcmp(buf16, _buf16, len16) != 0) 301 | printf("FAIL\n"); 302 | else 303 | printf("pass\n"); 304 | free(_buf16); 305 | 306 | /* Manual cases */ 307 | unsigned short *mbuf8 = (unsigned short *)malloc(LEN16); 308 | unsigned short *mbuf16 = (unsigned short *)malloc(LEN16); 309 | printf("manual test: %s\n", 310 | test_manual(ftab, mbuf8, mbuf16) ? "FAIL" : "pass"); 311 | free(mbuf8); 312 | free(mbuf16); 313 | printf("\n"); 314 | } 315 | 316 | static void bench(const unsigned char *buf8, size_t len8, 317 | unsigned short *buf16, size_t len16, const struct ftab *ftab) 318 | { 319 | const int loops = 1024*1024*1024/len8; 320 | int ret = 0; 321 | double time, size; 322 | struct timeval tv1, tv2; 323 | 324 | fprintf(stderr, "bench %s... 
", ftab->name); 325 | gettimeofday(&tv1, 0); 326 | for (int i = 0; i < loops; ++i) 327 | ret |= ftab->func(buf8, len8, buf16, &len16); 328 | gettimeofday(&tv2, 0); 329 | printf("%s\n", ret?"FAIL":"pass"); 330 | 331 | time = tv2.tv_usec - tv1.tv_usec; 332 | time = time / 1000000 + tv2.tv_sec - tv1.tv_sec; 333 | size = ((double)len8 * loops) / (1024*1024); 334 | printf("time: %.4f s\n", time); 335 | printf("data: %.0f MB\n", size); 336 | printf("BW: %.2f MB/s\n", size / time); 337 | printf("\n"); 338 | } 339 | 340 | static void usage(const char *bin) 341 | { 342 | printf("Usage:\n"); 343 | printf("%s test [alg] ==> test all or one algorithm\n", bin); 344 | printf("%s bench [alg] ==> benchmark all or one algorithm\n", bin); 345 | printf("%s bench size NUM ==> benchmark with specific buffer size\n", bin); 346 | printf("alg = "); 347 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) 348 | printf("%s ", ftab[i].name); 349 | printf("\nNUM = buffer size in bytes, 1 ~ 67108864(64M)\n"); 350 | } 351 | 352 | int main(int argc, char *argv[]) 353 | { 354 | int len8 = 0, len16; 355 | unsigned char *buf8; 356 | unsigned short *buf16; 357 | const char *alg = NULL; 358 | void (*tb)(const unsigned char *buf8, size_t len8, 359 | unsigned short *buf16, size_t len16, const struct ftab *ftab); 360 | 361 | tb = NULL; 362 | if (argc >= 2) { 363 | if (strcmp(argv[1], "test") == 0) 364 | tb = test; 365 | else if (strcmp(argv[1], "bench") == 0) 366 | tb = bench; 367 | if (argc >= 3) { 368 | alg = argv[2]; 369 | if (strcmp(alg, "size") == 0) { 370 | if (argc < 4) { 371 | tb = NULL; 372 | } else { 373 | alg = NULL; 374 | len8 = atoi(argv[3]); 375 | if (len8 <= 0 || len8 > 67108864) { 376 | printf("Buffer size error!\n\n"); 377 | tb = NULL; 378 | } 379 | } 380 | } 381 | } 382 | } 383 | 384 | if (tb == NULL) { 385 | usage(argv[0]); 386 | return 1; 387 | } 388 | 389 | /* Load UTF8 test buffer */ 390 | if (len8) 391 | buf8 = load_test_buf(len8); 392 | else 393 | buf8 = 
load_test_file(&len8); 394 | 395 | /* Prepare UTF16 buffer large enough */ 396 | len16 = len8 * 2; 397 | buf16 = (unsigned short *)malloc(len16); 398 | 399 | if (tb == bench) 400 | printf("============== Bench UTF8 (%d bytes) ==============\n", len8); 401 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) { 402 | if (alg && strcmp(alg, ftab[i].name) != 0) 403 | continue; 404 | tb((const unsigned char *)buf8, len8, buf16, len16, &ftab[i]); 405 | } 406 | 407 | #if 0 408 | if (tb == bench) { 409 | printf("==================== Bench ASCII ====================\n"); 410 | /* Change test buffer to ascii */ 411 | for (int i = 0; i < len; i++) 412 | data[i] &= 0x7F; 413 | 414 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) { 415 | if (alg && strcmp(alg, ftab[i].name) != 0) 416 | continue; 417 | tb((const unsigned char *)data, len, &ftab[i]); 418 | printf("\n"); 419 | } 420 | } 421 | #endif 422 | 423 | return 0; 424 | } 425 | -------------------------------------------------------------------------------- /utf8_to_utf16/naive.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | 3 | /* 4 | * UTF-8 to UTF-16 5 | * Table from https://woboq.com/blog/utf-8-processing-using-simd.html 6 | * 7 | * +-------------------------------------+-------------------+ 8 | * | UTF-8 | UTF-16LE (HI LO) | 9 | * +-------------------------------------+-------------------+ 10 | * | 0aaaaaaa | 00000000 0aaaaaaa | 11 | * +-------------------------------------+-------------------+ 12 | * | 110bbbbb 10aaaaaa | 00000bbb bbaaaaaa | 13 | * +-------------------------------------+-------------------+ 14 | * | 1110cccc 10bbbbbb 10aaaaaa | ccccbbbb bbaaaaaa | 15 | * +-------------------------------------+-------------------+ 16 | * | 11110ddd 10ddcccc 10bbbbbb 10aaaaaa | 110110uu uuccccbb | 17 | * + uuuu = ddddd - 1 | 110111bb bbaaaaaa | 18 | * +-------------------------------------+-------------------+ 19 | */ 20 | 21 | /* 22 | * Parameters: 23 | *
- buf8, len8: input utf-8 string 24 | * - buf16: buffer to store decoded utf-16 string 25 | * - *len16: on entry - utf-16 buffer length in bytes 26 | * on exit - length in bytes of valid decoded utf-16 string 27 | * Returns: 28 | * - 0: success 29 | * - >0: error position of input utf-8 string 30 | * - -1: utf-16 buffer overflow 31 | * LE/BE depends on host 32 | */ 33 | int utf8_to16_naive(const unsigned char *buf8, size_t len8, 34 | unsigned short *buf16, size_t *len16) 35 | { 36 | int err_pos = 1; 37 | size_t len16_left = *len16; 38 | 39 | *len16 = 0; 40 | 41 | while (len8) { 42 | unsigned char b0, b1, b2, b3; 43 | unsigned int u; 44 | 45 | /* Output buffer full */ 46 | if (len16_left < 2) 47 | return -1; 48 | 49 | /* 1st byte */ 50 | b0 = buf8[0]; 51 | 52 | if ((b0 & 0x80) == 0) { 53 | /* 0aaaaaaa -> 00000000 0aaaaaaa */ 54 | *buf16++ = b0; 55 | ++buf8; 56 | --len8; 57 | ++err_pos; 58 | *len16 += 2; 59 | len16_left -= 2; 60 | continue; 61 | } 62 | 63 | /* Character length */ 64 | size_t clen = b0 & 0xF0; 65 | clen >>= 4; /* 10xx, 110x, 1110, 1111 */ 66 | clen -= 12; /* -4~-1, 0/1, 2, 3 */ 67 | clen += !clen; /* -4~-1, 1, 2, 3 */ 68 | 69 | /* String too short or invalid 1st byte (10xxxxxx) */ 70 | if (len8 <= clen) 71 | return err_pos; 72 | 73 | /* Trailing bytes must be within 0x80 ~ 0xBF */ 74 | b1 = buf8[1]; 75 | if ((signed char)b1 >= (signed char)0xC0) 76 | return err_pos; 77 | b1 &= 0x3F; 78 | 79 | ++clen; 80 | if (clen == 2) { 81 | u = b0 & 0x1F; 82 | u <<= 6; 83 | u |= b1; 84 | if (u <= 0x7F) 85 | return err_pos; 86 | *buf16++ = u; 87 | } else { 88 | b2 = buf8[2]; 89 | if ((signed char)b2 >= (signed char)0xC0) 90 | return err_pos; 91 | b2 &= 0x3F; 92 | if (clen == 3) { 93 | u = b0 & 0x0F; 94 | u <<= 6; 95 | u |= b1; 96 | u <<= 6; 97 | u |= b2; 98 | if (u <= 0x7FF || (u >= 0xD800 && u <= 0xDFFF)) 99 | return err_pos; 100 | *buf16++ = u; 101 | } else { 102 | /* clen == 4 */ 103 | if (len16_left < 4) 104 | return -1; /* Output buffer full */ 105 | b3 = 
buf8[3]; 106 | if ((signed char)b3 >= (signed char)0xC0) 107 | return err_pos; 108 | u = b0 & 0x07; 109 | u <<= 6; 110 | u |= b1; 111 | u <<= 6; 112 | u |= b2; 113 | u <<= 6; 114 | u |= (b3 & 0x3F); 115 | if (u <= 0xFFFF || u > 0x10FFFF) 116 | return err_pos; 117 | u -= 0x10000; 118 | *buf16++ = (((u >> 10) & 0x3FF) | 0xD800); 119 | *buf16++ = ((u & 0x3FF) | 0xDC00); 120 | *len16 += 2; 121 | len16_left -= 2; 122 | } 123 | } 124 | 125 | buf8 += clen; 126 | len8 -= clen; 127 | err_pos += clen; 128 | *len16 += 2; 129 | len16_left -= 2; 130 | } 131 | 132 | return 0; 133 | } 134 | --------------------------------------------------------------------------------
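The 4-byte branch of utf8_to16_naive above subtracts 0x10000 from the decoded code point and splits the remaining 20 bits into a high and a low surrogate. A minimal standalone sketch of just that step follows. The helper name `encode_surrogates` is hypothetical and not part of the repository; it only restates the arithmetic from the table in naive.c.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only (not in the repository).
 * Splits a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if the code point is outside that range.
 */
static int encode_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* upper 10 bits -> 110110xx xxxxxxxx */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* lower 10 bits -> 110111xx xxxxxxxx */
    return 0;
}
```

For the benchmark token F0 90 BF 80 used in load_test_buf, the decoded code point is U+10FC0, so this sketch yields the pair D803 DFC0. Keeping the range check inside the helper mirrors the `u <= 0xFFFF || u > 0x10FFFF` test in naive.c, which rejects overlong encodings and values beyond the Unicode range before any surrogate is written.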