├── .gitignore
├── .travis.yml
├── LICENSE
├── Makefile
├── README.md
├── UTF-8-demo.txt
├── ascii.cpp
├── boost.cpp
├── lemire-avx2.c
├── lemire-neon.c
├── lemire-sse.c
├── lookup.c
├── main.c
├── naive.c
├── range-avx2.c
├── range-neon.c
├── range-sse.c
├── range.png
├── range2-neon.c
├── range2-sse.c
└── utf8_to_utf16
    ├── .gitignore
    ├── Makefile
    ├── iconv.c
    ├── main.c
    └── naive.c
/.gitignore:
--------------------------------------------------------------------------------
1 | utf8
2 | utf8-boost
3 | ascii
4 | *.o
5 | *.swp
6 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: c
2 | sudo: false
3 | arch:
4 | - amd64
5 | - arm64
6 | os: linux
7 | dist: bionic
8 |
9 | script:
10 | - make
11 | - ./utf8 test
12 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Yibo Cai
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | CC = gcc
2 | CXX = g++
3 | CPPFLAGS = -g -O3 -Wall -march=native
4 | CXXFLAGS = -std=c++11
5 |
6 | OBJS = main.o naive.o lookup.o lemire-sse.o lemire-neon.o \
7 | range-sse.o range-neon.o range2-sse.o range2-neon.o \
8 | lemire-avx2.o range-avx2.o
9 |
10 | utf8: ${OBJS}
11 | 	gcc $^ -o $@
12 |
13 | utf8-boost: CFLAGS += -DBOOST
14 | utf8-boost: ${OBJS} boost.o
15 | 	g++ $^ -o $@
16 |
17 | .PHONY: clean
18 | clean:
19 | 	rm -f utf8 utf8-boost ascii *.o
20 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Fast UTF-8 validation with Range algorithm (NEON+SSE4+AVX2)
4 |
5 | This is a new algorithm that leverages SIMD for fast UTF-8 string validation. Both **NEON** (armv8a) and **SSE4** versions are implemented. The **AVX2** implementation was contributed by [ioioioio](https://github.com/ioioioio).
6 |
7 | Four UTF-8 validation methods are compared on both x86 and Arm platforms. Benchmark results show that the range-based algorithm is the fastest on Arm and matches the performance of [Lemire's approach](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/) on x86.
8 |
9 | * Range based algorithm
10 | * range-neon.c: NEON version
11 | * range-sse.c: SSE4 version
12 | * range-avx2.c: AVX2 version
13 | * range2-neon.c, range2-sse.c: Process two blocks in one iteration
14 | * [Lemire's SIMD implementation](https://github.com/lemire/fastvalidate-utf-8)
15 | * lemire-sse.c: SSE4 version
16 | * lemire-avx2.c: AVX2 version
17 | * lemire-neon.c: NEON porting
18 | * naive.c: Naive UTF-8 validation byte by byte
19 | * lookup.c: [Lookup-table method](http://bjoern.hoehrmann.de/utf-8/decoder/dfa/)
20 |
21 | ## About the code
22 |
23 | * Run "make" to build. Built and tested with gcc-7.3.
24 | * Run "./utf8" to see all command line options.
25 | * Benchmark
26 | * Run "./utf8 bench" to benchmark all algorithms with [default test file](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt).
27 | * Run "./utf8 bench size NUM" to benchmark specified string size.
28 | * Run "./utf8 test" to test all algorithms with positive and negative test cases.
29 | * To benchmark or test specific algorithm, run something like "./utf8 bench range".
30 |
31 | ## Benchmark result (MB/s)
32 |
33 | ### Method
34 | 1. Generate UTF-8 test buffer per [test file](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) or buffer size.
35 | 1. Call validation sub-routines in a loop until 1G bytes are checked.
36 | 1. Calculate speed(MB/s) of validating UTF-8 strings.
37 |
38 | ### NEON(armv8a)
39 | Test case | naive | lookup | lemire | range | range2
40 | :-------- | :---- | :----- | :----- | :---- | :-----
41 | [UTF-8-demo.txt](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) | 562.25 | 412.84 | 1198.50 | 1411.72 | **1579.85**
42 | 32 bytes | 651.55 | 441.70 | 891.38 | 1003.95 | **1043.58**
43 | 33 bytes | 660.00 | 446.78 | 588.77 | 1009.31 | **1048.12**
44 | 129 bytes | 771.89 | 402.55 | 938.07 | 1283.77 | **1401.76**
45 | 1K bytes | 811.92 | 411.58 | 1188.96 | 1398.15 | **1560.23**
46 | 8K bytes | 812.25 | 412.74 | 1198.90 | 1412.18 | **1580.65**
47 | 64K bytes | 817.35 | 412.24 | 1200.20 | 1415.11 | **1583.86**
48 | 1M bytes | 815.70 | 411.93 | 1200.93 | 1415.65 | **1585.40**
49 |
50 | ### SSE4(E5-2650)
51 | Test case | naive | lookup | lemire | range | range2
52 | :-------- | :---- | :----- | :----- | :---- | :-----
53 | [UTF-8-demo.txt](https://raw.githubusercontent.com/cyb70289/utf8/master/UTF-8-demo.txt) | 753.70 | 310.41 | 3954.74 | 3945.60 | **3986.13**
54 | 32 bytes | 1135.76 | 364.07 | **2890.52** | 2351.81 | 2173.02
55 | 33 bytes | 1161.85 | 376.29 | 1352.95 | **2239.55** | 2041.43
56 | 129 bytes | 1161.22 | 322.47 | 2742.49 | **3315.33** | 3249.35
57 | 1K bytes | 1310.95 | 310.72 | 3755.88 | 3781.23 | **3874.17**
58 | 8K bytes | 1348.32 | 307.93 | 3860.71 | 3922.81 | **3968.93**
59 | 64K bytes | 1301.34 | 308.39 | 3935.15 | 3973.50 | **3983.44**
60 | 1M bytes | 1279.78 | 309.06 | 3923.51 | 3953.00 | **3960.49**
61 |
62 | ## Range algorithm analysis
63 |
64 | Basic idea:
65 | * Load 16 bytes
66 | * Leverage SIMD to calculate value range for each byte efficiently
67 | * Validate 16 bytes at once
68 |
69 | ### UTF-8 coding format
70 |
71 | http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf, page 94
72 |
73 | Table 3-7. Well-Formed UTF-8 Byte Sequences
74 |
75 | Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
76 | :---------- | :--------- | :---------- | :--------- | :---------- |
77 | U+0000..U+007F | 00..7F | | | |
78 | U+0080..U+07FF | C2..DF | 80..BF | | |
79 | U+0800..U+0FFF | E0 | ***A0***..BF| 80..BF | |
80 | U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
81 | U+D000..U+D7FF | ED | 80..***9F***| 80..BF | |
82 | U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
83 | U+10000..U+3FFFF | F0 | ***90***..BF| 80..BF | 80..BF |
84 | U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
85 | U+100000..U+10FFFF | F4 | 80..***8F***| 80..BF | 80..BF |
86 |
87 | To summarise UTF-8 encoding:
88 | * Depending on the First Byte, a legal character is 1, 2, 3 or 4 bytes long
89 | * For First Byte within C0..DF, character length = 2
90 | * For First Byte within E0..EF, character length = 3
91 | * For First Byte within F0..F4, character length = 4
92 | * C0, C1, F5..FF are not allowed
93 | * Second, Third and Fourth Bytes must lie in 80..BF.
94 | * There are four **special cases** for the Second Byte, shown in ***bold italic*** in the table above.
95 |
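The rules summarised above can be written down directly as a scalar check. The following is a minimal sketch, not the repository's `naive.c`; the function name `utf8_check_scalar` and its return convention (1 for valid, 0 for invalid) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if data[0..len) is well-formed UTF-8
 * according to Table 3-7, 0 otherwise. */
int utf8_check_scalar(const uint8_t *data, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = data[i];
        size_t n;                      /* continuation bytes expected */
        uint8_t lo = 0x80, hi = 0xBF;  /* allowed range of the Second Byte */

        if (b <= 0x7F) {               /* ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            n = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;  /* special case: E0 */
            if (b == 0xED) hi = 0x9F;  /* special case: ED */
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 3;
            if (b == 0xF0) lo = 0x90;  /* special case: F0 */
            if (b == 0xF4) hi = 0x8F;  /* special case: F4 */
        } else {
            return 0;                  /* stray 80..BF, C0, C1, F5..FF */
        }

        if (len - i <= n)              /* character is truncated */
            return 0;
        if (data[i + 1] < lo || data[i + 1] > hi)
            return 0;
        for (size_t k = 2; k <= n; k++)
            if (data[i + k] < 0x80 || data[i + k] > 0xBF)
                return 0;
        i += n + 1;
    }
    return 1;
}
```

For example, `E0 A0 80` is accepted, while `E0 9F 80` (overlong) and `ED A0 80` (surrogate) are rejected by the Second Byte range check.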
96 | ### Range table
97 |
98 | The range table maps range indices 0 ~ 15 to the minimum and maximum values allowed. Our task is to examine the input string, find the pattern, set the correct range index for each byte, and then validate the input string.
99 |
100 | Index | Min | Max | Byte type
101 | :---- | :-- | :-- | :--------
102 | 0 | 00 | 7F | First Byte, ASCII
103 | 1,2,3 | 80 | BF | Second, Third, Fourth Bytes
104 | 4 | A0 | BF | Second Byte after E0
105 | 5 | 80 | 9F | Second Byte after ED
106 | 6 | 90 | BF | Second Byte after F0
107 | 7 | 80 | 8F | Second Byte after F4
108 | 8 | C2 | F4 | First Byte, non-ASCII
109 | 9..15(NEON) | FF | 00 | Illegal: unsigned char >= 255 && unsigned char <= 0
110 | 9..15(SSE) | 7F | 80 | Illegal: signed char >= 127 && signed char <= -128
111 |
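As a rough illustration, the range table can be stored as two 16-entry arrays and applied one byte at a time. This sketch follows the NEON (unsigned) row for indices 9..15; the names `range_min`, `range_max` and `byte_in_range` are hypothetical and not taken from the repository.

```c
#include <stdint.h>

/* Minimum allowed value per range index (unsigned variant). */
static const uint8_t range_min[16] = {
    0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
    0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
};

/* Maximum allowed value per range index (unsigned variant). */
static const uint8_t range_max[16] = {
    0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
    0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Returns 1 if byte is legal for the given range index, 0 otherwise.
 * Indices 9..15 have min > max, so no byte can ever pass. */
int byte_in_range(uint8_t byte, unsigned index)
{
    index &= 15;
    return byte >= range_min[index] && byte <= range_max[index];
}
```

In the SIMD versions the same comparison would be done for 16 bytes at once, with the two arrays serving as lookup tables indexed by the per-byte range index.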
112 | ### Calculate byte ranges (ignore special cases)
113 |
114 | Ignoring the four special cases (E0, ED, F0, F4), how should we set the range index for each byte?
115 |
116 | * Set range index to 0 (00..7F) for all bytes by default
117 | * Find non-ASCII First Bytes (C0..FF), set their range index to 8 (C2..F4)
118 | * For First Byte within C0..DF, set the next byte's range index to 1 (80..BF)
119 | * For First Byte within E0..EF, set the next two bytes' range indices to 2,1 (80..BF) in sequence
120 | * For First Byte within F0..FF, set the next three bytes' range indices to 3,2,1 (80..BF) in sequence
121 |
122 | To implement the above operations efficiently with SIMD:
123 | * For 16 input bytes, use a lookup table to map C0..DF to 1, E0..EF to 2, F0..FF to 3, others to 0. Save the result to first_len.
124 | * Map C0..FF to 8 to get the range indices for First Byte.
125 | * Shift first_len by one byte to get the range indices for Second Byte.
126 | * Saturating-subtract one from first_len (3->2, 2->1, 1->0, 0->0), then shift by two bytes to get the range indices for Third Byte.
127 | * Saturating-subtract two from first_len (3->1, 2->0, 1->0, 0->0), then shift by three bytes to get the range indices for Fourth Byte.
128 |
129 | Example(assume no previous data)
130 |
131 | Input | F1 | 80 | 80 | 80 | 80 | C2 | 80 | 80 | ...
132 | :---- | :- | :- | :- | :- | :- | :- | :- | :- | :--
133 | *first_len* |*3* |*0* |*0* |*0* |*0* |*1* |*0* |*0* |*...*
134 | First Byte | 8 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | ...
135 | Second Byte | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | ...
136 | Third Byte | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ...
137 | Fourth Byte | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ...
138 | Range index | 8 | 3 | 2 | 1 | 0 | 8 | 1 | 0 | ...
139 |
140 | ```c
141 | Range_index = First_Byte | Second_Byte | Third_Byte | Fourth_Byte
142 | ```
143 |
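The lane-wise steps above can be emulated one byte at a time to check the example by hand. This is a scalar sketch only; `first_len_of`, `sat_sub` and `range_index_scalar` are illustrative names, and bytes before the start of the buffer are treated as absent ("no previous data").

```c
#include <stddef.h>
#include <stdint.h>

/* The "first_len" row: continuation bytes announced by a byte. */
static uint8_t first_len_of(uint8_t b)
{
    if (b >= 0xF0) return 3;
    if (b >= 0xE0) return 2;
    if (b >= 0xC0) return 1;
    return 0;
}

/* Saturating subtraction on unsigned bytes. */
static uint8_t sat_sub(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Computes the range index of every byte, ignoring the special cases. */
void range_index_scalar(const uint8_t *in, uint8_t *range, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* "Shift by k bytes" becomes "look at the byte k positions back". */
        uint8_t first  = in[i] >= 0xC0 ? 8 : 0;
        uint8_t second = i >= 1 ? first_len_of(in[i - 1]) : 0;
        uint8_t third  = i >= 2 ? sat_sub(first_len_of(in[i - 2]), 1) : 0;
        uint8_t fourth = i >= 3 ? sat_sub(first_len_of(in[i - 3]), 2) : 0;

        range[i] = first | second | third | fourth;
    }
}
```

Running it on the example input `F1 80 80 80 80 C2 80 80` yields `8 3 2 1 0 8 1 0`, matching the last row of the table.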
144 | #### Error handling
145 |
146 | * C0, C1, F5..FF are not included in the range table and will always be detected.
147 | * An illegal 80..BF byte will have range index 0 (00..7F) and be detected.
148 | * Based on the First Byte, the corresponding Second, Third and Fourth Bytes will have range index 1/2/3, which ensures they lie in 80..BF.
149 | * If non-ASCII First Bytes overlap, the above algorithm will set the range index of the latter First Byte to 9, 10 or 11, which are illegal ranges. E.g., Input = F1 80 C2 90 --> Range index = 8 3 10 1, where 10 indicates an error. See the table below.
150 |
151 | Overlapped non-ASCII First Byte
152 |
153 | Input | F1 | 80 | C2 | 90
154 | :---- | :- | :- | :- | :-
155 | *first_len* |*3* |*0* |*1* |*0*
156 | First Byte | 8 | 0 | 8 | 0
157 | Second Byte | 0 | 3 | 0 | 1
158 | Third Byte | 0 | 0 | 2 | 0
159 | Fourth Byte | 0 | 0 | 0 | 1
160 | Range index | 8 | 3 |***10***| 1
161 |
162 | ### Adjust Second Byte range for special cases
163 |
164 | Range index adjustment for four special cases
165 |
166 | First Byte | Second Byte | Before adjustment | Correct index | Adjustment |
167 | :--------- | :---------- | :---------------- | :------------ | :---------
168 | E0 | A0..BF | 2 | 4 | **2**
169 | ED | 80..9F | 2 | 5 | **3**
170 | F0 | 90..BF | 3 | 6 | **3**
171 | F4 | 80..8F | 3 | 7 | **4**
172 |
173 | Range index adjustment can be reduced to the following problem:
174 |
175 | ***Given 16 bytes, replace E0 with 2, ED with 3, F0 with 3, F4 with 4, others with 0.***
176 |
177 | A naive SIMD approach:
178 | 1. Compare 16 bytes with E0, get the mask for each byte (FF if equal, 00 otherwise)
179 | 1. AND the mask with 2 to get the adjustment for E0
180 | 1. Repeat steps 1 and 2 for ED, F0, F4
181 |
182 | At least **eight** operations are required for the naive approach.
183 |
184 | Observing that the special bytes (E0, ED, F0, F4) are close to each other, we can do much better using a lookup table.
185 |
186 | #### NEON
187 |
188 | The NEON ```tbl``` instruction is very convenient for table lookup:
189 | * Table can be up to 16x4 bytes in size
190 | * Return zero if index is out of range
191 |
192 | Leveraging these features, we can solve the problem with as few as **two** operations:
193 | * Precreate a 16x2 lookup table, where table[0]=2, table[13]=3, table[16]=3, table[20]=4, table[others]=0.
194 | * Subtract E0 from the input bytes (E0 -> 0, ED -> 13, F0 -> 16, F4 -> 20).
195 | * Use the subtracted byte as the index into the lookup table and get the range adjustment directly.
196 | * For indices less than 32, we get zero or the required adjustment value per input byte
197 | * For out-of-bound indices, we get zero per ```tbl``` behaviour
198 |
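To make the two-operation lookup concrete, here is a scalar emulation of a single lane. It is a sketch under the assumption that out-of-range indices return zero, as described above for ```tbl```; the names `adjust_table` and `second_byte_adjust_neon` are hypothetical.

```c
#include <stdint.h>

/* 16x2 = 32-entry table; unlisted entries are zero. */
static const uint8_t adjust_table[32] = {
    [0] = 2, [13] = 3, [16] = 3, [20] = 4,
};

/* Emulates one lane: subtract E0, then look up with zero for
 * out-of-range indices. */
uint8_t second_byte_adjust_neon(uint8_t first_byte)
{
    uint8_t idx = (uint8_t)(first_byte - 0xE0);  /* wraps modulo 256 */

    return idx < 32 ? adjust_table[idx] : 0;
}
```

Any byte below E0 wraps around to an index of 32 or more and therefore maps to zero, which is why no extra masking step is needed.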
199 | #### SSE
200 |
201 | The SSE ```pshufb``` instruction is not as friendly as NEON ```tbl``` in this case:
202 | * Table can only be 16 bytes in size
203 | * Out-of-bound indices are handled this way:
204 | * If 7-th bit of index is 0, the least significant four bits are used as index (e.g., index 0x73 returns 3rd element)
205 | * If 7-th bit of index is 1, return 0 (e.g., index 0x83 returns 0)
206 |
207 | We can still leverage these features to solve the problem in **five** operations:
208 | * Precreate two tables:
209 | * table_df[1] = 2, table_df[14] = 3, table_df[others] = 0
210 | * table_ef[1] = 3, table_ef[5] = 4, table_ef[others] = 0
211 | * Subtract EF from the input bytes (E0 -> 241, ED -> 254, F0 -> 1, F4 -> 5) to get the temporary indices
212 | * Get range index for E0,ED
213 | * Saturating-subtract 240 from the temporary indices (E0 -> 1, ED -> 14, all values of 240 or below become 0)
214 | * Use the subtracted indices to look up table_df, get the correct adjustment
215 | * Get range index for F0,F4
216 | * Saturating-add 112 (0x70) to the temporary indices (F0 -> 0x71, F4 -> 0x75, all values of 16 or more become 128 or more, i.e. 7-th bit set)
217 | * Use the added indices to look up table_ef, get the correct adjustment (index 0x71,0x75 returns 1st,5th elements, per ```pshufb``` behaviour)
218 |
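The five steps can be checked with a scalar emulation of one ```pshufb``` lane. This is a sketch, assuming unsigned saturating arithmetic for both the subtract and the add; `pshufb_lane`, `sat_sub_u8`, `sat_add_u8` and `second_byte_adjust_sse` are hypothetical names.

```c
#include <stdint.h>

/* One lane of pshufb: zero if bit 7 of the index is set,
 * otherwise the entry selected by the low four bits. */
static uint8_t pshufb_lane(const uint8_t table[16], uint8_t idx)
{
    return (idx & 0x80) ? 0 : table[idx & 0x0F];
}

/* Unsigned saturating subtract and add on bytes. */
static uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Emulates one lane of the two-table adjustment lookup. */
uint8_t second_byte_adjust_sse(uint8_t first_byte)
{
    static const uint8_t table_df[16] = { [1] = 2, [14] = 3 };
    static const uint8_t table_ef[16] = { [1] = 3, [5] = 4 };

    uint8_t tmp = (uint8_t)(first_byte - 0xEF);      /* wraps modulo 256 */
    uint8_t adj_df = pshufb_lane(table_df, sat_sub_u8(tmp, 240));
    uint8_t adj_ef = pshufb_lane(table_ef, sat_add_u8(tmp, 112));

    /* At most one of the two lookups is non-zero for any input. */
    return adj_df | adj_ef;
}
```

The saturating add is what pushes every temporary index of 16 or more past 0x7F, so those lanes read as zero from `table_ef` instead of aliasing onto entries 1 or 5.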
219 | #### Error handling
220 |
221 | * For overlapped non-ASCII First Byte, range index before adjustment is 9,10,11. After adjustment (adds 2,3,4 or 0), the range index will be 9 to 15, which is still illegal in range table. So the error will be detected.
222 |
223 | ### Handling remaining bytes
224 |
225 | For the remaining input of less than 16 bytes, we fall back to the naive byte-by-byte approach to validate them, which is actually faster than SIMD processing.
226 | * Look back into the last 16-byte buffer to find the First Byte. At most three bytes need to be looked back. Otherwise we either happen to be at a character boundary, or there are errors we have already detected.
227 | * Validate the string byte by byte starting from the First Byte.
228 |
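The look-back step might be sketched as follows. This is an illustration rather than the repository's code: `tail_restart_pos` is a hypothetical name, `pos` is the offset at which the remaining bytes begin, and the return value is the offset from which byte-by-byte validation should restart.

```c
#include <stddef.h>
#include <stdint.h>

/* Finds where the naive validator should restart for the tail that
 * begins at data[pos]. Looks back over at most three bytes. */
size_t tail_restart_pos(const uint8_t *data, size_t pos)
{
    for (size_t back = 1; back <= 3 && back <= pos; back++) {
        /* Anything other than 10xxxxxx (80..BF) starts a character. */
        if ((data[pos - back] & 0xC0) != 0x80)
            return pos - back;
    }
    /* Three continuation bytes in a row (or no previous data): either
     * we are at a character boundary or the error was already caught. */
    return pos;
}
```

If the nearest non-continuation byte turns out to be ASCII, restarting there only re-validates a few bytes that are already known to be good, which keeps the sketch simple.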
229 | ## Tests
230 |
231 | It's necessary to design test cases that cover as many corner cases as possible.
232 |
233 | ### Positive cases
234 |
235 | 1. Prepare correct characters
236 | 2. Validate correct characters
237 | 3. Validate long strings
238 | * Cyclically concatenate characters, starting from the first character, up to 1024 bytes
239 | * Validate 1024 bytes string
240 | * Shift 1 byte, validate 1025 bytes string
241 | * Shift 2 bytes, validate 1026 bytes string
242 | * ...
243 | * Shift 16 bytes, validate 1040 bytes string
244 | 4. Repeat step 3, test buffer starting from second character
245 | 5. Repeat step 3, test buffer starting from third character
246 | 6. ...
247 |
248 | ### Negative cases
249 |
250 | 1. Prepare bad characters and bad strings
251 | * Bad character
252 | * Bad character crossing a 16-byte boundary
253 | * Bad character crossing the boundary between the last 16 bytes and the remaining bytes
254 | 2. Test long strings
255 | * Prepare correct long strings same as positive cases
256 | * Append bad characters
257 | * Shift one byte for each iteration
258 | * Validate each shift
259 |
260 | ## Code breakdown
261 |
262 | The table below shows how 16 input bytes are processed step by step. See [range-neon.c](range-neon.c) for the corresponding code.
263 |
264 | 
265 |
--------------------------------------------------------------------------------
/UTF-8-demo.txt:
--------------------------------------------------------------------------------
1 |
2 | UTF-8 encoded sample plain-text file
3 | ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
4 |
5 | Markus Kuhn [ˈmaʳkʊs kuːn] — 2002-07-25 CC BY
6 |
7 |
8 | The ASCII compatible UTF-8 encoding used in this plain-text file
9 | is defined in Unicode, ISO 10646-1, and RFC 2279.
10 |
11 |
12 | Using Unicode/UTF-8, you can write in emails and source code things such as
13 |
14 | Mathematics and sciences:
15 |
16 | ∮ E⋅da = Q, n → ∞, ∑ f(i) = ∏ g(i), ⎧⎡⎛┌─────┐⎞⎤⎫
17 | ⎪⎢⎜│a²+b³ ⎟⎥⎪
18 | ∀x∈ℝ: ⌈x⌉ = −⌊−x⌋, α ∧ ¬β = ¬(¬α ∨ β), ⎪⎢⎜│───── ⎟⎥⎪
19 | ⎪⎢⎜⎷ c₈ ⎟⎥⎪
20 | ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⎨⎢⎜ ⎟⎥⎬
21 | ⎪⎢⎜ ∞ ⎟⎥⎪
22 | ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (⟦A⟧ ⇔ ⟪B⟫), ⎪⎢⎜ ⎲ ⎟⎥⎪
23 | ⎪⎢⎜ ⎳aⁱ-bⁱ⎟⎥⎪
24 | 2H₂ + O₂ ⇌ 2H₂O, R = 4.7 kΩ, ⌀ 200 mm ⎩⎣⎝i=1 ⎠⎦⎭
25 |
26 | Linguistics and dictionaries:
27 |
28 | ði ıntəˈnæʃənəl fəˈnɛtık əsoʊsiˈeıʃn
29 | Y [ˈʏpsilɔn], Yen [jɛn], Yoga [ˈjoːgɑ]
30 |
31 | APL:
32 |
33 | ((V⍳V)=⍳⍴V)/V←,V ⌷←⍳→⍴∆∇⊃‾⍎⍕⌈
34 |
35 | Nicer typography in plain text files:
36 |
37 | ╔══════════════════════════════════════════╗
38 | ║ ║
39 | ║ • ‘single’ and “double” quotes ║
40 | ║ ║
41 | ║ • Curly apostrophes: “We’ve been here” ║
42 | ║ ║
43 | ║ • Latin-1 apostrophe and accents: '´` ║
44 | ║ ║
45 | ║ • ‚deutsche‘ „Anführungszeichen“ ║
46 | ║ ║
47 | ║ • †, ‡, ‰, •, 3–4, —, −5/+5, ™, … ║
48 | ║ ║
49 | ║ • ASCII safety test: 1lI|, 0OD, 8B ║
50 | ║ ╭─────────╮ ║
51 | ║ • the euro symbol: │ 14.95 € │ ║
52 | ║ ╰─────────╯ ║
53 | ╚══════════════════════════════════════════╝
54 |
55 | Combining characters:
56 |
57 | STARGΛ̊TE SG-1, a = v̇ = r̈, a⃑ ⊥ b⃑
58 |
59 | Greek (in Polytonic):
60 |
61 | The Greek anthem:
62 |
63 | Σὲ γνωρίζω ἀπὸ τὴν κόψη
64 | τοῦ σπαθιοῦ τὴν τρομερή,
65 | σὲ γνωρίζω ἀπὸ τὴν ὄψη
66 | ποὺ μὲ βία μετράει τὴ γῆ.
67 |
68 | ᾿Απ᾿ τὰ κόκκαλα βγαλμένη
69 | τῶν ῾Ελλήνων τὰ ἱερά
70 | καὶ σὰν πρῶτα ἀνδρειωμένη
71 | χαῖρε, ὦ χαῖρε, ᾿Ελευθεριά!
72 |
73 | From a speech of Demosthenes in the 4th century BC:
74 |
75 | Οὐχὶ ταὐτὰ παρίσταταί μοι γιγνώσκειν, ὦ ἄνδρες ᾿Αθηναῖοι,
76 | ὅταν τ᾿ εἰς τὰ πράγματα ἀποβλέψω καὶ ὅταν πρὸς τοὺς
77 | λόγους οὓς ἀκούω· τοὺς μὲν γὰρ λόγους περὶ τοῦ
78 | τιμωρήσασθαι Φίλιππον ὁρῶ γιγνομένους, τὰ δὲ πράγματ᾿
79 | εἰς τοῦτο προήκοντα, ὥσθ᾿ ὅπως μὴ πεισόμεθ᾿ αὐτοὶ
80 | πρότερον κακῶς σκέψασθαι δέον. οὐδέν οὖν ἄλλο μοι δοκοῦσιν
81 | οἱ τὰ τοιαῦτα λέγοντες ἢ τὴν ὑπόθεσιν, περὶ ἧς βουλεύεσθαι,
82 | οὐχὶ τὴν οὖσαν παριστάντες ὑμῖν ἁμαρτάνειν. ἐγὼ δέ, ὅτι μέν
83 | ποτ᾿ ἐξῆν τῇ πόλει καὶ τὰ αὑτῆς ἔχειν ἀσφαλῶς καὶ Φίλιππον
84 | τιμωρήσασθαι, καὶ μάλ᾿ ἀκριβῶς οἶδα· ἐπ᾿ ἐμοῦ γάρ, οὐ πάλαι
85 | γέγονεν ταῦτ᾿ ἀμφότερα· νῦν μέντοι πέπεισμαι τοῦθ᾿ ἱκανὸν
86 | προλαβεῖν ἡμῖν εἶναι τὴν πρώτην, ὅπως τοὺς συμμάχους
87 | σώσομεν. ἐὰν γὰρ τοῦτο βεβαίως ὑπάρξῃ, τότε καὶ περὶ τοῦ
88 | τίνα τιμωρήσεταί τις καὶ ὃν τρόπον ἐξέσται σκοπεῖν· πρὶν δὲ
89 | τὴν ἀρχὴν ὀρθῶς ὑποθέσθαι, μάταιον ἡγοῦμαι περὶ τῆς
90 | τελευτῆς ὁντινοῦν ποιεῖσθαι λόγον.
91 |
92 | Δημοσθένους, Γ´ ᾿Ολυνθιακὸς
93 |
94 | Georgian:
95 |
96 | From a Unicode conference invitation:
97 |
98 | გთხოვთ ახლავე გაიაროთ რეგისტრაცია Unicode-ის მეათე საერთაშორისო
99 | კონფერენციაზე დასასწრებად, რომელიც გაიმართება 10-12 მარტს,
100 | ქ. მაინცში, გერმანიაში. კონფერენცია შეჰკრებს ერთად მსოფლიოს
101 | ექსპერტებს ისეთ დარგებში როგორიცაა ინტერნეტი და Unicode-ი,
102 | ინტერნაციონალიზაცია და ლოკალიზაცია, Unicode-ის გამოყენება
103 | ოპერაციულ სისტემებსა, და გამოყენებით პროგრამებში, შრიფტებში,
104 | ტექსტების დამუშავებასა და მრავალენოვან კომპიუტერულ სისტემებში.
105 |
106 | Russian:
107 |
108 | From a Unicode conference invitation:
109 |
110 | Зарегистрируйтесь сейчас на Десятую Международную Конференцию по
111 | Unicode, которая состоится 10-12 марта 1997 года в Майнце в Германии.
112 | Конференция соберет широкий круг экспертов по вопросам глобального
113 | Интернета и Unicode, локализации и интернационализации, воплощению и
114 | применению Unicode в различных операционных системах и программных
115 | приложениях, шрифтах, верстке и многоязычных компьютерных системах.
116 |
117 | Thai (UCS Level 2):
118 |
119 | Excerpt from a poetry on The Romance of The Three Kingdoms (a Chinese
120 | classic 'San Gua'):
121 |
122 | [----------------------------|------------------------]
123 | ๏ แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช พระปกเกศกองบู๊กู้ขึ้นใหม่
124 | สิบสองกษัตริย์ก่อนหน้าแลถัดไป สององค์ไซร้โง่เขลาเบาปัญญา
125 | ทรงนับถือขันทีเป็นที่พึ่ง บ้านเมืองจึงวิปริตเป็นนักหนา
126 | โฮจิ๋นเรียกทัพทั่วหัวเมืองมา หมายจะฆ่ามดชั่วตัวสำคัญ
127 | เหมือนขับไสไล่เสือจากเคหา รับหมาป่าเข้ามาเลยอาสัญ
128 | ฝ่ายอ้องอุ้นยุแยกให้แตกกัน ใช้สาวนั้นเป็นชนวนชื่นชวนใจ
129 | พลันลิฉุยกุยกีกลับก่อเหตุ ช่างอาเพศจริงหนาฟ้าร้องไห้
130 | ต้องรบราฆ่าฟันจนบรรลัย ฤๅหาใครค้ำชูกู้บรรลังก์ ฯ
131 |
132 | (The above is a two-column text. If combining characters are handled
133 | correctly, the lines of the second column should be aligned with the
134 | | character above.)
135 |
136 | Ethiopian:
137 |
138 | Proverbs in the Amharic language:
139 |
140 | ሰማይ አይታረስ ንጉሥ አይከሰስ።
141 | ብላ ካለኝ እንደአባቴ በቆመጠኝ።
142 | ጌጥ ያለቤቱ ቁምጥና ነው።
143 | ደሀ በሕልሙ ቅቤ ባይጠጣ ንጣት በገደለው።
144 | የአፍ ወለምታ በቅቤ አይታሽም።
145 | አይጥ በበላ ዳዋ ተመታ።
146 | ሲተረጉሙ ይደረግሙ።
147 | ቀስ በቀስ፥ ዕንቁላል በእግሩ ይሄዳል።
148 | ድር ቢያብር አንበሳ ያስር።
149 | ሰው እንደቤቱ እንጅ እንደ ጉረቤቱ አይተዳደርም።
150 | እግዜር የከፈተውን ጉሮሮ ሳይዘጋው አይድርም።
151 | የጎረቤት ሌባ፥ ቢያዩት ይስቅ ባያዩት ያጠልቅ።
152 | ሥራ ከመፍታት ልጄን ላፋታት።
153 | ዓባይ ማደሪያ የለው፥ ግንድ ይዞ ይዞራል።
154 | የእስላም አገሩ መካ የአሞራ አገሩ ዋርካ።
155 | ተንጋሎ ቢተፉ ተመልሶ ባፉ።
156 | ወዳጅህ ማር ቢሆን ጨርስህ አትላሰው።
157 | እግርህን በፍራሽህ ልክ ዘርጋ።
158 |
159 | Runes:
160 |
161 | ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ
162 |
163 | (Old English, which transcribed into Latin reads 'He cwaeth that he
164 | bude thaem lande northweardum with tha Westsae.' and means 'He said
165 | that he lived in the northern land near the Western Sea.')
166 |
167 | Braille:
168 |
169 | ⡌⠁⠧⠑ ⠼⠁⠒ ⡍⠜⠇⠑⠹⠰⠎ ⡣⠕⠌
170 |
171 | ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠙⠑⠁⠙⠒ ⠞⠕ ⠃⠑⠛⠔ ⠺⠊⠹⠲ ⡹⠻⠑ ⠊⠎ ⠝⠕ ⠙⠳⠃⠞
172 | ⠱⠁⠞⠑⠧⠻ ⠁⠃⠳⠞ ⠹⠁⠞⠲ ⡹⠑ ⠗⠑⠛⠊⠌⠻ ⠕⠋ ⠙⠊⠎ ⠃⠥⠗⠊⠁⠇ ⠺⠁⠎
173 | ⠎⠊⠛⠝⠫ ⠃⠹ ⠹⠑ ⠊⠇⠻⠛⠹⠍⠁⠝⠂ ⠹⠑ ⠊⠇⠻⠅⠂ ⠹⠑ ⠥⠝⠙⠻⠞⠁⠅⠻⠂
174 | ⠁⠝⠙ ⠹⠑ ⠡⠊⠑⠋ ⠍⠳⠗⠝⠻⠲ ⡎⠊⠗⠕⠕⠛⠑ ⠎⠊⠛⠝⠫ ⠊⠞⠲ ⡁⠝⠙
175 | ⡎⠊⠗⠕⠕⠛⠑⠰⠎ ⠝⠁⠍⠑ ⠺⠁⠎ ⠛⠕⠕⠙ ⠥⠏⠕⠝ ⠰⡡⠁⠝⠛⠑⠂ ⠋⠕⠗ ⠁⠝⠹⠹⠔⠛ ⠙⠑
176 | ⠡⠕⠎⠑ ⠞⠕ ⠏⠥⠞ ⠙⠊⠎ ⠙⠁⠝⠙ ⠞⠕⠲
177 |
178 | ⡕⠇⠙ ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠁⠎ ⠙⠑⠁⠙ ⠁⠎ ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲
179 |
180 | ⡍⠔⠙⠖ ⡊ ⠙⠕⠝⠰⠞ ⠍⠑⠁⠝ ⠞⠕ ⠎⠁⠹ ⠹⠁⠞ ⡊ ⠅⠝⠪⠂ ⠕⠋ ⠍⠹
181 | ⠪⠝ ⠅⠝⠪⠇⠫⠛⠑⠂ ⠱⠁⠞ ⠹⠻⠑ ⠊⠎ ⠏⠜⠞⠊⠊⠥⠇⠜⠇⠹ ⠙⠑⠁⠙ ⠁⠃⠳⠞
182 | ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲ ⡊ ⠍⠊⠣⠞ ⠙⠁⠧⠑ ⠃⠑⠲ ⠔⠊⠇⠔⠫⠂ ⠍⠹⠎⠑⠇⠋⠂ ⠞⠕
183 | ⠗⠑⠛⠜⠙ ⠁ ⠊⠕⠋⠋⠔⠤⠝⠁⠊⠇ ⠁⠎ ⠹⠑ ⠙⠑⠁⠙⠑⠌ ⠏⠊⠑⠊⠑ ⠕⠋ ⠊⠗⠕⠝⠍⠕⠝⠛⠻⠹
184 | ⠔ ⠹⠑ ⠞⠗⠁⠙⠑⠲ ⡃⠥⠞ ⠹⠑ ⠺⠊⠎⠙⠕⠍ ⠕⠋ ⠳⠗ ⠁⠝⠊⠑⠌⠕⠗⠎
185 | ⠊⠎ ⠔ ⠹⠑ ⠎⠊⠍⠊⠇⠑⠆ ⠁⠝⠙ ⠍⠹ ⠥⠝⠙⠁⠇⠇⠪⠫ ⠙⠁⠝⠙⠎
186 | ⠩⠁⠇⠇ ⠝⠕⠞ ⠙⠊⠌⠥⠗⠃ ⠊⠞⠂ ⠕⠗ ⠹⠑ ⡊⠳⠝⠞⠗⠹⠰⠎ ⠙⠕⠝⠑ ⠋⠕⠗⠲ ⡹⠳
187 | ⠺⠊⠇⠇ ⠹⠻⠑⠋⠕⠗⠑ ⠏⠻⠍⠊⠞ ⠍⠑ ⠞⠕ ⠗⠑⠏⠑⠁⠞⠂ ⠑⠍⠏⠙⠁⠞⠊⠊⠁⠇⠇⠹⠂ ⠹⠁⠞
188 | ⡍⠜⠇⠑⠹ ⠺⠁⠎ ⠁⠎ ⠙⠑⠁⠙ ⠁⠎ ⠁ ⠙⠕⠕⠗⠤⠝⠁⠊⠇⠲
189 |
190 | (The first couple of paragraphs of "A Christmas Carol" by Dickens)
191 |
192 | Compact font selection example text:
193 |
194 | ABCDEFGHIJKLMNOPQRSTUVWXYZ /0123456789
195 | abcdefghijklmnopqrstuvwxyz £©µÀÆÖÞßéöÿ
196 | –—‘“”„†•…‰™œŠŸž€ ΑΒΓΔΩαβγδω АБВГДабвгд
197 | ∀∂∈ℝ∧∪≡∞ ↑↗↨↻⇣ ┐┼╔╘░►☺♀ fi�⑀₂ἠḂӥẄɐː⍎אԱა
198 |
199 | Greetings in various languages:
200 |
201 | Hello world, Καλημέρα κόσμε, コンニチハ
202 |
203 | Box drawing alignment tests: █
204 | ▉
205 | ╔══╦══╗ ┌──┬──┐ ╭──┬──╮ ╭──┬──╮ ┏━━┳━━┓ ┎┒┏┑ ╷ ╻ ┏┯┓ ┌┰┐ ▊ ╱╲╱╲╳╳╳
206 | ║┌─╨─┐║ │╔═╧═╗│ │╒═╪═╕│ │╓─╁─╖│ ┃┌─╂─┐┃ ┗╃╄┙ ╶┼╴╺╋╸┠┼┨ ┝╋┥ ▋ ╲╱╲╱╳╳╳
207 | ║│╲ ╱│║ │║ ║│ ││ │ ││ │║ ┃ ║│ ┃│ ╿ │┃ ┍╅╆┓ ╵ ╹ ┗┷┛ └┸┘ ▌ ╱╲╱╲╳╳╳
208 | ╠╡ ╳ ╞╣ ├╢ ╟┤ ├┼─┼─┼┤ ├╫─╂─╫┤ ┣┿╾┼╼┿┫ ┕┛┖┚ ┌┄┄┐ ╎ ┏┅┅┓ ┋ ▍ ╲╱╲╱╳╳╳
209 | ║│╱ ╲│║ │║ ║│ ││ │ ││ │║ ┃ ║│ ┃│ ╽ │┃ ░░▒▒▓▓██ ┊ ┆ ╎ ╏ ┇ ┋ ▎
210 | ║└─╥─┘║ │╚═╤═╝│ │╘═╪═╛│ │╙─╀─╜│ ┃└─╂─┘┃ ░░▒▒▓▓██ ┊ ┆ ╎ ╏ ┇ ┋ ▏
211 | ╚══╩══╝ └──┴──┘ ╰──┴──╯ ╰──┴──╯ ┗━━┻━━┛ ▗▄▖▛▀▜ └╌╌┘ ╎ ┗╍╍┛ ┋ ▁▂▃▄▅▆▇█
212 | ▝▀▘▙▄▟
213 |
--------------------------------------------------------------------------------
/ascii.cpp:
--------------------------------------------------------------------------------
1 | #include <assert.h>
2 | #include <stdint.h>
3 | #include <stdio.h>
4 | #include <string.h>
5 | #include <sys/time.h>
6 | #include <algorithm>
7 |
8 | #include <vector>
9 |
10 | static inline int ascii_std(const uint8_t *data, int len)
11 | {
12 | return !std::any_of(data, data+len, [] (int8_t b) { return b < 0; });
13 | }
14 |
15 | static inline int ascii_u64(const uint8_t *data, int len)
16 | {
17 | uint8_t orall = 0;
18 |
19 | if (len >= 16) {
20 |
21 | uint64_t or1 = 0, or2 = 0;
22 | const uint8_t *data2 = data+8;
23 |
24 | do {
25 | or1 |= *(const uint64_t *)data;
26 | or2 |= *(const uint64_t *)data2;
27 | data += 16;
28 | data2 += 16;
29 | len -= 16;
30 | } while (len >= 16);
31 |
32 | /*
33 | * Idea from Benny Halevy
34 | * - 7-th bit set ==> orall = !(non-zero) - 1 = 0 - 1 = 0xFF
35 | * - 7-th bit clear ==> orall = !0 - 1 = 1 - 1 = 0x00
36 | */
37 | orall = !((or1 | or2) & 0x8080808080808080ULL) - 1;
38 | }
39 |
40 | while (len--)
41 | orall |= *data++;
42 |
43 | return orall < 0x80;
44 | }
45 |
46 | #if defined(__x86_64__)
47 | #include <emmintrin.h>
48 |
49 | static inline int ascii_simd(const uint8_t *data, int len)
50 | {
51 | if (len >= 32) {
52 | const uint8_t *data2 = data+16;
53 |
54 | __m128i or1 = _mm_set1_epi8(0), or2 = or1;
55 |
56 | while (len >= 32) {
57 | __m128i input1 = _mm_loadu_si128((const __m128i *)data);
58 | __m128i input2 = _mm_loadu_si128((const __m128i *)data2);
59 |
60 | or1 = _mm_or_si128(or1, input1);
61 | or2 = _mm_or_si128(or2, input2);
62 |
63 | data += 32;
64 | data2 += 32;
65 | len -= 32;
66 | }
67 |
68 | or1 = _mm_or_si128(or1, or2);
69 | if (_mm_movemask_epi8(_mm_cmplt_epi8(or1, _mm_set1_epi8(0))))
70 | return 0;
71 | }
72 |
73 | return ascii_u64(data, len);
74 | }
75 |
76 | #elif defined(__aarch64__)
77 | #include <arm_neon.h>
78 |
79 | static inline int ascii_simd(const uint8_t *data, int len)
80 | {
81 | if (len >= 32) {
82 | const uint8_t *data2 = data+16;
83 |
84 | uint8x16_t or1 = vdupq_n_u8(0), or2 = or1;
85 |
86 | while (len >= 32) {
87 | const uint8x16_t input1 = vld1q_u8(data);
88 | const uint8x16_t input2 = vld1q_u8(data2);
89 |
90 | or1 = vorrq_u8(or1, input1);
91 | or2 = vorrq_u8(or2, input2);
92 |
93 | data += 32;
94 | data2 += 32;
95 | len -= 32;
96 | }
97 |
98 | or1 = vorrq_u8(or1, or2);
99 | if (vmaxvq_u8(or1) >= 0x80)
100 | return 0;
101 | }
102 |
103 | return ascii_u64(data, len);
104 | }
105 |
106 | #endif
107 |
108 | struct ftab {
109 | const char *name;
110 | int (*func)(const uint8_t *data, int len);
111 | };
112 |
113 | static const std::vector<ftab> _f = {
114 | {
115 | .name = "std",
116 | .func = ascii_std,
117 | }, {
118 | .name = "u64",
119 | .func = ascii_u64,
120 | }, {
121 | .name = "simd",
122 | .func = ascii_simd,
123 | },
124 | };
125 |
126 | static void load_test_buf(uint8_t *data, int len)
127 | {
128 | uint8_t v = 0;
129 |
130 | for (int i = 0; i < len; ++i) {
131 | data[i] = v++;
132 | v &= 0x7F;
133 | }
134 | }
135 |
136 | static void bench(const struct ftab &f, const uint8_t *data, int len)
137 | {
138 | const int loops = 1024*1024*1024/len;
139 | int ret = 1;
140 | double time_aligned, time_unaligned, size;
141 | struct timeval tv1, tv2;
142 |
143 | fprintf(stderr, "bench %s (%d bytes)... ", f.name, len);
144 |
145 | /* aligned */
146 | gettimeofday(&tv1, 0);
147 | for (int i = 0; i < loops; ++i)
148 | ret &= f.func(data, len);
149 | gettimeofday(&tv2, 0);
150 | time_aligned = tv2.tv_usec - tv1.tv_usec;
151 | time_aligned = time_aligned / 1000000 + tv2.tv_sec - tv1.tv_sec;
152 |
153 | /* unaligned */
154 | gettimeofday(&tv1, 0);
155 | for (int i = 0; i < loops; ++i)
156 | ret &= f.func(data+1, len);
157 | gettimeofday(&tv2, 0);
158 | time_unaligned = tv2.tv_usec - tv1.tv_usec;
159 | time_unaligned = time_unaligned / 1000000 + tv2.tv_sec - tv1.tv_sec;
160 |
161 | printf("%s ", ret?"pass":"FAIL");
162 |
163 | size = ((double)len * loops) / (1024*1024);
164 | printf("%.0f/%.0f MB/s\n", size / time_aligned, size / time_unaligned);
165 | }
166 |
167 | static void test(const struct ftab &f, uint8_t *data, int len)
168 | {
169 | int error = 0;
170 |
171 | fprintf(stderr, "test %s (%d bytes)... ", f.name, len);
172 |
173 | /* positive */
174 | error |= !f.func(data, len);
175 |
176 | /* negative */
177 | if (len < 100*1024) {
178 | for (int i = 0; i < len; ++i) {
179 | data[i] += 0x80;
180 | error |= f.func(data, len);
181 | data[i] -= 0x80;
182 | }
183 | }
184 |
185 | printf("%s\n", error ? "FAIL" : "pass");
186 | }
187 |
188 | /* ./ascii [test|bench] [alg] */
189 | int main(int argc, const char *argv[])
190 | {
191 | int do_test = 1, do_bench = 1;
192 | const char *alg = NULL;
193 |
194 | if (argc > 1) {
195 | do_bench &= !!strcmp(argv[1], "test");
196 | do_test &= !!strcmp(argv[1], "bench");
197 | }
198 |
199 | if (do_bench && argc > 2)
200 | alg = argv[2];
201 |
202 | const std::vector<int> size = { 9, 16+1, 32-1, 128+1, 1024+15,
203 | 16*1024+1, 64*1024+15, 1024*1024 };
204 |
205 | int max_size = *std::max_element(size.begin(), size.end());
206 | uint8_t *_data = new uint8_t[max_size+1];
207 | assert(((uintptr_t)_data & 7) == 0);
208 | uint8_t *data = _data+1; /* Unalign buffer address */
209 |
210 | _data[0] = 0;
211 | load_test_buf(data, max_size);
212 |
213 | if (do_test) {
214 | printf("==================== Test ====================\n");
215 | for (int sz : size) {
216 | for (auto &f : _f) {
217 | test(f, data, sz);
218 | }
219 | }
220 | }
221 |
222 | if (do_bench) {
223 | printf("==================== Bench ====================\n");
224 | for (int sz : size) {
225 | for (auto &f : _f) {
226 | if (!alg || strcmp(alg, f.name) == 0)
227 | bench(f, _data, sz);
228 | }
229 | printf("-----------------------------------------------\n");
230 | }
231 | }
232 |
233 | delete[] _data;
234 | return 0;
235 | }
236 |
--------------------------------------------------------------------------------
/boost.cpp:
--------------------------------------------------------------------------------
1 | #include <boost/locale.hpp>
2 |
3 | using namespace std;
4 |
5 | /* Return 0 on success, -1 on error */
6 | extern "C" int utf8_boost(const unsigned char *data, int len)
7 | {
8 | try {
9 | boost::locale::conv::utf_to_utf<char>(data, data+len,
10 | boost::locale::conv::stop);
11 | } catch (const boost::locale::conv::conversion_error& ex) {
12 | return -1;
13 | }
14 |
15 | return 0;
16 | }
17 |
--------------------------------------------------------------------------------
/lemire-avx2.c:
--------------------------------------------------------------------------------
1 | // Adapted from https://github.com/lemire/fastvalidate-utf-8
2 |
3 | #ifdef __AVX2__
4 |
5 | #include <stdio.h>
6 | #include <stddef.h>
7 | #include <stdint.h>
8 | #include <string.h>
9 | #include <x86intrin.h>
10 |
11 | /*
12 | * legal utf-8 byte sequence
13 | * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
14 | *
15 | * Code Points          1st     2nd     3rd     4th
16 | * U+0000..U+007F       00..7F
17 | * U+0080..U+07FF       C2..DF  80..BF
18 | * U+0800..U+0FFF       E0      A0..BF  80..BF
19 | * U+1000..U+CFFF       E1..EC  80..BF  80..BF
20 | * U+D000..U+D7FF       ED      80..9F  80..BF
21 | * U+E000..U+FFFF       EE..EF  80..BF  80..BF
22 | * U+10000..U+3FFFF     F0      90..BF  80..BF  80..BF
23 | * U+40000..U+FFFFF     F1..F3  80..BF  80..BF  80..BF
24 | * U+100000..U+10FFFF   F4      80..8F  80..BF  80..BF
25 | *
26 | */
27 |
28 | #if 0
29 | static void print256(const char *s, const __m256i v256)
30 | {
31 | const unsigned char *v8 = (const unsigned char *)&v256;
32 | if (s)
33 | printf("%s:\t", s);
34 | for (int i = 0; i < 32; i++)
35 | printf("%02x ", v8[i]);
36 | printf("\n");
37 | }
38 | #endif
39 |
40 | static inline __m256i push_last_byte_of_a_to_b(__m256i a, __m256i b) {
41 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 15);
42 | }
43 |
44 | static inline __m256i push_last_2bytes_of_a_to_b(__m256i a, __m256i b) {
45 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 14);
46 | }
47 |
48 | // all byte values must be no larger than 0xF4
49 | static inline void avxcheckSmallerThan0xF4(__m256i current_bytes,
50 | __m256i *has_error) {
51 | // unsigned, saturates to 0 below max
52 | *has_error = _mm256_or_si256(
53 | *has_error, _mm256_subs_epu8(current_bytes, _mm256_set1_epi8(0xF4)));
54 | }
55 |
56 | static inline __m256i avxcontinuationLengths(__m256i high_nibbles) {
57 | return _mm256_shuffle_epi8(
58 | _mm256_setr_epi8(1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
59 | 0, 0, 0, 0, // 10xx (continuation)
60 | 2, 2, // 110x
61 | 3, // 1110
62 | 4, // 1111, next should be 0 (not checked here)
63 | 1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
64 | 0, 0, 0, 0, // 10xx (continuation)
65 | 2, 2, // 110x
66 | 3, // 1110
67 | 4 // 1111, next should be 0 (not checked here)
68 | ),
69 | high_nibbles);
70 | }
71 |
72 | static inline __m256i avxcarryContinuations(__m256i initial_lengths,
73 | __m256i previous_carries) {
74 |
75 | __m256i right1 = _mm256_subs_epu8(
76 | push_last_byte_of_a_to_b(previous_carries, initial_lengths),
77 | _mm256_set1_epi8(1));
78 | __m256i sum = _mm256_add_epi8(initial_lengths, right1);
79 |
80 | __m256i right2 = _mm256_subs_epu8(
81 | push_last_2bytes_of_a_to_b(previous_carries, sum), _mm256_set1_epi8(2));
82 | return _mm256_add_epi8(sum, right2);
83 | }
84 |
85 | static inline void avxcheckContinuations(__m256i initial_lengths,
86 | __m256i carries, __m256i *has_error) {
87 |
88 | // overlap || underlap
89 | // carry > length && length > 0 || !(carry > length) && !(length > 0)
90 | // (carries > length) == (lengths > 0)
91 | __m256i overunder = _mm256_cmpeq_epi8(
92 | _mm256_cmpgt_epi8(carries, initial_lengths),
93 | _mm256_cmpgt_epi8(initial_lengths, _mm256_setzero_si256()));
94 |
95 | *has_error = _mm256_or_si256(*has_error, overunder);
96 | }
97 |
98 | // when 0xED is found, next byte must be no larger than 0x9F
99 | // when 0xF4 is found, next byte must be no larger than 0x8F
100 | // next byte must be continuation, ie sign bit is set, so signed < is ok
101 | static inline void avxcheckFirstContinuationMax(__m256i current_bytes,
102 | __m256i off1_current_bytes,
103 | __m256i *has_error) {
104 | __m256i maskED =
105 | _mm256_cmpeq_epi8(off1_current_bytes, _mm256_set1_epi8(0xED));
106 | __m256i maskF4 =
107 | _mm256_cmpeq_epi8(off1_current_bytes, _mm256_set1_epi8(0xF4));
108 |
109 | __m256i badfollowED = _mm256_and_si256(
110 | _mm256_cmpgt_epi8(current_bytes, _mm256_set1_epi8(0x9F)), maskED);
111 | __m256i badfollowF4 = _mm256_and_si256(
112 | _mm256_cmpgt_epi8(current_bytes, _mm256_set1_epi8(0x8F)), maskF4);
113 |
114 | *has_error =
115 | _mm256_or_si256(*has_error, _mm256_or_si256(badfollowED, badfollowF4));
116 | }
117 |
118 | // map off1_hibits => error condition
119 | // hibits off1 cur
120 | // C => < C2 && true
121 | // E => < E1 && < A0
122 | // F => < F1 && < 90
123 | // else false && false
124 | static inline void avxcheckOverlong(__m256i current_bytes,
125 | __m256i off1_current_bytes, __m256i hibits,
126 | __m256i previous_hibits,
127 | __m256i *has_error) {
128 | __m256i off1_hibits = push_last_byte_of_a_to_b(previous_hibits, hibits);
129 | __m256i initial_mins = _mm256_shuffle_epi8(
130 | _mm256_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128,
131 | -128, -128, -128, // 10xx => false
132 | 0xC2, -128, // 110x
133 | 0xE1, // 1110
134 | 0xF1, -128, -128, -128, -128, -128, -128, -128, -128,
135 | -128, -128, -128, -128, // 10xx => false
136 | 0xC2, -128, // 110x
137 | 0xE1, // 1110
138 | 0xF1),
139 | off1_hibits);
140 |
141 | __m256i initial_under = _mm256_cmpgt_epi8(initial_mins, off1_current_bytes);
142 |
143 | __m256i second_mins = _mm256_shuffle_epi8(
144 | _mm256_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128,
145 | -128, -128, -128, // 10xx => false
146 | 127, 127, // 110x => true
147 | 0xA0, // 1110
148 | 0x90, -128, -128, -128, -128, -128, -128, -128, -128,
149 | -128, -128, -128, -128, // 10xx => false
150 | 127, 127, // 110x => true
151 | 0xA0, // 1110
152 | 0x90),
153 | off1_hibits);
154 | __m256i second_under = _mm256_cmpgt_epi8(second_mins, current_bytes);
155 | *has_error = _mm256_or_si256(*has_error,
156 | _mm256_and_si256(initial_under, second_under));
157 | }
158 |
159 | struct avx_processed_utf_bytes {
160 | __m256i rawbytes;
161 | __m256i high_nibbles;
162 | __m256i carried_continuations;
163 | };
164 |
165 | static inline void avx_count_nibbles(__m256i bytes,
166 | struct avx_processed_utf_bytes *answer) {
167 | answer->rawbytes = bytes;
168 | answer->high_nibbles =
169 | _mm256_and_si256(_mm256_srli_epi16(bytes, 4), _mm256_set1_epi8(0x0F));
170 | }
171 |
172 | // check whether the current bytes are valid UTF-8
173 | // at the end of the function, previous gets updated
174 | static struct avx_processed_utf_bytes
175 | avxcheckUTF8Bytes(__m256i current_bytes,
176 | struct avx_processed_utf_bytes *previous,
177 | __m256i *has_error) {
178 | struct avx_processed_utf_bytes pb;
179 | avx_count_nibbles(current_bytes, &pb);
180 |
181 | avxcheckSmallerThan0xF4(current_bytes, has_error);
182 |
183 | __m256i initial_lengths = avxcontinuationLengths(pb.high_nibbles);
184 |
185 | pb.carried_continuations =
186 | avxcarryContinuations(initial_lengths, previous->carried_continuations);
187 |
188 | avxcheckContinuations(initial_lengths, pb.carried_continuations, has_error);
189 |
190 | __m256i off1_current_bytes =
191 | push_last_byte_of_a_to_b(previous->rawbytes, pb.rawbytes);
192 | avxcheckFirstContinuationMax(current_bytes, off1_current_bytes, has_error);
193 |
194 | avxcheckOverlong(current_bytes, off1_current_bytes, pb.high_nibbles,
195 | previous->high_nibbles, has_error);
196 | return pb;
197 | }
198 |
199 | /* Return 0 on success, -1 on error */
200 | int utf8_lemire_avx2(const unsigned char *src, int len) {
201 | size_t i = 0;
202 | __m256i has_error = _mm256_setzero_si256();
203 | struct avx_processed_utf_bytes previous = {
204 | .rawbytes = _mm256_setzero_si256(),
205 | .high_nibbles = _mm256_setzero_si256(),
206 | .carried_continuations = _mm256_setzero_si256()};
207 | if (len >= 32) {
208 | for (; i <= len - 32; i += 32) {
209 | __m256i current_bytes = _mm256_loadu_si256((const __m256i *)(src + i));
210 | previous = avxcheckUTF8Bytes(current_bytes, &previous, &has_error);
211 | }
212 | }
213 |
214 | // last part
215 | if (i < len) {
216 | char buffer[32];
217 | memset(buffer, 0, 32);
218 | memcpy(buffer, src + i, len - i);
219 | __m256i current_bytes = _mm256_loadu_si256((const __m256i *)(buffer));
220 | previous = avxcheckUTF8Bytes(current_bytes, &previous, &has_error);
221 | } else {
222 | has_error = _mm256_or_si256(
223 | _mm256_cmpgt_epi8(previous.carried_continuations,
224 | _mm256_setr_epi8(9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
225 | 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
226 | 9, 9, 9, 9, 9, 9, 9, 1)),
227 | has_error);
228 | }
229 |
230 | return _mm256_testz_si256(has_error, has_error) ? 0 : -1;
231 | }
232 |
233 | #endif
234 |
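The comment on avxcheckFirstContinuationMax says a signed comparison is acceptable because the byte under test should be a continuation byte, whose sign bit is set. Below is a minimal scalar sketch of that check. It assumes a two's-complement target where converting 0x9F to int8_t yields -97, and the name bad_first_continuation is hypothetical, not part of this repository.

```c
#include <stdint.h>

/*
 * Hypothetical scalar model of the ED/F4 upper-bound check.
 * Returns 1 when `next` is too large to follow `lead`, 0 otherwise.
 * Both operands are compared as signed bytes, as _mm256_cmpgt_epi8 does.
 */
static int bad_first_continuation(uint8_t lead, uint8_t next)
{
    int8_t cur = (int8_t)next;

    if (lead == 0xED)               /* ED A0..BF would encode surrogates */
        return cur > (int8_t)0x9F;
    if (lead == 0xF4)               /* F4 90..BF would exceed U+10FFFF */
        return cur > (int8_t)0x8F;
    return 0;                       /* other lead bytes have no upper bound here */
}
```

An ASCII byte after 0xED or 0xF4 is also flagged by the signed comparison, because any non-negative value is greater than a negative bound. Such a sequence is invalid anyway.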
--------------------------------------------------------------------------------
/lemire-neon.c:
--------------------------------------------------------------------------------
1 | // Adapted from https://github.com/lemire/fastvalidate-utf-8
2 |
3 | #ifdef __aarch64__
4 |
5 | #include <stdio.h>
6 | #include <stdbool.h>
7 | #include <stddef.h>
8 | #include <stdint.h>
9 | #include <string.h>
10 | #include <arm_neon.h>
11 |
12 | /*
13 | * legal utf-8 byte sequence
14 | * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
15 | *
16 | * Code Points         1st     2s      3s      4s
17 | * U+0000..U+007F      00..7F
18 | * U+0080..U+07FF      C2..DF  80..BF
19 | * U+0800..U+0FFF      E0      A0..BF  80..BF
20 | * U+1000..U+CFFF      E1..EC  80..BF  80..BF
21 | * U+D000..U+D7FF      ED      80..9F  80..BF
22 | * U+E000..U+FFFF      EE..EF  80..BF  80..BF
23 | * U+10000..U+3FFFF    F0      90..BF  80..BF  80..BF
24 | * U+40000..U+FFFFF    F1..F3  80..BF  80..BF  80..BF
25 | * U+100000..U+10FFFF  F4      80..8F  80..BF  80..BF
26 | *
27 | */
28 |
29 | #if 0
30 | static void print128(const char *s, const int8x16_t *v128)
31 | {
32 | int8_t v8[16];
33 | vst1q_s8(v8, *v128);
34 |
35 | if (s)
36 | printf("%s:\t", s);
37 | for (int i = 0; i < 16; ++i)
38 | printf("%02x ", (unsigned char)v8[i]);
39 | printf("\n");
40 | }
41 | #endif
42 |
43 | // all byte values must be no larger than 0xF4
44 | static inline void checkSmallerThan0xF4(int8x16_t current_bytes,
45 | int8x16_t *has_error) {
46 | // unsigned, saturates to 0 below max
47 | *has_error = vorrq_s8(*has_error,
48 | vreinterpretq_s8_u8(vqsubq_u8(vreinterpretq_u8_s8(current_bytes), vdupq_n_u8(0xF4))));
49 | }
50 |
51 | static const int8_t _nibbles[] = {
52 | 1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
53 | 0, 0, 0, 0, // 10xx (continuation)
54 | 2, 2, // 110x
55 | 3, // 1110
56 | 4, // 1111, next should be 0 (not checked here)
57 | };
58 |
59 | static inline int8x16_t continuationLengths(int8x16_t high_nibbles) {
60 | return vqtbl1q_s8(vld1q_s8(_nibbles), vreinterpretq_u8_s8(high_nibbles));
61 | }
62 |
63 | static inline int8x16_t carryContinuations(int8x16_t initial_lengths,
64 | int8x16_t previous_carries) {
65 |
66 | int8x16_t right1 =
67 | vreinterpretq_s8_u8(vqsubq_u8(vreinterpretq_u8_s8(vextq_s8(previous_carries, initial_lengths, 16 - 1)),
68 | vdupq_n_u8(1)));
69 | int8x16_t sum = vaddq_s8(initial_lengths, right1);
70 |
71 | int8x16_t right2 = vreinterpretq_s8_u8(vqsubq_u8(vreinterpretq_u8_s8(vextq_s8(previous_carries, sum, 16 - 2)),
72 | vdupq_n_u8(2)));
73 | return vaddq_s8(sum, right2);
74 | }
75 |
76 | static inline void checkContinuations(int8x16_t initial_lengths, int8x16_t carries,
77 | int8x16_t *has_error) {
78 |
79 | // overlap || underlap
80 | // carry > length && length > 0 || !(carry > length) && !(length > 0)
81 | // (carries > length) == (lengths > 0)
82 | uint8x16_t overunder =
83 | vceqq_u8(vcgtq_s8(carries, initial_lengths),
84 | vcgtq_s8(initial_lengths, vdupq_n_s8(0)));
85 |
86 | *has_error = vorrq_s8(*has_error, vreinterpretq_s8_u8(overunder));
87 | }
88 |
89 | // when 0xED is found, next byte must be no larger than 0x9F
90 | // when 0xF4 is found, next byte must be no larger than 0x8F
91 | // next byte must be continuation, ie sign bit is set, so signed < is ok
92 | static inline void checkFirstContinuationMax(int8x16_t current_bytes,
93 | int8x16_t off1_current_bytes,
94 | int8x16_t *has_error) {
95 | uint8x16_t maskED = vceqq_s8(off1_current_bytes, vdupq_n_s8(0xED));
96 | uint8x16_t maskF4 = vceqq_s8(off1_current_bytes, vdupq_n_s8(0xF4));
97 |
98 | uint8x16_t badfollowED =
99 | vandq_u8(vcgtq_s8(current_bytes, vdupq_n_s8(0x9F)), maskED);
100 | uint8x16_t badfollowF4 =
101 | vandq_u8(vcgtq_s8(current_bytes, vdupq_n_s8(0x8F)), maskF4);
102 |
103 | *has_error = vorrq_s8(*has_error, vreinterpretq_s8_u8(vorrq_u8(badfollowED, badfollowF4)));
104 | }
105 |
106 | static const int8_t _initial_mins[] = {
107 | -128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
108 | -128, -128, // 10xx => false
109 | 0xC2, -128, // 110x
110 | 0xE1, // 1110
111 | 0xF1,
112 | };
113 |
114 | static const int8_t _second_mins[] = {
115 | -128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
116 | -128, -128, // 10xx => false
117 | 127, 127, // 110x => true
118 | 0xA0, // 1110
119 | 0x90,
120 | };
121 |
122 | // map off1_hibits => error condition
123 | // hibits off1 cur
124 | // C => < C2 && true
125 | // E => < E1 && < A0
126 | // F => < F1 && < 90
127 | // else false && false
128 | static inline void checkOverlong(int8x16_t current_bytes,
129 | int8x16_t off1_current_bytes, int8x16_t hibits,
130 | int8x16_t previous_hibits, int8x16_t *has_error) {
131 | int8x16_t off1_hibits = vextq_s8(previous_hibits, hibits, 16 - 1);
132 | int8x16_t initial_mins = vqtbl1q_s8(vld1q_s8(_initial_mins), vreinterpretq_u8_s8(off1_hibits));
133 |
134 | uint8x16_t initial_under = vcgtq_s8(initial_mins, off1_current_bytes);
135 |
136 | int8x16_t second_mins = vqtbl1q_s8(vld1q_s8(_second_mins), vreinterpretq_u8_s8(off1_hibits));
137 | uint8x16_t second_under = vcgtq_s8(second_mins, current_bytes);
138 | *has_error =
139 | vorrq_s8(*has_error, vreinterpretq_s8_u8(vandq_u8(initial_under, second_under)));
140 | }
141 |
142 | struct processed_utf_bytes {
143 | int8x16_t rawbytes;
144 | int8x16_t high_nibbles;
145 | int8x16_t carried_continuations;
146 | };
147 |
148 | static inline void count_nibbles(int8x16_t bytes,
149 | struct processed_utf_bytes *answer) {
150 | answer->rawbytes = bytes;
151 | answer->high_nibbles =
152 | vreinterpretq_s8_u8(vshrq_n_u8(vreinterpretq_u8_s8(bytes), 4));
153 | }
154 |
155 | // check whether the current bytes are valid UTF-8
156 | // at the end of the function, previous gets updated
157 | static inline struct processed_utf_bytes
158 | checkUTF8Bytes(int8x16_t current_bytes, struct processed_utf_bytes *previous,
159 | int8x16_t *has_error) {
160 | struct processed_utf_bytes pb;
161 | count_nibbles(current_bytes, &pb);
162 |
163 | checkSmallerThan0xF4(current_bytes, has_error);
164 |
165 | int8x16_t initial_lengths = continuationLengths(pb.high_nibbles);
166 |
167 | pb.carried_continuations =
168 | carryContinuations(initial_lengths, previous->carried_continuations);
169 |
170 | checkContinuations(initial_lengths, pb.carried_continuations, has_error);
171 |
172 | int8x16_t off1_current_bytes =
173 | vextq_s8(previous->rawbytes, pb.rawbytes, 16 - 1);
174 | checkFirstContinuationMax(current_bytes, off1_current_bytes, has_error);
175 |
176 | checkOverlong(current_bytes, off1_current_bytes, pb.high_nibbles,
177 | previous->high_nibbles, has_error);
178 | return pb;
179 | }
180 |
181 | static const int8_t _verror[] = {9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1};
182 |
183 | /* Return 0 on success, -1 on error */
184 | int utf8_lemire(const unsigned char *src, int len) {
185 | size_t i = 0;
186 | int8x16_t has_error = vdupq_n_s8(0);
187 | struct processed_utf_bytes previous = {.rawbytes = vdupq_n_s8(0),
188 | .high_nibbles = vdupq_n_s8(0),
189 | .carried_continuations =
190 | vdupq_n_s8(0)};
191 | if (len >= 16) {
192 | for (; i <= len - 16; i += 16) {
193 | int8x16_t current_bytes = vld1q_s8((int8_t*)(src + i));
194 | previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
195 | }
196 | }
197 |
198 | // last part
199 | if (i < len) {
200 | char buffer[16];
201 | memset(buffer, 0, 16);
202 | memcpy(buffer, src + i, len - i);
203 | int8x16_t current_bytes = vld1q_s8((int8_t *)buffer);
204 | previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
205 | } else {
206 | has_error =
207 | vorrq_s8(vreinterpretq_s8_u8(vcgtq_s8(previous.carried_continuations,
208 | vld1q_s8(_verror))),
209 | has_error);
210 | }
211 |
212 | return vmaxvq_u8(vreinterpretq_u8_s8(has_error)) == 0 ? 0 : -1;
213 | }
214 |
215 | #endif
216 |
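The "map off1_hibits => error condition" comment above checkOverlong describes a per-nibble rule for the lead byte and the byte that follows it. A minimal scalar sketch of that rule follows. The name is_overlong is hypothetical, and the sketch assumes `next` is already known to be a continuation byte (80..BF), so it uses plain unsigned comparisons rather than the signed table lookups in the NEON code.

```c
#include <stdint.h>

/*
 * Hypothetical scalar model of the overlong check.
 * `lead` is the previous byte (off1), `next` is the current byte.
 * Returns 1 if the pair starts an overlong encoding, 0 otherwise.
 */
static int is_overlong(uint8_t lead, uint8_t next)
{
    switch (lead >> 4) {
    case 0xC:   /* C0 and C1 would encode U+0000..U+007F in two bytes */
        return lead < 0xC2;
    case 0xE:   /* E0 80..9F would encode U+0000..U+07FF in three bytes */
        return lead < 0xE1 && next < 0xA0;
    case 0xF:   /* F0 80..8F would encode U+0000..U+FFFF in four bytes */
        return lead < 0xF1 && next < 0x90;
    default:    /* ASCII, continuation bytes and D0..DF are never overlong */
        return 0;
    }
}
```

The three non-trivial cases correspond to the C, E and F rows of the comment. Every other high nibble maps to "false && false".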
--------------------------------------------------------------------------------
/lemire-sse.c:
--------------------------------------------------------------------------------
1 | // Adapted from https://github.com/lemire/fastvalidate-utf-8
2 |
3 | #ifdef __x86_64__
4 |
5 | #include <stdio.h>
6 | #include <stdint.h>
7 | #include <stddef.h>
8 | #include <string.h>
9 | #include <x86intrin.h>
10 |
11 | /*
12 | * legal utf-8 byte sequence
13 | * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
14 | *
15 | * Code Points         1st     2s      3s      4s
16 | * U+0000..U+007F      00..7F
17 | * U+0080..U+07FF      C2..DF  80..BF
18 | * U+0800..U+0FFF      E0      A0..BF  80..BF
19 | * U+1000..U+CFFF      E1..EC  80..BF  80..BF
20 | * U+D000..U+D7FF      ED      80..9F  80..BF
21 | * U+E000..U+FFFF      EE..EF  80..BF  80..BF
22 | * U+10000..U+3FFFF    F0      90..BF  80..BF  80..BF
23 | * U+40000..U+FFFFF    F1..F3  80..BF  80..BF  80..BF
24 | * U+100000..U+10FFFF  F4      80..8F  80..BF  80..BF
25 | *
26 | */
27 |
28 | #if 0
29 | static void print128(const char *s, const __m128i *v128)
30 | {
31 | const unsigned char *v8 = (const unsigned char *)v128;
32 | if (s)
33 | printf("%s: ", s);
34 | for (int i = 0; i < 16; i++)
35 | printf("%02x ", v8[i]);
36 | printf("\n");
37 | }
38 | #endif
39 |
40 | // all byte values must be no larger than 0xF4
41 | static inline void checkSmallerThan0xF4(__m128i current_bytes,
42 | __m128i *has_error) {
43 | // unsigned, saturates to 0 below max
44 | *has_error = _mm_or_si128(*has_error,
45 | _mm_subs_epu8(current_bytes, _mm_set1_epi8(0xF4)));
46 | }
47 |
48 | static inline __m128i continuationLengths(__m128i high_nibbles) {
49 | return _mm_shuffle_epi8(
50 | _mm_setr_epi8(1, 1, 1, 1, 1, 1, 1, 1, // 0xxx (ASCII)
51 | 0, 0, 0, 0, // 10xx (continuation)
52 | 2, 2, // 110x
53 | 3, // 1110
54 | 4), // 1111, next should be 0 (not checked here)
55 | high_nibbles);
56 | }
57 |
58 | static inline __m128i carryContinuations(__m128i initial_lengths,
59 | __m128i previous_carries) {
60 |
61 | __m128i right1 =
62 | _mm_subs_epu8(_mm_alignr_epi8(initial_lengths, previous_carries, 16 - 1),
63 | _mm_set1_epi8(1));
64 | __m128i sum = _mm_add_epi8(initial_lengths, right1);
65 |
66 | __m128i right2 = _mm_subs_epu8(_mm_alignr_epi8(sum, previous_carries, 16 - 2),
67 | _mm_set1_epi8(2));
68 | return _mm_add_epi8(sum, right2);
69 | }
70 |
71 | static inline void checkContinuations(__m128i initial_lengths, __m128i carries,
72 | __m128i *has_error) {
73 |
74 | // overlap || underlap
75 | // carry > length && length > 0 || !(carry > length) && !(length > 0)
76 | // (carries > length) == (lengths > 0)
77 | __m128i overunder =
78 | _mm_cmpeq_epi8(_mm_cmpgt_epi8(carries, initial_lengths),
79 | _mm_cmpgt_epi8(initial_lengths, _mm_setzero_si128()));
80 |
81 | *has_error = _mm_or_si128(*has_error, overunder);
82 | }
83 |
84 | // when 0xED is found, next byte must be no larger than 0x9F
85 | // when 0xF4 is found, next byte must be no larger than 0x8F
86 | // next byte must be continuation, ie sign bit is set, so signed < is ok
87 | static inline void checkFirstContinuationMax(__m128i current_bytes,
88 | __m128i off1_current_bytes,
89 | __m128i *has_error) {
90 | __m128i maskED = _mm_cmpeq_epi8(off1_current_bytes, _mm_set1_epi8(0xED));
91 | __m128i maskF4 = _mm_cmpeq_epi8(off1_current_bytes, _mm_set1_epi8(0xF4));
92 |
93 | __m128i badfollowED =
94 | _mm_and_si128(_mm_cmpgt_epi8(current_bytes, _mm_set1_epi8(0x9F)), maskED);
95 | __m128i badfollowF4 =
96 | _mm_and_si128(_mm_cmpgt_epi8(current_bytes, _mm_set1_epi8(0x8F)), maskF4);
97 |
98 | *has_error = _mm_or_si128(*has_error, _mm_or_si128(badfollowED, badfollowF4));
99 | }
100 |
101 | // map off1_hibits => error condition
102 | // hibits off1 cur
103 | // C => < C2 && true
104 | // E => < E1 && < A0
105 | // F => < F1 && < 90
106 | // else false && false
107 | static inline void checkOverlong(__m128i current_bytes,
108 | __m128i off1_current_bytes, __m128i hibits,
109 | __m128i previous_hibits, __m128i *has_error) {
110 | __m128i off1_hibits = _mm_alignr_epi8(hibits, previous_hibits, 16 - 1);
111 | __m128i initial_mins = _mm_shuffle_epi8(
112 | _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
113 | -128, -128, // 10xx => false
114 | 0xC2, -128, // 110x
115 | 0xE1, // 1110
116 | 0xF1),
117 | off1_hibits);
118 |
119 | __m128i initial_under = _mm_cmpgt_epi8(initial_mins, off1_current_bytes);
120 |
121 | __m128i second_mins = _mm_shuffle_epi8(
122 | _mm_setr_epi8(-128, -128, -128, -128, -128, -128, -128, -128, -128, -128,
123 | -128, -128, // 10xx => false
124 | 127, 127, // 110x => true
125 | 0xA0, // 1110
126 | 0x90),
127 | off1_hibits);
128 | __m128i second_under = _mm_cmpgt_epi8(second_mins, current_bytes);
129 | *has_error =
130 | _mm_or_si128(*has_error, _mm_and_si128(initial_under, second_under));
131 | }
132 |
133 | struct processed_utf_bytes {
134 | __m128i rawbytes;
135 | __m128i high_nibbles;
136 | __m128i carried_continuations;
137 | };
138 |
139 | static inline void count_nibbles(__m128i bytes,
140 | struct processed_utf_bytes *answer) {
141 | answer->rawbytes = bytes;
142 | answer->high_nibbles =
143 | _mm_and_si128(_mm_srli_epi16(bytes, 4), _mm_set1_epi8(0x0F));
144 | }
145 |
146 | // check whether the current bytes are valid UTF-8
147 | // at the end of the function, previous gets updated
148 | static inline struct processed_utf_bytes
149 | checkUTF8Bytes(__m128i current_bytes, struct processed_utf_bytes *previous,
150 | __m128i *has_error) {
151 |
152 | struct processed_utf_bytes pb;
153 | count_nibbles(current_bytes, &pb);
154 |
155 | checkSmallerThan0xF4(current_bytes, has_error);
156 |
157 | __m128i initial_lengths = continuationLengths(pb.high_nibbles);
158 |
159 | pb.carried_continuations =
160 | carryContinuations(initial_lengths, previous->carried_continuations);
161 |
162 | checkContinuations(initial_lengths, pb.carried_continuations, has_error);
163 |
164 | __m128i off1_current_bytes =
165 | _mm_alignr_epi8(pb.rawbytes, previous->rawbytes, 16 - 1);
166 | checkFirstContinuationMax(current_bytes, off1_current_bytes, has_error);
167 |
168 | checkOverlong(current_bytes, off1_current_bytes, pb.high_nibbles,
169 | previous->high_nibbles, has_error);
170 | return pb;
171 | }
172 |
173 | /* Return 0 on success, -1 on error */
174 | int utf8_lemire(const unsigned char *src, int len) {
175 | size_t i = 0;
176 | __m128i has_error = _mm_setzero_si128();
177 | struct processed_utf_bytes previous = {.rawbytes = _mm_setzero_si128(),
178 | .high_nibbles = _mm_setzero_si128(),
179 | .carried_continuations =
180 | _mm_setzero_si128()};
181 | if (len >= 16) {
182 | for (; i <= len - 16; i += 16) {
183 | __m128i current_bytes = _mm_loadu_si128((const __m128i *)(src + i));
184 | previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
185 | }
186 | }
187 |
188 | // last part
189 | if (i < len) {
190 | char buffer[16];
191 | memset(buffer, 0, 16);
192 | memcpy(buffer, src + i, len - i);
193 | __m128i current_bytes = _mm_loadu_si128((const __m128i *)(buffer));
194 | previous = checkUTF8Bytes(current_bytes, &previous, &has_error);
195 | } else {
196 | has_error =
197 | _mm_or_si128(_mm_cmpgt_epi8(previous.carried_continuations,
198 | _mm_setr_epi8(9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
199 | 9, 9, 9, 9, 9, 1)),
200 | has_error);
201 | }
202 |
203 | return _mm_testz_si128(has_error, has_error) ? 0 : -1;
204 | }
205 |
206 | #endif
207 |
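continuationLengths, carryContinuations and checkContinuations together verify that every lead byte is followed by the right number of continuation bytes. The scalar sketch below walks the same arithmetic one byte at a time, which may make the two saturating "shift and subtract" steps easier to follow. It is an illustration under simplifying assumptions, not the repository's implementation: the name check_structure is hypothetical, the history before the first byte is taken as zero, and only the structural check is modelled (no overlong, surrogate or range checks).

```c
#include <stdint.h>

/* Sequence length implied by the high nibble, mirroring the shuffle table. */
static uint8_t initial_length(uint8_t byte)
{
    static const uint8_t table[16] = {
        1, 1, 1, 1, 1, 1, 1, 1,     /* 0xxx (ASCII) */
        0, 0, 0, 0,                 /* 10xx (continuation) */
        2, 2,                       /* 110x */
        3,                          /* 1110 */
        4,                          /* 1111 */
    };
    return table[byte >> 4];
}

/* Unsigned saturating subtraction, like _mm_subs_epu8 on one lane. */
static uint8_t sat_sub(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Hypothetical scalar model: return 0 if lead/continuation counts agree. */
static int check_structure(const uint8_t *src, int len)
{
    uint8_t len_prev = 0;                   /* length at i-1 */
    uint8_t sum_prev1 = 0, sum_prev2 = 0;   /* sum at i-1 and i-2 */
    uint8_t carry = 0;

    for (int i = 0; i < len; i++) {
        uint8_t l = initial_length(src[i]);
        /* right1: length of the previous byte, minus one */
        uint8_t sum = (uint8_t)(l + sat_sub(len_prev, 1));
        /* right2: sum from two bytes back, minus two */
        carry = (uint8_t)(sum + sat_sub(sum_prev2, 2));

        /* overlap or underlap: (carry > length) == (length > 0) */
        if ((carry > l) == (l > 0))
            return -1;

        len_prev = l;
        sum_prev2 = sum_prev1;
        sum_prev1 = sum;
    }

    /* a final carry above 1 means the last sequence is truncated */
    return carry > 1 ? -1 : 0;
}
```

For the bytes E2 82 AC 41 the initial lengths are 3, 0, 0, 1 and the carries become 3, 2, 1, 1, so no position trips the check. Replacing AC with 41 gives a carry of 2 against a length of 1 at the third byte, which is reported as an overlap.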
--------------------------------------------------------------------------------
/lookup.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | /* http://bjoern.hoehrmann.de/utf-8/decoder/dfa */
4 | /* Optimized version based on Rich Felker's variant. */
5 | #define UTF8_ACCEPT 0
6 | #define UTF8_REJECT 12
7 |
8 | static const unsigned char utf8d[] = {
9 | /* The first part of the table maps bytes to character classes
10 | * to reduce the size of the transition table and create bitmasks. */
11 | 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
12 | 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
13 | 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
14 | 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
15 | 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
16 | 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
17 | 8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
18 | 10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8
19 | };
20 | /* Note: Splitting the table improves performance on ARM due to its simpler
21 | * addressing modes not being able to encode x[y + 256]. */
22 | static const unsigned char utf8s[] = {
23 | /* The second part is a transition table that maps a combination
24 | * of a state of the automaton and a character class to a state. */
25 | 0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
26 | 12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
27 | 12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
28 | 12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
29 | 12,36,12,12,12,12,12,12,12,12,12,12
30 | };
31 |
32 | /* Return 0 on success, -1 on error */
33 | int utf8_lookup(const unsigned char *data, int len)
34 | {
35 | int state = 0;
36 |
37 | while (len-- && state != UTF8_REJECT)
38 | state = utf8s[state + utf8d[*data++]];
39 |
40 | return state == UTF8_ACCEPT ? 0 : -1;
41 | }
42 |
--------------------------------------------------------------------------------
/main.c:
--------------------------------------------------------------------------------
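1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <string.h>
4 | #include <stdint.h>
5 | #include <unistd.h>
6 | #include <fcntl.h>
7 | #include <sys/types.h>
8 | #include <sys/stat.h>
9 | #include <sys/time.h>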
10 |
11 | int utf8_naive(const unsigned char *data, int len);
12 | int utf8_lookup(const unsigned char *data, int len);
13 | int utf8_boost(const unsigned char *data, int len);
14 | int utf8_lemire(const unsigned char *data, int len);
15 | int utf8_range(const unsigned char *data, int len);
16 | int utf8_range2(const unsigned char *data, int len);
17 | #ifdef __AVX2__
18 | int utf8_lemire_avx2(const unsigned char *data, int len);
19 | int utf8_range_avx2(const unsigned char *data, int len);
20 | #endif
21 |
22 | static struct ftab {
23 | const char *name;
24 | int (*func)(const unsigned char *data, int len);
25 | } ftab[] = {
26 | {
27 | .name = "naive",
28 | .func = utf8_naive,
29 | },
30 | {
31 | .name = "lookup",
32 | .func = utf8_lookup,
33 | },
34 | {
35 | .name = "lemire",
36 | .func = utf8_lemire,
37 | },
38 | {
39 | .name = "range",
40 | .func = utf8_range,
41 | },
42 | {
43 | .name = "range2",
44 | .func = utf8_range2,
45 | },
46 | #ifdef __AVX2__
47 | {
48 | .name = "lemire_avx2",
49 | .func = utf8_lemire_avx2,
50 | },
51 | {
52 | .name = "range_avx2",
53 | .func = utf8_range_avx2,
54 | },
55 | #endif
56 | #ifdef BOOST
57 | {
58 | .name = "boost",
59 | .func = utf8_boost,
60 | },
61 | #endif
62 | };
63 |
64 | static unsigned char *load_test_buf(int len)
65 | {
66 | const char utf8[] = "\xF0\x90\xBF\x80";
67 | const int utf8_len = sizeof(utf8)/sizeof(utf8[0]) - 1;
68 |
69 | unsigned char *data = malloc(len);
70 | unsigned char *p = data;
71 |
72 | while (len >= utf8_len) {
73 | memcpy(p, utf8, utf8_len);
74 | p += utf8_len;
75 | len -= utf8_len;
76 | }
77 |
78 | while (len--)
79 | *p++ = 0x7F;
80 |
81 | return data;
82 | }
83 |
84 | static unsigned char *load_test_file(int *len)
85 | {
86 | unsigned char *data;
87 | int fd;
88 | struct stat stat;
89 |
90 | fd = open("./UTF-8-demo.txt", O_RDONLY);
91 | if (fd == -1) {
92 | printf("Failed to open UTF-8-demo.txt!\n");
93 | exit(1);
94 | }
95 | if (fstat(fd, &stat) == -1) {
96 | printf("Failed to get file size!\n");
97 | exit(1);
98 | }
99 |
100 | *len = stat.st_size;
101 | data = malloc(*len);
102 | if (read(fd, data, *len) != *len) {
103 | printf("Failed to read file!\n");
104 | exit(1);
105 | }
106 |
107 | utf8_range(data, *len);
108 | #ifdef __AVX2__
109 | utf8_range_avx2(data, *len);
110 | #endif
111 | close(fd);
112 |
113 | return data;
114 | }
115 |
116 | static void print_test(const unsigned char *data, int len)
117 | {
118 | while (len--)
119 | printf("\\x%02X", *data++);
120 |
121 | printf("\n");
122 | }
123 |
124 | struct test {
125 | const unsigned char *data;
126 | int len;
127 | };
128 |
129 | static void prepare_test_buf(unsigned char *buf, const struct test *pos,
130 | int pos_len, int pos_idx)
131 | {
132 | /* Concatenate valid tokens round-robin until 1024 bytes are filled */
133 | int buf_idx = 0;
134 | while (buf_idx < 1024) {
135 | int buf_len = 1024 - buf_idx;
136 |
137 | if (buf_len >= pos[pos_idx].len) {
138 | memcpy(buf+buf_idx, pos[pos_idx].data, pos[pos_idx].len);
139 | buf_idx += pos[pos_idx].len;
140 | } else {
141 | memset(buf+buf_idx, 0, buf_len);
142 | buf_idx += buf_len;
143 | }
144 |
145 | if (++pos_idx == pos_len)
146 | pos_idx = 0;
147 | }
148 | }
149 |
150 | /* Return 0 on success, -1 on error */
151 | static int test_manual(const struct ftab *ftab)
152 | {
153 | #pragma GCC diagnostic push
154 | #pragma GCC diagnostic ignored "-Wpointer-sign"
155 | /* positive tests */
156 | static const struct test pos[] = {
157 | {"", 0},
158 | {"\x00", 1},
159 | {"\x66", 1},
160 | {"\x7F", 1},
161 | {"\x00\x7F", 2},
162 | {"\x7F\x00", 2},
163 | {"\xC2\x80", 2},
164 | {"\xDF\xBF", 2},
165 | {"\xE0\xA0\x80", 3},
166 | {"\xE0\xA0\xBF", 3},
167 | {"\xED\x9F\x80", 3},
168 | {"\xEF\x80\xBF", 3},
169 | {"\xF0\x90\xBF\x80", 4},
170 | {"\xF2\x81\xBE\x99", 4},
171 | {"\xF4\x8F\x88\xAA", 4},
172 | };
173 |
174 | /* negative tests */
175 | static const struct test neg[] = {
176 | {"\x80", 1},
177 | {"\xBF", 1},
178 | {"\xC0\x80", 2},
179 | {"\xC1\x00", 2},
180 | {"\xC2\x7F", 2},
181 | {"\xDF\xC0", 2},
182 | {"\xE0\x9F\x80", 3},
183 | {"\xE0\xC2\x80", 3},
184 | {"\xED\xA0\x80", 3},
185 | {"\xED\x7F\x80", 3},
186 | {"\xEF\x80\x00", 3},
187 | {"\xF0\x8F\x80\x80", 4},
188 | {"\xF0\xEE\x80\x80", 4},
189 | {"\xF2\x90\x91\x7F", 4},
190 | {"\xF4\x90\x88\xAA", 4},
191 | {"\xF4\x00\xBF\xBF", 4},
192 | {"\x00\x00\x00\x00\x00\xC2\x80\x00\x00\x00\xE1\x80\x80\x00\x00\xC2" \
193 | "\xC2\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00",
194 | 32},
195 | {"\x00\x00\x00\x00\x00\xC2\xC2\x80\x00\x00\xE1\x80\x80\x00\x00\x00",
196 | 16},
197 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
198 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80",
199 | 32},
200 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
201 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1",
202 | 32},
203 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
204 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \
205 | "\x80", 33},
206 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
207 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \
208 | "\xC2\x80", 34},
209 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
210 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF0" \
211 | "\x80\x80\x80", 35},
212 | };
213 | #pragma GCC diagnostic pop
214 |
215 | /* Test single token */
216 | for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) {
217 | if (ftab->func(pos[i].data, pos[i].len) != 0) {
218 | printf("FAILED positive test: ");
219 | print_test(pos[i].data, pos[i].len);
220 | return -1;
221 | }
222 | }
223 | for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) {
224 | if (ftab->func(neg[i].data, neg[i].len) == 0) {
225 | printf("FAILED negative test: ");
226 | print_test(neg[i].data, neg[i].len);
227 | return -1;
228 | }
229 | }
230 |
231 | /* Test shifted buffer to cover 1k length */
232 | /* buffer size must be greater than 1024 + 16 + max(test string length) */
233 | const int max_size = 1024*2;
234 | uint64_t buf64[max_size/8 + 2];
235 | /* Offset the 8-byte-aligned buffer by 1 byte so the data is unaligned */
236 | unsigned char *buf = ((unsigned char *)buf64) + 1;
237 | int buf_len;
238 |
239 | for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) {
240 | /* Positive test: shift 16 bytes, validate each shift */
241 | prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), i);
242 | buf_len = 1024;
243 | for (int j = 0; j < 16; ++j) {
244 | if (ftab->func(buf, buf_len) != 0) {
245 | printf("FAILED positive test: ");
246 | print_test(buf, buf_len);
247 | return -1;
248 | }
249 | for (int k = buf_len; k >= 1; --k)
250 | buf[k] = buf[k-1];
251 | buf[0] = '\x55';
252 | ++buf_len;
253 | }
254 |
255 | /* Negative test: truncate the last non-ASCII token */
256 | while (buf_len >= 1 && buf[buf_len-1] <= 0x7F)
257 | --buf_len;
258 | if (buf_len && ftab->func(buf, buf_len-1) == 0) {
259 | printf("FAILED negative test: ");
260 | print_test(buf, buf_len);
261 | return -1;
262 | }
263 | }
264 |
265 | /* Negative test */
266 | for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) {
267 | /* Append one error token, shift 16 bytes, validate each shift */
268 | int pos_idx = i % (sizeof(pos)/sizeof(pos[0]));
269 | prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), pos_idx);
270 | memcpy(buf+1024, neg[i].data, neg[i].len);
271 | buf_len = 1024 + neg[i].len;
272 | for (int j = 0; j < 16; ++j) {
273 | if (ftab->func(buf, buf_len) == 0) {
274 | printf("FAILED negative test: ");
275 | print_test(buf, buf_len);
276 | return -1;
277 | }
278 | for (int k = buf_len; k >= 1; --k)
279 | buf[k] = buf[k-1];
280 | buf[0] = '\x66';
281 | ++buf_len;
282 | }
283 | }
284 |
285 | return 0;
286 | }
287 |
288 | static int test(const unsigned char *data, int len, const struct ftab *ftab)
289 | {
290 | int ret_standard = ftab->func(data, len);
291 | int ret_manual = test_manual(ftab);
292 | printf("%s\n", ftab->name);
293 | printf("standard test: %s\n", ret_standard ? "FAIL" : "pass");
294 | printf("manual test: %s\n", ret_manual ? "FAIL" : "pass");
295 |
296 | return ret_standard | ret_manual;
297 | }
298 |
299 | static int bench(const unsigned char *data, int len, const struct ftab *ftab)
300 | {
301 | const int loops = 1024*1024*1024/len;
302 | int ret = 0;
303 | double time, size;
304 | struct timeval tv1, tv2;
305 |
306 | fprintf(stderr, "bench %s... ", ftab->name);
307 | gettimeofday(&tv1, 0);
308 | for (int i = 0; i < loops; ++i)
309 | ret |= ftab->func(data, len);
310 | gettimeofday(&tv2, 0);
311 | printf("%s\n", ret?"FAIL":"pass");
312 |
313 | time = tv2.tv_usec - tv1.tv_usec;
314 | time = time / 1000000 + tv2.tv_sec - tv1.tv_sec;
315 | size = ((double)len * loops) / (1024*1024);
316 | printf("time: %.4f s\n", time);
317 | printf("data: %.0f MB\n", size);
318 | printf("BW: %.2f MB/s\n", size / time);
319 |
320 | return 0;
321 | }
322 |
323 | static void usage(const char *bin)
324 | {
325 | printf("Usage:\n");
326 | printf("%s test [alg] ==> test all or one algorithm\n", bin);
327 | printf("%s bench [alg] ==> benchmark all or one algorithm\n", bin);
328 | printf("%s bench size NUM ==> benchmark with specific buffer size\n", bin);
329 | printf("alg = ");
330 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i)
331 | printf("%s ", ftab[i].name);
332 | printf("\nNUM = buffer size in bytes, 1 ~ 67108864(64M)\n");
333 | }
334 |
335 | int main(int argc, char *argv[])
336 | {
337 | int len = 0;
338 | unsigned char *data;
339 | const char *alg = NULL;
340 | int (*tb)(const unsigned char *data, int len, const struct ftab *ftab);
341 |
342 | tb = NULL;
343 | if (argc >= 2) {
344 | if (strcmp(argv[1], "test") == 0)
345 | tb = test;
346 | else if (strcmp(argv[1], "bench") == 0)
347 | tb = bench;
348 | if (argc >= 3) {
349 | alg = argv[2];
350 | if (strcmp(alg, "size") == 0) {
351 | if (argc < 4) {
352 | tb = NULL;
353 | } else {
354 | alg = NULL;
355 | len = atoi(argv[3]);
356 | if (len <= 0 || len > 67108864) {
357 | printf("Buffer size error!\n\n");
358 | tb = NULL;
359 | }
360 | }
361 | }
362 | }
363 | }
364 |
365 | if (tb == NULL) {
366 | usage(argv[0]);
367 | return 1;
368 | }
369 |
370 | /* Load UTF8 test buffer */
371 | if (len)
372 | data = load_test_buf(len);
373 | else
374 | data = load_test_file(&len);
375 |
376 | int ret = 0;
377 | if (tb == bench)
378 | printf("=============== Bench UTF8 (%d bytes) ===============\n", len);
379 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) {
380 | if (alg && strcmp(alg, ftab[i].name) != 0)
381 | continue;
382 | ret |= tb((const unsigned char *)data, len, &ftab[i]);
383 | printf("\n");
384 | }
385 |
386 | #if 0
387 | if (tb == bench) {
388 | printf("==================== Bench ASCII ====================\n");
389 | /* Change test buffer to ascii */
390 | for (int i = 0; i < len; i++)
391 | data[i] &= 0x7F;
392 |
393 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) {
394 | if (alg && strcmp(alg, ftab[i].name) != 0)
395 | continue;
396 | tb((const unsigned char *)data, len, &ftab[i]);
397 | printf("\n");
398 | }
399 | }
400 | #endif
401 |
402 | free(data);
403 |
404 | return ret;
405 | }
406 |
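The shifted-buffer tests in test_manual() slide the payload forward one byte at a time so that every multi-byte character crosses each offset within a vector block. A minimal sketch of that shift step follows; `shift_in_byte` is a hypothetical helper name, not part of this repository, and it uses memmove() where the original uses a manual copy loop.

```c
#include <string.h>

/*
 * Hypothetical helper (not in the repository): move the len payload bytes
 * one position to the right and place a one-byte ASCII filler at index 0.
 * The caller must provide room for len + 1 bytes. Returns the new length.
 */
static int shift_in_byte(unsigned char *buf, int len, unsigned char filler)
{
    memmove(buf + 1, buf, (size_t)len);  /* regions overlap, so memmove, not memcpy */
    buf[0] = filler;
    return len + 1;
}
```

Calling this 16 times between validations, as the loops above do with '\x55' and '\x66', would cover every alignment of the payload relative to a 16-byte block.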
--------------------------------------------------------------------------------
/naive.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | /*
4 | * http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf - page 94
5 | *
6 | * Table 3-7. Well-Formed UTF-8 Byte Sequences
7 | *
8 | * +--------------------+------------+-------------+------------+-------------+
9 | * | Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
10 | * +--------------------+------------+-------------+------------+-------------+
11 | * | U+0000..U+007F | 00..7F | | | |
12 | * +--------------------+------------+-------------+------------+-------------+
13 | * | U+0080..U+07FF | C2..DF | 80..BF | | |
14 | * +--------------------+------------+-------------+------------+-------------+
15 | * | U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
16 | * +--------------------+------------+-------------+------------+-------------+
17 | * | U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
18 | * +--------------------+------------+-------------+------------+-------------+
19 | * | U+D000..U+D7FF | ED | 80..9F | 80..BF | |
20 | * +--------------------+------------+-------------+------------+-------------+
21 | * | U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
22 | * +--------------------+------------+-------------+------------+-------------+
23 | * | U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
24 | * +--------------------+------------+-------------+------------+-------------+
25 | * | U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
26 | * +--------------------+------------+-------------+------------+-------------+
27 | * | U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
28 | * +--------------------+------------+-------------+------------+-------------+
29 | */
30 |
31 | /* Return 0 - success, >0 - index(1 based) of first error char */
32 | int utf8_naive(const unsigned char *data, int len)
33 | {
34 | int err_pos = 1;
35 |
36 | while (len) {
37 | int bytes;
38 | const unsigned char byte1 = data[0];
39 |
40 | /* 00..7F */
41 | if (byte1 <= 0x7F) {
42 | bytes = 1;
43 | /* C2..DF, 80..BF */
44 | } else if (len >= 2 && byte1 >= 0xC2 && byte1 <= 0xDF &&
45 | (signed char)data[1] <= (signed char)0xBF) {
46 | bytes = 2;
47 | } else if (len >= 3) {
48 | const unsigned char byte2 = data[1];
49 |
50 | /* Is byte2, byte3 between 0x80 ~ 0xBF */
51 | const int byte2_ok = (signed char)byte2 <= (signed char)0xBF;
52 | const int byte3_ok = (signed char)data[2] <= (signed char)0xBF;
53 |
54 | if (byte2_ok && byte3_ok &&
55 | /* E0, A0..BF, 80..BF */
56 | ((byte1 == 0xE0 && byte2 >= 0xA0) ||
57 | /* E1..EC, 80..BF, 80..BF */
58 | (byte1 >= 0xE1 && byte1 <= 0xEC) ||
59 | /* ED, 80..9F, 80..BF */
60 | (byte1 == 0xED && byte2 <= 0x9F) ||
61 | /* EE..EF, 80..BF, 80..BF */
62 | (byte1 >= 0xEE && byte1 <= 0xEF))) {
63 | bytes = 3;
64 | } else if (len >= 4) {
65 | /* Is byte4 between 0x80 ~ 0xBF */
66 | const int byte4_ok = (signed char)data[3] <= (signed char)0xBF;
67 |
68 | if (byte2_ok && byte3_ok && byte4_ok &&
69 | /* F0, 90..BF, 80..BF, 80..BF */
70 | ((byte1 == 0xF0 && byte2 >= 0x90) ||
71 | /* F1..F3, 80..BF, 80..BF, 80..BF */
72 | (byte1 >= 0xF1 && byte1 <= 0xF3) ||
73 | /* F4, 80..8F, 80..BF, 80..BF */
74 | (byte1 == 0xF4 && byte2 <= 0x8F))) {
75 | bytes = 4;
76 | } else {
77 | return err_pos;
78 | }
79 | } else {
80 | return err_pos;
81 | }
82 | } else {
83 | return err_pos;
84 | }
85 |
86 | len -= bytes;
87 | err_pos += bytes;
88 | data += bytes;
89 | }
90 |
91 | return 0;
92 | }
93 |
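For comparison, Table 3-7 can also be written down as data and walked row by row. The sketch below is an illustration only, assuming the hypothetical names `utf8_row`, `utf8_rows` and `utf8_table_check`; it is not the repository's implementation, but it follows the same return convention as utf8_naive() (0 on success, otherwise the 1-based index of the first bad sequence).

```c
#include <stddef.h>

/* One row of Table 3-7: first-byte range, second-byte range, total length. */
struct utf8_row {
    unsigned char first_lo, first_hi;
    unsigned char second_lo, second_hi;
    int len;
};

static const struct utf8_row utf8_rows[] = {
    { 0x00, 0x7F, 0x00, 0x00, 1 },
    { 0xC2, 0xDF, 0x80, 0xBF, 2 },
    { 0xE0, 0xE0, 0xA0, 0xBF, 3 },
    { 0xE1, 0xEC, 0x80, 0xBF, 3 },
    { 0xED, 0xED, 0x80, 0x9F, 3 },
    { 0xEE, 0xEF, 0x80, 0xBF, 3 },
    { 0xF0, 0xF0, 0x90, 0xBF, 4 },
    { 0xF1, 0xF3, 0x80, 0xBF, 4 },
    { 0xF4, 0xF4, 0x80, 0x8F, 4 },
};

/* Hypothetical helper: 0 if valid, else 1-based index of the first bad sequence. */
static int utf8_table_check(const unsigned char *data, int len)
{
    int pos = 0;

    while (pos < len) {
        const struct utf8_row *row = NULL;

        /* Find the row whose First Byte range contains data[pos] */
        for (size_t r = 0; r < sizeof(utf8_rows)/sizeof(utf8_rows[0]); ++r) {
            if (data[pos] >= utf8_rows[r].first_lo &&
                data[pos] <= utf8_rows[r].first_hi) {
                row = &utf8_rows[r];
                break;
            }
        }
        /* No row (80..C1, F5..FF) or sequence truncated by end of buffer */
        if (row == NULL || len - pos < row->len)
            return pos + 1;
        /* Second Byte has a row-specific range */
        if (row->len >= 2 && (data[pos+1] < row->second_lo ||
                              data[pos+1] > row->second_hi))
            return pos + 1;
        /* Third and Fourth Bytes are always 80..BF */
        for (int k = 2; k < row->len; ++k)
            if (data[pos+k] < 0x80 || data[pos+k] > 0xBF)
                return pos + 1;
        pos += row->len;
    }
    return 0;
}
```

With this sketch, C2 A9 is accepted, while C0 80 (overlong), ED A0 80 (surrogate), F4 90 80 80 (above U+10FFFF) and a truncated E2 82 are rejected.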
--------------------------------------------------------------------------------
/range-avx2.c:
--------------------------------------------------------------------------------
1 | #ifdef __AVX2__
2 |
3 | #include <stdio.h>
4 | #include <stdint.h>
5 | #include <x86intrin.h>
6 |
7 | int utf8_naive(const unsigned char *data, int len);
8 |
9 | #if 0
10 | static void print256(const char *s, const __m256i v256)
11 | {
12 | const unsigned char *v8 = (const unsigned char *)&v256;
13 | if (s)
14 | printf("%s:\t", s);
15 | for (int i = 0; i < 32; i++)
16 | printf("%02x ", v8[i]);
17 | printf("\n");
18 | }
19 | #endif
20 |
21 | /*
22 | * Map high nibble of "First Byte" to legal character length minus 1
23 | * 0x00 ~ 0xBF --> 0
24 | * 0xC0 ~ 0xDF --> 1
25 | * 0xE0 ~ 0xEF --> 2
26 | * 0xF0 ~ 0xFF --> 3
27 | */
28 | static const int8_t _first_len_tbl[] = {
29 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
30 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
31 | };
32 |
33 | /* Map "First Byte" to 8-th item of range table (0xC2 ~ 0xF4) */
34 | static const int8_t _first_range_tbl[] = {
35 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
36 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
37 | };
38 |
39 | /*
40 | * Range table, map range index to min and max values
41 | * Index 0 : 00 ~ 7F (First Byte, ascii)
42 | * Index 1,2,3: 80 ~ BF (Second, Third, Fourth Byte)
43 | * Index 4 : A0 ~ BF (Second Byte after E0)
44 | * Index 5 : 80 ~ 9F (Second Byte after ED)
45 | * Index 6 : 90 ~ BF (Second Byte after F0)
46 | * Index 7 : 80 ~ 8F (Second Byte after F4)
47 | * Index 8 : C2 ~ F4 (First Byte, non ascii)
48 | * Index 9~15 : illegal: min 0x7F (127) > max 0x80 (-128) as int8_t, so no byte passes
49 | */
50 | static const int8_t _range_min_tbl[] = {
51 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
52 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F,
53 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
54 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F,
55 | };
56 | static const int8_t _range_max_tbl[] = {
57 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
58 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
59 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
60 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
61 | };
62 |
63 | /*
64 | * Tables for fast handling of four special First Bytes (E0,ED,F0,F4), after
65 | * which the Second Byte range is not 80~BF. They hold the "range index adjustment".
66 | * +------------+---------------+------------------+----------------+
67 | * | First Byte | original range| range adjustment | adjusted range |
68 | * +------------+---------------+------------------+----------------+
69 | * | E0 | 2 | 2 | 4 |
70 | * +------------+---------------+------------------+----------------+
71 | * | ED | 2 | 3 | 5 |
72 | * +------------+---------------+------------------+----------------+
73 | * | F0 | 3 | 3 | 6 |
74 | * +------------+---------------+------------------+----------------+
75 | * | F4 | 3 | 4 | 7 |
76 | * +------------+---------------+------------------+----------------+
77 | */
78 | /* index1 -> E0, index14 -> ED */
79 | static const int8_t _df_ee_tbl[] = {
80 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
81 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
82 | };
83 | /* index1 -> F0, index5 -> F4 */
84 | static const int8_t _ef_fe_tbl[] = {
85 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
86 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
87 | };
88 |
89 | #define RET_ERR_IDX 0 /* Define 1 to return index of first error char */
90 |
91 | static inline __m256i push_last_byte_of_a_to_b(__m256i a, __m256i b) {
92 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 15);
93 | }
94 |
95 | static inline __m256i push_last_2bytes_of_a_to_b(__m256i a, __m256i b) {
96 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 14);
97 | }
98 |
99 | static inline __m256i push_last_3bytes_of_a_to_b(__m256i a, __m256i b) {
100 | return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 13);
101 | }
102 |
103 | /* 5x faster than naive method */
104 | /* Return 0 - success, -1 - error, >0 - first error char(if RET_ERR_IDX = 1) */
105 | int utf8_range_avx2(const unsigned char *data, int len)
106 | {
107 | #if RET_ERR_IDX
108 | int err_pos = 1;
109 | #endif
110 |
111 | if (len >= 32) {
112 | __m256i prev_input = _mm256_set1_epi8(0);
113 | __m256i prev_first_len = _mm256_set1_epi8(0);
114 |
115 | /* Cached tables */
116 | const __m256i first_len_tbl =
117 | _mm256_loadu_si256((const __m256i *)_first_len_tbl);
118 | const __m256i first_range_tbl =
119 | _mm256_loadu_si256((const __m256i *)_first_range_tbl);
120 | const __m256i range_min_tbl =
121 | _mm256_loadu_si256((const __m256i *)_range_min_tbl);
122 | const __m256i range_max_tbl =
123 | _mm256_loadu_si256((const __m256i *)_range_max_tbl);
124 | const __m256i df_ee_tbl =
125 | _mm256_loadu_si256((const __m256i *)_df_ee_tbl);
126 | const __m256i ef_fe_tbl =
127 | _mm256_loadu_si256((const __m256i *)_ef_fe_tbl);
128 |
129 | #if !RET_ERR_IDX
130 | __m256i error1 = _mm256_set1_epi8(0);
131 | __m256i error2 = _mm256_set1_epi8(0);
132 | #endif
133 |
134 | while (len >= 32) {
135 | const __m256i input = _mm256_loadu_si256((const __m256i *)data);
136 |
137 | /* high_nibbles = input >> 4 */
138 | const __m256i high_nibbles =
139 | _mm256_and_si256(_mm256_srli_epi16(input, 4), _mm256_set1_epi8(0x0F));
140 |
141 | /* first_len = legal character length minus 1 */
142 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */
143 | /* first_len = first_len_tbl[high_nibbles] */
144 | __m256i first_len = _mm256_shuffle_epi8(first_len_tbl, high_nibbles);
145 |
146 | /* First Byte: set range index to 8 for bytes within 0xC0 ~ 0xFF */
147 | /* range = first_range_tbl[high_nibbles] */
148 | __m256i range = _mm256_shuffle_epi8(first_range_tbl, high_nibbles);
149 |
150 | /* Second Byte: set range index to first_len */
151 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */
152 | /* range |= (first_len, prev_first_len) << 1 byte */
153 | range = _mm256_or_si256(
154 | range, push_last_byte_of_a_to_b(prev_first_len, first_len));
155 |
156 | /* Third Byte: set range index to saturate_sub(first_len, 1) */
157 | /* 0 for 00~7F, 0 for C0~DF, 1 for E0~EF, 2 for F0~FF */
158 | __m256i tmp1, tmp2;
159 |
160 | /* tmp1 = (first_len, prev_first_len) << 2 bytes */
161 | tmp1 = push_last_2bytes_of_a_to_b(prev_first_len, first_len);
162 | /* tmp2 = saturate_sub(tmp1, 1) */
163 | tmp2 = _mm256_subs_epu8(tmp1, _mm256_set1_epi8(1));
164 |
165 | /* range |= tmp2 */
166 | range = _mm256_or_si256(range, tmp2);
167 |
168 | /* Fourth Byte: set range index to saturate_sub(first_len, 2) */
169 | /* 0 for 00~7F, 0 for C0~DF, 0 for E0~EF, 1 for F0~FF */
170 | /* tmp1 = (first_len, prev_first_len) << 3 bytes */
171 | tmp1 = push_last_3bytes_of_a_to_b(prev_first_len, first_len);
172 | /* tmp2 = saturate_sub(tmp1, 2) */
173 | tmp2 = _mm256_subs_epu8(tmp1, _mm256_set1_epi8(2));
174 | /* range |= tmp2 */
175 | range = _mm256_or_si256(range, tmp2);
176 |
177 | /*
178 | * Now we have the range indices below calculated
179 | * Correct cases:
180 | * - 8 for C0~FF
181 | * - 3 for 1st byte after F0~FF
182 | * - 2 for 1st byte after E0~EF or 2nd byte after F0~FF
183 | * - 1 for 1st byte after C0~DF or 2nd byte after E0~EF or
184 | * 3rd byte after F0~FF
185 | * - 0 for others
186 | * Error cases:
187 | * 9,10,11 if non ascii First Byte overlaps
188 | * E.g., F1 80 C2 90 --> 8 3 10 2, where 10 indicates error
189 | */
190 |
191 | /* Adjust Second Byte range for special First Bytes(E0,ED,F0,F4) */
192 | /* Overlaps lead to index 9~15, which are illegal in range table */
193 | __m256i shift1, pos, range2;
194 | /* shift1 = (input, prev_input) << 1 byte */
195 | shift1 = push_last_byte_of_a_to_b(prev_input, input);
196 | pos = _mm256_sub_epi8(shift1, _mm256_set1_epi8(0xEF));
197 | /*
198 | * shift1: | EF F0 ... FE | FF 00 ... ... DE | DF E0 ... EE |
199 | * pos: | 0 1 15 | 16 17 239| 240 241 255|
200 | * pos-240: | 0 0 0 | 0 0 0 | 0 1 15 |
201 | * pos+112: | 112 113 127| >= 128 | >= 128 |
202 | */
203 | tmp1 = _mm256_subs_epu8(pos, _mm256_set1_epi8(240));
204 | range2 = _mm256_shuffle_epi8(df_ee_tbl, tmp1);
205 | tmp2 = _mm256_adds_epu8(pos, _mm256_set1_epi8(112));
206 | range2 = _mm256_add_epi8(range2, _mm256_shuffle_epi8(ef_fe_tbl, tmp2));
207 |
208 | range = _mm256_add_epi8(range, range2);
209 |
210 | /* Load min and max values per calculated range index */
211 | __m256i minv = _mm256_shuffle_epi8(range_min_tbl, range);
212 | __m256i maxv = _mm256_shuffle_epi8(range_max_tbl, range);
213 |
214 | /* Check value range */
215 | #if RET_ERR_IDX
216 | __m256i error = _mm256_cmpgt_epi8(minv, input);
217 | error = _mm256_or_si256(error, _mm256_cmpgt_epi8(input, maxv));
218 | /* 5% performance drop from this conditional branch */
219 | if (!_mm256_testz_si256(error, error))
220 | break;
221 | #else
222 | error1 = _mm256_or_si256(error1, _mm256_cmpgt_epi8(minv, input));
223 | error2 = _mm256_or_si256(error2, _mm256_cmpgt_epi8(input, maxv));
224 | #endif
225 |
226 | prev_input = input;
227 | prev_first_len = first_len;
228 |
229 | data += 32;
230 | len -= 32;
231 | #if RET_ERR_IDX
232 | err_pos += 32;
233 | #endif
234 | }
235 |
236 | #if RET_ERR_IDX
237 | /* Error in first 32 bytes */
238 | if (err_pos == 1)
239 | goto do_naive;
240 | #else
241 | __m256i error = _mm256_or_si256(error1, error2);
242 | if (!_mm256_testz_si256(error, error))
243 | return -1;
244 | #endif
245 |
246 | /* Find previous token (not 80~BF) */
247 | int32_t token4 = _mm256_extract_epi32(prev_input, 7);
248 | const int8_t *token = (const int8_t *)&token4;
249 | int lookahead = 0;
250 | if (token[3] > (int8_t)0xBF)
251 | lookahead = 1;
252 | else if (token[2] > (int8_t)0xBF)
253 | lookahead = 2;
254 | else if (token[1] > (int8_t)0xBF)
255 | lookahead = 3;
256 |
257 | data -= lookahead;
258 | len += lookahead;
259 | #if RET_ERR_IDX
260 | err_pos -= lookahead;
261 | #endif
262 | }
263 |
264 | /* Check remaining bytes with naive method */
265 | #if RET_ERR_IDX
266 | int err_pos2;
267 | do_naive:
268 | err_pos2 = utf8_naive(data, len);
269 | if (err_pos2)
270 | return err_pos + err_pos2 - 1;
271 | return 0;
272 | #else
273 | return utf8_naive(data, len);
274 | #endif
275 | }
276 |
277 | #endif
278 |
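The per-byte logic of the range algorithm above can be modelled one byte at a time without SIMD. The following is a scalar sketch under stated assumptions: `utf8_range_scalar`, `sat_sub` and `range_adjust` are hypothetical names, the tables use the unsigned encoding (FF/00 for illegal indices), a switch stands in for the two adjustment shuffles, and three virtual zero bytes are appended so that a truncated final character is caught without a separate tail pass. It is meant to show how the range index is assembled, not to replace the vector code.

```c
#include <stdint.h>

static const uint8_t first_len_tbl[16]   = {0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,3};
static const uint8_t first_range_tbl[16] = {0,0,0,0,0,0,0,0,0,0,0,0,8,8,8,8};
static const uint8_t range_min_tbl[16] = {
    0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
    0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
};
static const uint8_t range_max_tbl[16] = {
    0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
    0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Unsigned saturating subtraction, as done per lane by the vector code */
static uint8_t sat_sub(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Range index adjustment selected by the previous byte */
static uint8_t range_adjust(uint8_t prev)
{
    switch (prev) {
    case 0xE0: return 2;   /* 2 -> 4: A0..BF */
    case 0xED: return 3;   /* 2 -> 5: 80..9F */
    case 0xF0: return 3;   /* 3 -> 6: 90..BF */
    case 0xF4: return 4;   /* 3 -> 7: 80..8F */
    default:   return 0;
    }
}

/* Hypothetical scalar model: return 0 on success, -1 on error */
static int utf8_range_scalar(const unsigned char *data, int len)
{
    uint8_t prev = 0;
    uint8_t len1 = 0, len2 = 0, len3 = 0;  /* first_len of bytes i-1, i-2, i-3 */

    /* Three virtual 0x00 bytes flush any continuation still pending at the end */
    for (int i = 0; i < len + 3; ++i) {
        const uint8_t byte = i < len ? data[i] : 0;
        const uint8_t first_len = first_len_tbl[byte >> 4];
        uint8_t range = first_range_tbl[byte >> 4];   /* 8 for C0~FF */

        range |= len1;                  /* Second Byte */
        range |= sat_sub(len2, 1);      /* Third Byte  */
        range |= sat_sub(len3, 2);      /* Fourth Byte */
        range = (uint8_t)(range + range_adjust(prev));  /* at most 11 + 4 = 15 */

        if (byte < range_min_tbl[range] || byte > range_max_tbl[range])
            return -1;

        prev = byte;
        len3 = len2;
        len2 = len1;
        len1 = first_len;
    }
    return 0;
}
```

Tracing F1 80 C2 90 through this model gives the range indices 8, 3, 10 for the first three bytes, and index 10 rejects C2, matching the overlap example in the comment inside the loop.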
--------------------------------------------------------------------------------
/range-neon.c:
--------------------------------------------------------------------------------
1 | #ifdef __aarch64__
2 |
3 | #include <stdio.h>
4 | #include <stdint.h>
5 | #include <arm_neon.h>
6 |
7 | int utf8_naive(const unsigned char *data, int len);
8 |
9 | #if 0
10 | static void print128(const char *s, const uint8x16_t v128)
11 | {
12 | unsigned char v8[16];
13 | vst1q_u8(v8, v128);
14 |
15 | if (s)
16 | printf("%s:\t", s);
17 | for (int i = 0; i < 16; ++i)
18 | printf("%02x ", v8[i]);
19 | printf("\n");
20 | }
21 | #endif
22 |
23 | /*
24 | * Map high nibble of "First Byte" to legal character length minus 1
25 | * 0x00 ~ 0xBF --> 0
26 | * 0xC0 ~ 0xDF --> 1
27 | * 0xE0 ~ 0xEF --> 2
28 | * 0xF0 ~ 0xFF --> 3
29 | */
30 | static const uint8_t _first_len_tbl[] = {
31 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
32 | };
33 |
34 | /* Map "First Byte" to 8-th item of range table (0xC2 ~ 0xF4) */
35 | static const uint8_t _first_range_tbl[] = {
36 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
37 | };
38 |
39 | /*
40 | * Range table, map range index to min and max values
41 | * Index 0 : 00 ~ 7F (First Byte, ascii)
42 | * Index 1,2,3: 80 ~ BF (Second, Third, Fourth Byte)
43 | * Index 4 : A0 ~ BF (Second Byte after E0)
44 | * Index 5 : 80 ~ 9F (Second Byte after ED)
45 | * Index 6 : 90 ~ BF (Second Byte after F0)
46 | * Index 7 : 80 ~ 8F (Second Byte after F4)
47 | * Index 8 : C2 ~ F4 (First Byte, non ascii)
48 | * Index 9~15 : illegal: min 0xFF > max 0x00, so no byte passes
49 | */
50 | static const uint8_t _range_min_tbl[] = {
51 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
52 | 0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
53 | };
54 | static const uint8_t _range_max_tbl[] = {
55 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
56 | 0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
57 | };
58 |
59 | /*
60 | * This table gives fast handling of four special First Bytes (E0,ED,F0,F4), after
61 | * which the Second Byte range is not 80~BF. It contains the "range index adjustment".
62 | * - The idea is to subtract E0 from the byte and use the result (0~31) as the index to
63 | * look up the "range index adjustment", then add the adjustment to the original
64 | * range index to get the correct range.
65 | * - Range index adjustment
66 | * +------------+---------------+------------------+----------------+
67 | * | First Byte | original range| range adjustment | adjusted range |
68 | * +------------+---------------+------------------+----------------+
69 | * | E0 | 2 | 2 | 4 |
70 | * +------------+---------------+------------------+----------------+
71 | * | ED | 2 | 3 | 5 |
72 | * +------------+---------------+------------------+----------------+
73 | * | F0 | 3 | 3 | 6 |
74 | * +------------+---------------+------------------+----------------+
75 | * | F4 | 3 | 4 | 7 |
76 | * +------------+---------------+------------------+----------------+
77 | * - Below is a uint8x16x2 table, data is interleaved in NEON register. So I'm
78 | * putting it vertically. 1st column is for E0~EF, 2nd column for F0~FF.
79 | */
80 | static const uint8_t _range_adjust_tbl[] = {
81 | /* index -> 0~15 16~31 <- index */
82 | /* E0 -> */ 2, 3, /* <- F0 */
83 | 0, 0,
84 | 0, 0,
85 | 0, 0,
86 | 0, 4, /* <- F4 */
87 | 0, 0,
88 | 0, 0,
89 | 0, 0,
90 | 0, 0,
91 | 0, 0,
92 | 0, 0,
93 | 0, 0,
94 | 0, 0,
95 | /* ED -> */ 3, 0,
96 | 0, 0,
97 | 0, 0,
98 | };
99 |
100 | /* 2x ~ 4x faster than naive method */
101 | /* Return 0 on success, -1 on error */
102 | int utf8_range(const unsigned char *data, int len)
103 | {
104 | if (len >= 16) {
105 | uint8x16_t prev_input = vdupq_n_u8(0);
106 | uint8x16_t prev_first_len = vdupq_n_u8(0);
107 |
108 | /* Cached tables */
109 | const uint8x16_t first_len_tbl = vld1q_u8(_first_len_tbl);
110 | const uint8x16_t first_range_tbl = vld1q_u8(_first_range_tbl);
111 | const uint8x16_t range_min_tbl = vld1q_u8(_range_min_tbl);
112 | const uint8x16_t range_max_tbl = vld1q_u8(_range_max_tbl);
113 | const uint8x16x2_t range_adjust_tbl = vld2q_u8(_range_adjust_tbl);
114 |
115 | /* Cached values */
116 | const uint8x16_t const_1 = vdupq_n_u8(1);
117 | const uint8x16_t const_2 = vdupq_n_u8(2);
118 | const uint8x16_t const_e0 = vdupq_n_u8(0xE0);
119 |
120 | /* We use two error registers to remove a dependency. */
121 | uint8x16_t error1 = vdupq_n_u8(0);
122 | uint8x16_t error2 = vdupq_n_u8(0);
123 |
124 | while (len >= 16) {
125 | const uint8x16_t input = vld1q_u8(data);
126 |
127 | /* high_nibbles = input >> 4 */
128 | const uint8x16_t high_nibbles = vshrq_n_u8(input, 4);
129 |
130 | /* first_len = legal character length minus 1 */
131 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */
132 | /* first_len = first_len_tbl[high_nibbles] */
133 | const uint8x16_t first_len =
134 | vqtbl1q_u8(first_len_tbl, high_nibbles);
135 |
136 | /* First Byte: set range index to 8 for bytes within 0xC0 ~ 0xFF */
137 | /* range = first_range_tbl[high_nibbles] */
138 | uint8x16_t range = vqtbl1q_u8(first_range_tbl, high_nibbles);
139 |
140 | /* Second Byte: set range index to first_len */
141 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */
142 | /* range |= (first_len, prev_first_len) << 1 byte */
143 | range =
144 | vorrq_u8(range, vextq_u8(prev_first_len, first_len, 15));
145 |
146 | /* Third Byte: set range index to saturate_sub(first_len, 1) */
147 | /* 0 for 00~7F, 0 for C0~DF, 1 for E0~EF, 2 for F0~FF */
148 | uint8x16_t tmp1, tmp2;
149 | /* tmp1 = (first_len, prev_first_len) << 2 bytes */
150 | tmp1 = vextq_u8(prev_first_len, first_len, 14);
151 | /* tmp1 = saturate_sub(tmp1, 1) */
152 | tmp1 = vqsubq_u8(tmp1, const_1);
153 | /* range |= tmp1 */
154 | range = vorrq_u8(range, tmp1);
155 |
156 | /* Fourth Byte: set range index to saturate_sub(first_len, 2) */
157 | /* 0 for 00~7F, 0 for C0~DF, 0 for E0~EF, 1 for F0~FF */
158 | /* tmp2 = (first_len, prev_first_len) << 3 bytes */
159 | tmp2 = vextq_u8(prev_first_len, first_len, 13);
160 | /* tmp2 = saturate_sub(tmp2, 2) */
161 | tmp2 = vqsubq_u8(tmp2, const_2);
162 | /* range |= tmp2 */
163 | range = vorrq_u8(range, tmp2);
164 |
165 | /*
166 | * Now we have the range indices below calculated
167 | * Correct cases:
168 | * - 8 for C0~FF
169 | * - 3 for 1st byte after F0~FF
170 | * - 2 for 1st byte after E0~EF or 2nd byte after F0~FF
171 | * - 1 for 1st byte after C0~DF or 2nd byte after E0~EF or
172 | * 3rd byte after F0~FF
173 | * - 0 for others
174 | * Error cases:
175 | * 9,10,11 if non ascii First Byte overlaps
176 | * E.g., F1 80 C2 90 --> 8 3 10 2, where 10 indicates error
177 | */
178 |
179 | /* Adjust Second Byte range for special First Bytes(E0,ED,F0,F4) */
180 | /* See _range_adjust_tbl[] definition for details */
181 | /* Overlaps lead to index 9~15, which are illegal in range table */
182 | uint8x16_t shift1 = vextq_u8(prev_input, input, 15);
183 | uint8x16_t pos = vsubq_u8(shift1, const_e0);
184 | range = vaddq_u8(range, vqtbl2q_u8(range_adjust_tbl, pos));
185 |
186 | /* Load min and max values per calculated range index */
187 | uint8x16_t minv = vqtbl1q_u8(range_min_tbl, range);
188 | uint8x16_t maxv = vqtbl1q_u8(range_max_tbl, range);
189 |
190 | /* Check value range */
191 | error1 = vorrq_u8(error1, vcltq_u8(input, minv));
192 | error2 = vorrq_u8(error2, vcgtq_u8(input, maxv));
193 |
194 | prev_input = input;
195 | prev_first_len = first_len;
196 |
197 | data += 16;
198 | len -= 16;
199 | }
200 | /* Merge the two error masks */
201 | error1 = vorrq_u8(error1, error2);
202 |
203 | /* Delay error check till loop ends */
204 | if (vmaxvq_u8(error1))
205 | return -1;
206 |
207 | /* Find previous token (not 80~BF) */
208 | uint32_t token4;
209 | vst1q_lane_u32(&token4, vreinterpretq_u32_u8(prev_input), 3);
210 |
211 | const int8_t *token = (const int8_t *)&token4;
212 | int lookahead = 0;
213 | if (token[3] > (int8_t)0xBF)
214 | lookahead = 1;
215 | else if (token[2] > (int8_t)0xBF)
216 | lookahead = 2;
217 | else if (token[1] > (int8_t)0xBF)
218 | lookahead = 3;
219 |
220 | data -= lookahead;
221 | len += lookahead;
222 | }
223 |
224 | /* Check remaining bytes with naive method */
225 | return utf8_naive(data, len);
226 | }
227 |
228 | #endif
229 |
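After the vector loop, the "Find previous token" step decides how far to step back so that the scalar fallback starts on a character boundary. The sketch below restates that decision with unsigned byte tests; `is_continuation` and `tail_backtrack` are hypothetical names, and `end` is assumed to point one past the last byte consumed by the vector loop, with at least three readable bytes before it.

```c
/* A continuation byte is 80..BF, i.e. its top two bits are 10 */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: number of bytes to step back from end so that the
 * scalar validator re-reads the last character that may straddle the
 * block boundary. Mirrors the token[3], token[2], token[1] checks, where
 * the signed comparison (int8_t)b > (int8_t)0xBF is true exactly when b
 * is not in 80..BF.
 */
static int tail_backtrack(const unsigned char *end)
{
    if (!is_continuation(end[-1]))
        return 1;
    if (!is_continuation(end[-2]))
        return 2;
    if (!is_continuation(end[-3]))
        return 3;
    /* Three continuations in a row: a 4-byte character ended exactly here */
    return 0;
}
```

Returning 1 when the last byte is plain ASCII is harmless: the scalar pass simply re-validates that one byte.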
--------------------------------------------------------------------------------
/range-sse.c:
--------------------------------------------------------------------------------
1 | #ifdef __x86_64__
2 |
3 | #include <stdio.h>
4 | #include <stdint.h>
5 | #include <x86intrin.h>
6 |
7 | int utf8_naive(const unsigned char *data, int len);
8 |
9 | #if 0
10 | static void print128(const char *s, const __m128i v128)
11 | {
12 | const unsigned char *v8 = (const unsigned char *)&v128;
13 | if (s)
14 | printf("%s:\t", s);
15 | for (int i = 0; i < 16; i++)
16 | printf("%02x ", v8[i]);
17 | printf("\n");
18 | }
19 | #endif
20 |
21 | /*
22 | * Map high nibble of "First Byte" to legal character length minus 1
23 | * 0x00 ~ 0xBF --> 0
24 | * 0xC0 ~ 0xDF --> 1
25 | * 0xE0 ~ 0xEF --> 2
26 | * 0xF0 ~ 0xFF --> 3
27 | */
28 | static const int8_t _first_len_tbl[] = {
29 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
30 | };
31 |
32 | /* Map "First Byte" to 8-th item of range table (0xC2 ~ 0xF4) */
33 | static const int8_t _first_range_tbl[] = {
34 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
35 | };
36 |
37 | /*
38 | * Range table, map range index to min and max values
39 | * Index 0 : 00 ~ 7F (First Byte, ascii)
40 | * Index 1,2,3: 80 ~ BF (Second, Third, Fourth Byte)
41 | * Index 4 : A0 ~ BF (Second Byte after E0)
42 | * Index 5 : 80 ~ 9F (Second Byte after ED)
43 | * Index 6 : 90 ~ BF (Second Byte after F0)
44 | * Index 7 : 80 ~ 8F (Second Byte after F4)
45 | * Index 8 : C2 ~ F4 (First Byte, non ascii)
46 | * Index 9~15 : illegal: min 0x7F (127) > max 0x80 (-128) as int8_t, so no byte passes
47 | */
48 | static const int8_t _range_min_tbl[] = {
49 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
50 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F,
51 | };
52 | static const int8_t _range_max_tbl[] = {
53 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
54 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
55 | };
56 |
57 | /*
58 | * Tables for fast handling of four special First Bytes (E0,ED,F0,F4), after
59 | * which the Second Byte range is not 80~BF. They hold the "range index adjustment".
60 | * +------------+---------------+------------------+----------------+
61 | * | First Byte | original range| range adjustment | adjusted range |
62 | * +------------+---------------+------------------+----------------+
63 | * | E0 | 2 | 2 | 4 |
64 | * +------------+---------------+------------------+----------------+
65 | * | ED | 2 | 3 | 5 |
66 | * +------------+---------------+------------------+----------------+
67 | * | F0 | 3 | 3 | 6 |
68 | * +------------+---------------+------------------+----------------+
69 | * | F4 | 3 | 4 | 7 |
70 | * +------------+---------------+------------------+----------------+
71 | */
72 | /* index1 -> E0, index14 -> ED */
73 | static const int8_t _df_ee_tbl[] = {
74 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
75 | };
76 | /* index1 -> F0, index5 -> F4 */
77 | static const int8_t _ef_fe_tbl[] = {
78 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
79 | };
80 |
81 | #define RET_ERR_IDX 0 /* Define 1 to return index of first error char */
82 |
83 | /* 5x faster than naive method */
84 | /* Return 0 - success, -1 - error, >0 - first error char(if RET_ERR_IDX = 1) */
85 | int utf8_range(const unsigned char *data, int len)
86 | {
87 | #if RET_ERR_IDX
88 | int err_pos = 1;
89 | #endif
90 |
91 | if (len >= 16) {
92 | __m128i prev_input = _mm_set1_epi8(0);
93 | __m128i prev_first_len = _mm_set1_epi8(0);
94 |
95 | /* Cached tables */
96 | const __m128i first_len_tbl =
97 | _mm_loadu_si128((const __m128i *)_first_len_tbl);
98 | const __m128i first_range_tbl =
99 | _mm_loadu_si128((const __m128i *)_first_range_tbl);
100 | const __m128i range_min_tbl =
101 | _mm_loadu_si128((const __m128i *)_range_min_tbl);
102 | const __m128i range_max_tbl =
103 | _mm_loadu_si128((const __m128i *)_range_max_tbl);
104 | const __m128i df_ee_tbl =
105 | _mm_loadu_si128((const __m128i *)_df_ee_tbl);
106 | const __m128i ef_fe_tbl =
107 | _mm_loadu_si128((const __m128i *)_ef_fe_tbl);
108 |
109 | __m128i error = _mm_set1_epi8(0);
110 |
111 | while (len >= 16) {
112 | const __m128i input = _mm_loadu_si128((const __m128i *)data);
113 |
114 | /* high_nibbles = input >> 4 */
115 | const __m128i high_nibbles =
116 | _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0F));
117 |
118 | /* first_len = legal character length minus 1 */
119 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */
120 | /* first_len = first_len_tbl[high_nibbles] */
121 | __m128i first_len = _mm_shuffle_epi8(first_len_tbl, high_nibbles);
122 |
123 | /* First Byte: set range index to 8 for bytes within 0xC0 ~ 0xFF */
124 | /* range = first_range_tbl[high_nibbles] */
125 | __m128i range = _mm_shuffle_epi8(first_range_tbl, high_nibbles);
126 |
127 | /* Second Byte: set range index to first_len */
128 | /* 0 for 00~7F, 1 for C0~DF, 2 for E0~EF, 3 for F0~FF */
129 | /* range |= (first_len, prev_first_len) << 1 byte */
130 | range = _mm_or_si128(
131 | range, _mm_alignr_epi8(first_len, prev_first_len, 15));
132 |
133 | /* Third Byte: set range index to saturate_sub(first_len, 1) */
134 | /* 0 for 00~7F, 0 for C0~DF, 1 for E0~EF, 2 for F0~FF */
135 | __m128i tmp;
136 | /* tmp = (first_len, prev_first_len) << 2 bytes */
137 | tmp = _mm_alignr_epi8(first_len, prev_first_len, 14);
138 | /* tmp = saturate_sub(tmp, 1) */
139 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(1));
140 | /* range |= tmp */
141 | range = _mm_or_si128(range, tmp);
142 |
143 | /* Fourth Byte: set range index to saturate_sub(first_len, 2) */
144 | /* 0 for 00~7F, 0 for C0~DF, 0 for E0~EF, 1 for F0~FF */
145 | /* tmp = (first_len, prev_first_len) << 3 bytes */
146 | tmp = _mm_alignr_epi8(first_len, prev_first_len, 13);
147 | /* tmp = saturate_sub(tmp, 2) */
148 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(2));
149 | /* range |= tmp */
150 | range = _mm_or_si128(range, tmp);
151 |
152 | /*
153 | * Now we have the range indices below calculated
154 | * Correct cases:
155 | * - 8 for C0~FF
156 | * - 3 for 1st byte after F0~FF
157 | * - 2 for 1st byte after E0~EF or 2nd byte after F0~FF
158 | * - 1 for 1st byte after C0~DF or 2nd byte after E0~EF or
159 | * 3rd byte after F0~FF
160 | * - 0 for others
161 | * Error cases:
162 | * 9,10,11 if non ascii First Byte overlaps
163 | * E.g., F1 80 C2 90 --> 8 3 10 2, where 10 indicates error
164 | */
165 |
166 | /* Adjust Second Byte range for special First Bytes(E0,ED,F0,F4) */
167 | /* Overlaps lead to index 9~15, which are illegal in range table */
168 | __m128i shift1, pos, range2;
169 | /* shift1 = (input, prev_input) << 1 byte */
170 | shift1 = _mm_alignr_epi8(input, prev_input, 15);
171 | pos = _mm_sub_epi8(shift1, _mm_set1_epi8(0xEF));
172 | /*
173 | * shift1: | EF F0 ... FE | FF 00 ... ... DE | DF E0 ... EE |
174 | * pos: | 0 1 15 | 16 17 239| 240 241 255|
175 | * pos-240: | 0 0 0 | 0 0 0 | 0 1 15 |
176 | * pos+112: | 112 113 127| >= 128 | >= 128 |
177 | */
178 | tmp = _mm_subs_epu8(pos, _mm_set1_epi8(0xF0));
179 | range2 = _mm_shuffle_epi8(df_ee_tbl, tmp);
180 | tmp = _mm_adds_epu8(pos, _mm_set1_epi8(0x70));
181 | range2 = _mm_add_epi8(range2, _mm_shuffle_epi8(ef_fe_tbl, tmp));
182 |
183 | range = _mm_add_epi8(range, range2);
184 |
185 | /* Load min and max values per calculated range index */
186 | __m128i minv = _mm_shuffle_epi8(range_min_tbl, range);
187 | __m128i maxv = _mm_shuffle_epi8(range_max_tbl, range);
188 |
189 | /* Check value range */
190 | #if RET_ERR_IDX
191 | error = _mm_cmplt_epi8(input, minv);
192 | error = _mm_or_si128(error, _mm_cmpgt_epi8(input, maxv));
193 | /* 5% performance drop from this conditional branch */
194 | if (!_mm_testz_si128(error, error))
195 | break;
196 | #else
197 | /* error |= (input < minv) | (input > maxv) */
198 | tmp = _mm_or_si128(
199 | _mm_cmplt_epi8(input, minv),
200 | _mm_cmpgt_epi8(input, maxv)
201 | );
202 | error = _mm_or_si128(error, tmp);
203 | #endif
204 |
205 | prev_input = input;
206 | prev_first_len = first_len;
207 |
208 | data += 16;
209 | len -= 16;
210 | #if RET_ERR_IDX
211 | err_pos += 16;
212 | #endif
213 | }
214 |
215 | #if RET_ERR_IDX
216 | /* Error in first 16 bytes */
217 | if (err_pos == 1)
218 | goto do_naive;
219 | #else
220 | if (!_mm_testz_si128(error, error))
221 | return -1;
222 | #endif
223 |
224 | /* Find previous token (not 80~BF) */
225 | int32_t token4 = _mm_extract_epi32(prev_input, 3);
226 | const int8_t *token = (const int8_t *)&token4;
227 | int lookahead = 0;
228 | if (token[3] > (int8_t)0xBF)
229 | lookahead = 1;
230 | else if (token[2] > (int8_t)0xBF)
231 | lookahead = 2;
232 | else if (token[1] > (int8_t)0xBF)
233 | lookahead = 3;
234 |
235 | data -= lookahead;
236 | len += lookahead;
237 | #if RET_ERR_IDX
238 | err_pos -= lookahead;
239 | #endif
240 | }
241 |
242 | /* Check remaining bytes with naive method */
243 | #if RET_ERR_IDX
244 | int err_pos2;
245 | do_naive:
246 | err_pos2 = utf8_naive(data, len);
247 | if (err_pos2)
248 | return err_pos + err_pos2 - 1;
249 | return 0;
250 | #else
251 | return utf8_naive(data, len);
252 | #endif
253 | }
254 |
255 | #endif
256 |
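The range-index scheme described in the comments above can be modelled one byte at a time. The following is a minimal scalar sketch, not the repository's implementation: the names `utf8_range_scalar_sketch`, `second_byte_adjust` and `sat_sub` are hypothetical, the tables are the unsigned variants used by the NEON sources, and the final "pending" check stands in for the hand-off to `utf8_naive` that the SIMD code performs for the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tables copied from the range-*.c sources (unsigned form). */
static const uint8_t first_len_tbl[16] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
};
static const uint8_t first_range_tbl[16] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
};
static const uint8_t range_min_tbl[16] = {
    0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
    0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
};
static const uint8_t range_max_tbl[16] = {
    0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
    0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Unsigned saturating subtraction, one lane of _mm_subs_epu8 / vqsubq_u8. */
static uint8_t sat_sub(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Extra range offset for the byte that follows E0, ED, F0 or F4. */
static uint8_t second_byte_adjust(uint8_t prev)
{
    switch (prev) {
    case 0xE0: return 2;
    case 0xED: return 3;
    case 0xF0: return 3;
    case 0xF4: return 4;
    default:   return 0;
    }
}

/* Hypothetical scalar model. Return 0 on success, -1 on error. */
int utf8_range_scalar_sketch(const unsigned char *data, size_t len)
{
    /* first_len of the bytes at i-1, i-2 and i-3 */
    uint8_t len1 = 0, len2 = 0, len3 = 0;
    uint8_t prev = 0;

    for (size_t i = 0; i < len; ++i) {
        const uint8_t b = data[i];
        const uint8_t hn = (uint8_t)(b >> 4);
        uint8_t range = first_range_tbl[hn];   /* 8 for C0~FF */

        range |= len1;                         /* 2nd byte of a sequence */
        range |= sat_sub(len2, 1);             /* 3rd byte */
        range |= sat_sub(len3, 2);             /* 4th byte */
        range = (uint8_t)(range + second_byte_adjust(prev));

        /* Indices 9~15 have min > max, so they always fail here. */
        if (b < range_min_tbl[range] || b > range_max_tbl[range])
            return -1;

        len3 = len2;
        len2 = len1;
        len1 = first_len_tbl[hn];
        prev = b;
    }

    /* A non-zero index for the next byte means a truncated sequence. */
    if (len1 | sat_sub(len2, 1) | sat_sub(len3, 2))
        return -1;
    return 0;
}

void utf8_range_scalar_selfcheck(void)
{
    /* The overlap example from the comment: F1 80 C2 90 is rejected. */
    assert(utf8_range_scalar_sketch(
               (const unsigned char *)"\xF1\x80\xC2\x90", 4) == -1);
    assert(utf8_range_scalar_sketch(
               (const unsigned char *)"\xF4\x8F\x88\xAA", 4) == 0);
}
```

The sketch keeps the OR-then-add order of the vector code, so an overlapping lead byte still produces an index in 9~15 and is caught by the min/max comparison rather than by a separate branch.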
--------------------------------------------------------------------------------
/range.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cyb70289/utf8/d7e2737acc1a7416a5ea13bf9e0d693453e775be/range.png
--------------------------------------------------------------------------------
/range2-neon.c:
--------------------------------------------------------------------------------
1 | /*
2 | * Process 2x16 bytes in each iteration.
3 | * Comments removed for brevity. See range-neon.c for details.
4 | */
5 | #ifdef __aarch64__
6 |
7 | #include <stdio.h>
8 | #include <stdint.h>
9 | #include <arm_neon.h>
10 |
11 | int utf8_naive(const unsigned char *data, int len);
12 |
13 | static const uint8_t _first_len_tbl[] = {
14 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
15 | };
16 |
17 | static const uint8_t _first_range_tbl[] = {
18 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
19 | };
20 |
21 | static const uint8_t _range_min_tbl[] = {
22 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
23 | 0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
24 | };
25 | static const uint8_t _range_max_tbl[] = {
26 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
27 | 0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
28 | };
29 |
30 | static const uint8_t _range_adjust_tbl[] = {
31 | 2, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0,
32 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
33 | };
34 |
35 | /* Return 0 on success, -1 on error */
36 | int utf8_range2(const unsigned char *data, int len)
37 | {
38 | if (len >= 32) {
39 | uint8x16_t prev_input = vdupq_n_u8(0);
40 | uint8x16_t prev_first_len = vdupq_n_u8(0);
41 |
42 | const uint8x16_t first_len_tbl = vld1q_u8(_first_len_tbl);
43 | const uint8x16_t first_range_tbl = vld1q_u8(_first_range_tbl);
44 | const uint8x16_t range_min_tbl = vld1q_u8(_range_min_tbl);
45 | const uint8x16_t range_max_tbl = vld1q_u8(_range_max_tbl);
46 | const uint8x16x2_t range_adjust_tbl = vld2q_u8(_range_adjust_tbl);
47 |
48 | const uint8x16_t const_1 = vdupq_n_u8(1);
49 | const uint8x16_t const_2 = vdupq_n_u8(2);
50 | const uint8x16_t const_e0 = vdupq_n_u8(0xE0);
51 |
52 | uint8x16_t error1 = vdupq_n_u8(0);
53 | uint8x16_t error2 = vdupq_n_u8(0);
54 | uint8x16_t error3 = vdupq_n_u8(0);
55 | uint8x16_t error4 = vdupq_n_u8(0);
56 |
57 | while (len >= 32) {
58 | /******************* two blocks interleaved **********************/
59 |
60 | #if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ < 8)
61 | /* gcc doesn't support vld1q_u8_x2 until version 8 */
62 | const uint8x16_t input_a = vld1q_u8(data);
63 | const uint8x16_t input_b = vld1q_u8(data + 16);
64 | #else
65 | /* Forces a double load on Clang */
66 | const uint8x16x2_t input_pair = vld1q_u8_x2(data);
67 | const uint8x16_t input_a = input_pair.val[0];
68 | const uint8x16_t input_b = input_pair.val[1];
69 | #endif
70 |
71 | const uint8x16_t high_nibbles_a = vshrq_n_u8(input_a, 4);
72 | const uint8x16_t high_nibbles_b = vshrq_n_u8(input_b, 4);
73 |
74 | const uint8x16_t first_len_a =
75 | vqtbl1q_u8(first_len_tbl, high_nibbles_a);
76 | const uint8x16_t first_len_b =
77 | vqtbl1q_u8(first_len_tbl, high_nibbles_b);
78 |
79 | uint8x16_t range_a = vqtbl1q_u8(first_range_tbl, high_nibbles_a);
80 | uint8x16_t range_b = vqtbl1q_u8(first_range_tbl, high_nibbles_b);
81 |
82 | range_a =
83 | vorrq_u8(range_a, vextq_u8(prev_first_len, first_len_a, 15));
84 | range_b =
85 | vorrq_u8(range_b, vextq_u8(first_len_a, first_len_b, 15));
86 |
87 | uint8x16_t tmp1_a, tmp2_a, tmp1_b, tmp2_b;
88 | tmp1_a = vextq_u8(prev_first_len, first_len_a, 14);
89 | tmp1_a = vqsubq_u8(tmp1_a, const_1);
90 | range_a = vorrq_u8(range_a, tmp1_a);
91 |
92 | tmp1_b = vextq_u8(first_len_a, first_len_b, 14);
93 | tmp1_b = vqsubq_u8(tmp1_b, const_1);
94 | range_b = vorrq_u8(range_b, tmp1_b);
95 |
96 | tmp2_a = vextq_u8(prev_first_len, first_len_a, 13);
97 | tmp2_a = vqsubq_u8(tmp2_a, const_2);
98 | range_a = vorrq_u8(range_a, tmp2_a);
99 |
100 | tmp2_b = vextq_u8(first_len_a, first_len_b, 13);
101 | tmp2_b = vqsubq_u8(tmp2_b, const_2);
102 | range_b = vorrq_u8(range_b, tmp2_b);
103 |
104 | uint8x16_t shift1_a = vextq_u8(prev_input, input_a, 15);
105 | uint8x16_t pos_a = vsubq_u8(shift1_a, const_e0);
106 | range_a = vaddq_u8(range_a, vqtbl2q_u8(range_adjust_tbl, pos_a));
107 |
108 | uint8x16_t shift1_b = vextq_u8(input_a, input_b, 15);
109 | uint8x16_t pos_b = vsubq_u8(shift1_b, const_e0);
110 | range_b = vaddq_u8(range_b, vqtbl2q_u8(range_adjust_tbl, pos_b));
111 |
112 | uint8x16_t minv_a = vqtbl1q_u8(range_min_tbl, range_a);
113 | uint8x16_t maxv_a = vqtbl1q_u8(range_max_tbl, range_a);
114 |
115 | uint8x16_t minv_b = vqtbl1q_u8(range_min_tbl, range_b);
116 | uint8x16_t maxv_b = vqtbl1q_u8(range_max_tbl, range_b);
117 |
118 | error1 = vorrq_u8(error1, vcltq_u8(input_a, minv_a));
119 | error2 = vorrq_u8(error2, vcgtq_u8(input_a, maxv_a));
120 |
121 | error3 = vorrq_u8(error3, vcltq_u8(input_b, minv_b));
122 | error4 = vorrq_u8(error4, vcgtq_u8(input_b, maxv_b));
123 |
124 | /************************ next iteration *************************/
125 | prev_input = input_b;
126 | prev_first_len = first_len_b;
127 |
128 | data += 32;
129 | len -= 32;
130 | }
131 | error1 = vorrq_u8(error1, error2);
132 | error1 = vorrq_u8(error1, error3);
133 | error1 = vorrq_u8(error1, error4);
134 |
135 | if (vmaxvq_u8(error1))
136 | return -1;
137 |
138 | uint32_t token4;
139 | vst1q_lane_u32(&token4, vreinterpretq_u32_u8(prev_input), 3);
140 |
141 | const int8_t *token = (const int8_t *)&token4;
142 | int lookahead = 0;
143 | if (token[3] > (int8_t)0xBF)
144 | lookahead = 1;
145 | else if (token[2] > (int8_t)0xBF)
146 | lookahead = 2;
147 | else if (token[1] > (int8_t)0xBF)
148 | lookahead = 3;
149 |
150 | data -= lookahead;
151 | len += lookahead;
152 | }
153 |
154 | return utf8_naive(data, len);
155 | }
156 |
157 | #endif
158 |
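The NEON variant stores its second-byte adjustments in one 32-byte table that is loaded with `vld2q_u8` and indexed with `vqtbl2q_u8`. As a rough illustration of why the table looks interleaved, here is a scalar sketch under the assumption that `vld2q_u8` sends even-numbered bytes to the first register and odd-numbered bytes to the second, and that `vqtbl2q_u8` yields 0 for indices of 32 and above. The helper names `deinterleave32`, `lookup2` and `neon_adjust_sketch` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same contents as _range_adjust_tbl in range2-neon.c. */
static const uint8_t range_adjust_tbl[32] = {
    2, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
};

/* Scalar stand-in for vld2q_u8: even bytes go to lo[], odd bytes to hi[]. */
static void deinterleave32(const uint8_t src[32],
                           uint8_t lo[16], uint8_t hi[16])
{
    for (int i = 0; i < 16; ++i) {
        lo[i] = src[2 * i];
        hi[i] = src[2 * i + 1];
    }
}

/* Scalar stand-in for one lane of vqtbl2q_u8: out-of-range index gives 0. */
static uint8_t lookup2(const uint8_t lo[16], const uint8_t hi[16], uint8_t idx)
{
    if (idx < 16)
        return lo[idx];
    if (idx < 32)
        return hi[idx - 16];
    return 0;
}

/* Adjustment applied to the byte that follows prev. */
uint8_t neon_adjust_sketch(uint8_t prev)
{
    uint8_t lo[16], hi[16];

    deinterleave32(range_adjust_tbl, lo, hi);
    /* pos = prev - 0xE0 wraps around for prev < 0xE0, giving idx >= 32 */
    return lookup2(lo, hi, (uint8_t)(prev - 0xE0));
}

void neon_adjust_selfcheck(void)
{
    assert(neon_adjust_sketch(0xE0) == 2);
    assert(neon_adjust_sketch(0xED) == 3);
    assert(neon_adjust_sketch(0xF0) == 3);
    assert(neon_adjust_sketch(0xF4) == 4);
}
```

With this layout a single subtraction and one table lookup cover all four special lead bytes, which is why the NEON code needs no counterpart to the two separate SSE tables.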
--------------------------------------------------------------------------------
/range2-sse.c:
--------------------------------------------------------------------------------
1 | /*
2 | * Process 2x16 bytes in each iteration.
3 | * Comments removed for brevity. See range-sse.c for details.
4 | */
5 | #ifdef __x86_64__
6 |
7 | #include <stdio.h>
8 | #include <stdint.h>
9 | #include <x86intrin.h>
10 |
11 | int utf8_naive(const unsigned char *data, int len);
12 |
13 | static const int8_t _first_len_tbl[] = {
14 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
15 | };
16 |
17 | static const int8_t _first_range_tbl[] = {
18 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
19 | };
20 |
21 | static const int8_t _range_min_tbl[] = {
22 | 0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
23 | 0xC2, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F, 0x7F,
24 | };
25 | static const int8_t _range_max_tbl[] = {
26 | 0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
27 | 0xF4, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,
28 | };
29 |
30 | static const int8_t _df_ee_tbl[] = {
31 | 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
32 | };
33 | static const int8_t _ef_fe_tbl[] = {
34 | 0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
35 | };
36 |
37 | /* Return 0 on success, -1 on error */
38 | int utf8_range2(const unsigned char *data, int len)
39 | {
40 | if (len >= 32) {
41 | __m128i prev_input = _mm_set1_epi8(0);
42 | __m128i prev_first_len = _mm_set1_epi8(0);
43 |
44 | const __m128i first_len_tbl =
45 | _mm_loadu_si128((const __m128i *)_first_len_tbl);
46 | const __m128i first_range_tbl =
47 | _mm_loadu_si128((const __m128i *)_first_range_tbl);
48 | const __m128i range_min_tbl =
49 | _mm_loadu_si128((const __m128i *)_range_min_tbl);
50 | const __m128i range_max_tbl =
51 | _mm_loadu_si128((const __m128i *)_range_max_tbl);
52 | const __m128i df_ee_tbl =
53 | _mm_loadu_si128((const __m128i *)_df_ee_tbl);
54 | const __m128i ef_fe_tbl =
55 | _mm_loadu_si128((const __m128i *)_ef_fe_tbl);
56 |
57 | __m128i error = _mm_set1_epi8(0);
58 |
59 | while (len >= 32) {
60 | /***************************** block 1 ****************************/
61 | const __m128i input_a = _mm_loadu_si128((const __m128i *)data);
62 |
63 | __m128i high_nibbles =
64 | _mm_and_si128(_mm_srli_epi16(input_a, 4), _mm_set1_epi8(0x0F));
65 |
66 | __m128i first_len_a = _mm_shuffle_epi8(first_len_tbl, high_nibbles);
67 |
68 | __m128i range_a = _mm_shuffle_epi8(first_range_tbl, high_nibbles);
69 |
70 | range_a = _mm_or_si128(
71 | range_a, _mm_alignr_epi8(first_len_a, prev_first_len, 15));
72 |
73 | __m128i tmp;
74 | tmp = _mm_alignr_epi8(first_len_a, prev_first_len, 14);
75 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(1));
76 | range_a = _mm_or_si128(range_a, tmp);
77 |
78 | tmp = _mm_alignr_epi8(first_len_a, prev_first_len, 13);
79 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(2));
80 | range_a = _mm_or_si128(range_a, tmp);
81 |
82 | __m128i shift1, pos, range2;
83 | shift1 = _mm_alignr_epi8(input_a, prev_input, 15);
84 | pos = _mm_sub_epi8(shift1, _mm_set1_epi8(0xEF));
85 | tmp = _mm_subs_epu8(pos, _mm_set1_epi8(0xF0));
86 | range2 = _mm_shuffle_epi8(df_ee_tbl, tmp);
87 | tmp = _mm_adds_epu8(pos, _mm_set1_epi8(0x70));
88 | range2 = _mm_add_epi8(range2, _mm_shuffle_epi8(ef_fe_tbl, tmp));
89 |
90 | range_a = _mm_add_epi8(range_a, range2);
91 |
92 | __m128i minv = _mm_shuffle_epi8(range_min_tbl, range_a);
93 | __m128i maxv = _mm_shuffle_epi8(range_max_tbl, range_a);
94 |
95 | tmp = _mm_or_si128(
96 | _mm_cmplt_epi8(input_a, minv),
97 | _mm_cmpgt_epi8(input_a, maxv)
98 | );
99 | error = _mm_or_si128(error, tmp);
100 |
101 | /***************************** block 2 ****************************/
102 | const __m128i input_b = _mm_loadu_si128((const __m128i *)(data+16));
103 |
104 | high_nibbles =
105 | _mm_and_si128(_mm_srli_epi16(input_b, 4), _mm_set1_epi8(0x0F));
106 |
107 | __m128i first_len_b = _mm_shuffle_epi8(first_len_tbl, high_nibbles);
108 |
109 | __m128i range_b = _mm_shuffle_epi8(first_range_tbl, high_nibbles);
110 |
111 | range_b = _mm_or_si128(
112 | range_b, _mm_alignr_epi8(first_len_b, first_len_a, 15));
113 |
114 |
115 | tmp = _mm_alignr_epi8(first_len_b, first_len_a, 14);
116 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(1));
117 | range_b = _mm_or_si128(range_b, tmp);
118 |
119 | tmp = _mm_alignr_epi8(first_len_b, first_len_a, 13);
120 | tmp = _mm_subs_epu8(tmp, _mm_set1_epi8(2));
121 | range_b = _mm_or_si128(range_b, tmp);
122 |
123 | shift1 = _mm_alignr_epi8(input_b, input_a, 15);
124 | pos = _mm_sub_epi8(shift1, _mm_set1_epi8(0xEF));
125 | tmp = _mm_subs_epu8(pos, _mm_set1_epi8(0xF0));
126 | range2 = _mm_shuffle_epi8(df_ee_tbl, tmp);
127 | tmp = _mm_adds_epu8(pos, _mm_set1_epi8(0x70));
128 | range2 = _mm_add_epi8(range2, _mm_shuffle_epi8(ef_fe_tbl, tmp));
129 |
130 | range_b = _mm_add_epi8(range_b, range2);
131 |
132 | minv = _mm_shuffle_epi8(range_min_tbl, range_b);
133 | maxv = _mm_shuffle_epi8(range_max_tbl, range_b);
134 |
135 |
136 | tmp = _mm_or_si128(
137 | _mm_cmplt_epi8(input_b, minv),
138 | _mm_cmpgt_epi8(input_b, maxv)
139 | );
140 | error = _mm_or_si128(error, tmp);
141 |
142 | /************************ next iteration **************************/
143 | prev_input = input_b;
144 | prev_first_len = first_len_b;
145 |
146 | data += 32;
147 | len -= 32;
148 | }
149 |
150 | if (!_mm_testz_si128(error, error))
151 | return -1;
152 |
153 | int32_t token4 = _mm_extract_epi32(prev_input, 3);
154 | const int8_t *token = (const int8_t *)&token4;
155 | int lookahead = 0;
156 | if (token[3] > (int8_t)0xBF)
157 | lookahead = 1;
158 | else if (token[2] > (int8_t)0xBF)
159 | lookahead = 2;
160 | else if (token[1] > (int8_t)0xBF)
161 | lookahead = 3;
162 |
163 | data -= lookahead;
164 | len += lookahead;
165 | }
166 |
167 | return utf8_naive(data, len);
168 | }
169 |
170 | #endif
171 |
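The SSE code adjusts the second-byte range with two 16-entry tables and saturating arithmetic, because `_mm_shuffle_epi8` only indexes 16 bytes. The sketch below emulates a single lane in plain C to show how `pos`, the saturating subtract and the saturating add select the right table. It assumes the usual `pshufb` lane rule (result is 0 when bit 7 of the index is set, otherwise the low 4 bits select the entry). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same contents as _df_ee_tbl and _ef_fe_tbl in range2-sse.c. */
static const uint8_t df_ee_tbl[16] = {
    0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
};
static const uint8_t ef_fe_tbl[16] = {
    0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};

/* One lane of _mm_shuffle_epi8. */
static uint8_t shuffle_lane(const uint8_t tbl[16], uint8_t idx)
{
    return (idx & 0x80) ? 0 : tbl[idx & 0x0F];
}

/* One lane of _mm_subs_epu8. */
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* One lane of _mm_adds_epu8. */
static uint8_t adds_u8(uint8_t a, uint8_t b)
{
    const unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Adjustment applied to the byte that follows prev. */
uint8_t sse_adjust_sketch(uint8_t prev)
{
    /* pos: EF~FE map to 0~15, DF~EE map to 240~255 */
    const uint8_t pos = (uint8_t)(prev - 0xEF);
    /* pos-240 is 0~15 for DF~EE and saturates to 0 elsewhere */
    uint8_t adj = shuffle_lane(df_ee_tbl, subs_u8(pos, 0xF0));
    /* pos+112 is 112~127 for EF~FE and >= 128 (lane zeroed) elsewhere */
    adj = (uint8_t)(adj + shuffle_lane(ef_fe_tbl, adds_u8(pos, 0x70)));
    return adj;
}

void sse_adjust_selfcheck(void)
{
    assert(sse_adjust_sketch(0xE0) == 2);
    assert(sse_adjust_sketch(0xED) == 3);
    assert(sse_adjust_sketch(0xF0) == 3);
    assert(sse_adjust_sketch(0xF4) == 4);
    assert(sse_adjust_sketch(0x41) == 0);
}
```

Bytes outside DF~FE land on entry 0 of `df_ee_tbl` or on a zeroed lane, so both lookups contribute 0 and the range index is left unchanged.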
--------------------------------------------------------------------------------
/utf8_to_utf16/.gitignore:
--------------------------------------------------------------------------------
1 | utf8to16
2 |
--------------------------------------------------------------------------------
/utf8_to_utf16/Makefile:
--------------------------------------------------------------------------------
1 | CC = gcc
2 | CPPFLAGS = -g -O3 -Wall -march=native
3 |
4 | OBJS = main.o iconv.o naive.o
5 |
6 | utf8to16: ${OBJS}
7 | gcc $^ -o $@
8 |
9 | .PHONY: clean
10 | clean:
11 | rm -f utf8to16 *.o
12 |
--------------------------------------------------------------------------------
/utf8_to_utf16/iconv.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <errno.h>
4 | #include <iconv.h>
5 |
6 | static iconv_t s_cd;
7 |
8 | /* Call iconv_open only once so the benchmark will be faster? */
9 | static void __attribute__ ((constructor)) init_iconv(void)
10 | {
11 | #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
12 | s_cd = iconv_open("UTF-16LE", "UTF-8");
13 | #else
14 | s_cd = iconv_open("UTF-16BE", "UTF-8");
15 | #endif
16 | if (s_cd == (iconv_t)-1) {
17 | perror("iconv_open");
18 | exit(1);
19 | }
20 | }
21 |
22 | /*
23 | * Parameters:
24 | * - buf8, len8: input utf-8 string
25 | * - buf16: buffer to store decoded utf-16 string
26 | * - *len16: on entry - utf-16 buffer length in bytes
27 | * on exit - length in bytes of valid decoded utf-16 string
28 | * Returns:
29 | * - 0: success
30 | * - >0: error position of input utf-8 string
31 | * - -1: utf-16 buffer overflow
32 | * LE/BE depends on host
33 | */
34 | int utf8_to16_iconv(const unsigned char *buf8, size_t len8,
35 | unsigned short *buf16, size_t *len16)
36 | {
37 | size_t ret, len16_save = *len16;
38 | const unsigned char *buf8_0 = buf8;
39 |
40 | ret = iconv(s_cd, (char **)&buf8, &len8, (char **)&buf16, len16);
41 |
42 | *len16 = len16_save - *len16;
43 |
44 | if (ret != (size_t)-1)
45 | return 0;
46 |
47 | if (errno == E2BIG)
48 | return -1; /* Output buffer full */
49 |
50 | return buf8 - buf8_0 + 1; /* EILSEQ, EINVAL, error position */
51 | }
52 |
--------------------------------------------------------------------------------
/utf8_to_utf16/main.c:
--------------------------------------------------------------------------------
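1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <string.h>
4 | #include <stdint.h>
5 | #include <sys/types.h>
6 | #include <sys/stat.h>
7 | #include <sys/time.h>
8 | #include <fcntl.h>
9 | #include <unistd.h>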
10 |
11 | int utf8_to16_iconv(const unsigned char *buf8, size_t len8,
12 | unsigned short *buf16, size_t *len16);
13 | int utf8_to16_naive(const unsigned char *buf8, size_t len8,
14 | unsigned short *buf16, size_t *len16);
15 |
16 | static struct ftab {
17 | const char *name;
18 | int (*func)(const unsigned char *buf8, size_t len8,
19 | unsigned short *buf16, size_t *len16);
20 | } ftab[] = {
21 | {
22 | .name = "iconv",
23 | .func = utf8_to16_iconv,
24 | }, {
25 | .name = "naive",
26 | .func = utf8_to16_naive,
27 | },
28 | };
29 |
30 | static unsigned char *load_test_buf(int len)
31 | {
32 | const char utf8[] = "\xF0\x90\xBF\x80";
33 | const int utf8_len = sizeof(utf8)/sizeof(utf8[0]) - 1;
34 |
35 | unsigned char *data = malloc(len);
36 | unsigned char *p = data;
37 |
38 | while (len >= utf8_len) {
39 | memcpy(p, utf8, utf8_len);
40 | p += utf8_len;
41 | len -= utf8_len;
42 | }
43 |
44 | while (len--)
45 | *p++ = 0x7F;
46 |
47 | return data;
48 | }
49 |
50 | static unsigned char *load_test_file(int *len)
51 | {
52 | unsigned char *data;
53 | int fd;
54 | struct stat stat;
55 |
56 | fd = open("../UTF-8-demo.txt", O_RDONLY);
57 | if (fd == -1) {
58 | printf("Failed to open ../UTF-8-demo.txt!\n");
59 | exit(1);
60 | }
61 | if (fstat(fd, &stat) == -1) {
62 | printf("Failed to get file size!\n");
63 | exit(1);
64 | }
65 |
66 | *len = stat.st_size;
67 | data = malloc(*len);
68 | if (read(fd, data, *len) != *len) {
69 | printf("Failed to read file!\n");
70 | exit(1);
71 | }
72 |
73 | close(fd);
74 |
75 | return data;
76 | }
77 |
78 | static void print_test(const unsigned char *data, int len)
79 | {
80 | printf(" [len=%d] \"", len);
81 | while (len--)
82 | printf("\\x%02X", *data++);
83 |
84 | printf("\"\n");
85 | }
86 |
87 | struct test {
88 | const unsigned char *data;
89 | int len;
90 | };
91 |
92 | static void prepare_test_buf(unsigned char *buf, const struct test *pos,
93 | int pos_len, int pos_idx)
94 | {
95 | /* Concatenate valid tokens round-robin to fill 1024 bytes; zero-pad the tail when the next token does not fit */
96 | int buf_idx = 0;
97 | while (buf_idx < 1024) {
98 | int buf_len = 1024 - buf_idx;
99 |
100 | if (buf_len >= pos[pos_idx].len) {
101 | memcpy(buf+buf_idx, pos[pos_idx].data, pos[pos_idx].len);
102 | buf_idx += pos[pos_idx].len;
103 | } else {
104 | memset(buf+buf_idx, 0, buf_len);
105 | buf_idx += buf_len;
106 | }
107 |
108 | if (++pos_idx == pos_len)
109 | pos_idx = 0;
110 | }
111 | }
112 |
113 | /* Return 0 on success, -1 on error */
114 | static int test_manual(const struct ftab *ftab, unsigned short *buf16,
115 | unsigned short *_buf16)
116 | {
117 | #define LEN16 4096
118 |
119 | #pragma GCC diagnostic push
120 | #pragma GCC diagnostic ignored "-Wpointer-sign"
121 | /* positive tests */
122 | static const struct test pos[] = {
123 | {"", 0},
124 | {"\x00", 1},
125 | {"\x66", 1},
126 | {"\x7F", 1},
127 | {"\x00\x7F", 2},
128 | {"\x7F\x00", 2},
129 | {"\xC2\x80", 2},
130 | {"\xDF\xBF", 2},
131 | {"\xE0\xA0\x80", 3},
132 | {"\xE0\xA0\xBF", 3},
133 | {"\xED\x9F\x80", 3},
134 | {"\xEF\x80\xBF", 3},
135 | {"\xF0\x90\xBF\x80", 4},
136 | {"\xF2\x81\xBE\x99", 4},
137 | {"\xF4\x8F\x88\xAA", 4},
138 | };
139 |
140 | /* negative tests */
141 | static const struct test neg[] = {
142 | {"\x80", 1},
143 | {"\xBF", 1},
144 | {"\xC0\x80", 2},
145 | {"\xC1\x00", 2},
146 | {"\xC2\x7F", 2},
147 | {"\xDF\xC0", 2},
148 | {"\xE0\x9F\x80", 3},
149 | {"\xE0\xC2\x80", 3},
150 | {"\xED\xA0\x80", 3},
151 | {"\xED\x7F\x80", 3},
152 | {"\xEF\x80\x00", 3},
153 | {"\xF0\x8F\x80\x80", 4},
154 | {"\xF0\xEE\x80\x80", 4},
155 | {"\xF2\x90\x91\x7F", 4},
156 | {"\xF4\x90\x88\xAA", 4},
157 | {"\xF4\x00\xBF\xBF", 4},
158 | {"\x00\x00\x00\x00\x00\xC2\x80\x00\x00\x00\xE1\x80\x80\x00\x00\xC2" \
159 | "\xC2\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00",
160 | 32},
161 | {"\x00\x00\x00\x00\x00\xC2\xC2\x80\x00\x00\xE1\x80\x80\x00\x00\x00",
162 | 16},
163 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
164 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80",
165 | 32},
166 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
167 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1",
168 | 32},
169 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
170 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \
171 | "\x80", 33},
172 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
173 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF1\x80" \
174 | "\xC2\x80", 34},
175 | {"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" \
176 | "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xF0" \
177 | "\x80\x80\x80", 35},
178 | };
179 | #pragma GCC diagnostic pop
180 |
181 | size_t len16 = LEN16, _len16 = LEN16;
182 | int ret, _ret;
183 |
184 | /* Test single token */
185 | for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) {
186 | ret = ftab->func(pos[i].data, pos[i].len, buf16, &len16);
187 | _ret = utf8_to16_iconv(pos[i].data, pos[i].len, _buf16, &_len16);
188 | if (ret != _ret || len16 != _len16 || memcmp(buf16, _buf16, len16)) {
189 | printf("FAILED positive test(%d:%d, %lu:%lu): ",
190 | ret, _ret, len16, _len16);
191 | print_test(pos[i].data, pos[i].len);
192 | return -1;
193 | }
194 | len16 = _len16 = LEN16;
195 | }
196 | for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) {
197 | ret = ftab->func(neg[i].data, neg[i].len, buf16, &len16);
198 | _ret = utf8_to16_iconv(neg[i].data, neg[i].len, _buf16, &_len16);
199 | if (ret != _ret || len16 != _len16 || memcmp(buf16, _buf16, len16)) {
200 | printf("FAILED negative test(%d:%d, %lu:%lu): ",
201 | ret, _ret, len16, _len16);
202 | print_test(neg[i].data, neg[i].len);
203 | return -1;
204 | }
205 | len16 = _len16 = LEN16;
206 | }
207 |
208 | /* Test shifted buffer to cover 1k length */
209 | /* buffer size must be greater than 1024 + 16 + max(test string length) */
210 | const int max_size = 1024*2;
211 | uint64_t buf64[max_size/8 + 2];
212 | /* Misalign the buffer: start 1 byte past an 8-byte boundary */
213 | unsigned char *buf = ((unsigned char *)buf64) + 1;
214 | int buf_len;
215 |
216 | for (int i = 0; i < sizeof(pos)/sizeof(pos[0]); ++i) {
217 | /* Positive test: shift 16 bytes, validate each shift */
218 | prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), i);
219 | buf_len = 1024;
220 | for (int j = 0; j < 16; ++j) {
221 | ret = ftab->func(buf, buf_len, buf16, &len16);
222 | _ret = utf8_to16_iconv(buf, buf_len, _buf16, &_len16);
223 | if (ret != _ret || len16 != _len16 || \
224 | memcmp(buf16, _buf16, len16)) {
225 | printf("FAILED positive test(%d:%d, %lu:%lu): ",
226 | ret, _ret, len16, _len16);
227 | print_test(buf, buf_len);
228 | return -1;
229 | }
230 | len16 = _len16 = LEN16;
231 | for (int k = buf_len; k >= 1; --k)
232 | buf[k] = buf[k-1];
233 | buf[0] = '\x55';
234 | ++buf_len;
235 | }
236 |
237 | /* Negative test: truncate the last non-ASCII byte */
238 | while (buf_len >= 1 && buf[buf_len-1] <= 0x7F)
239 | --buf_len;
240 | if (buf_len) {
241 | ret = ftab->func(buf, buf_len-1, buf16, &len16);
242 | _ret = utf8_to16_iconv(buf, buf_len-1, _buf16, &_len16);
243 | if (ret != _ret || len16 != _len16 || \
244 | memcmp(buf16, _buf16, len16)) {
245 | printf("FAILED negative test(%d:%d, %lu:%lu): ",
246 | ret, _ret, len16, _len16);
247 | print_test(buf, buf_len-1);
248 | return -1;
249 | }
250 | len16 = _len16 = LEN16;
251 | }
252 | }
253 |
254 | /* Negative test */
255 | for (int i = 0; i < sizeof(neg)/sizeof(neg[0]); ++i) {
256 | /* Append one error token, shift 16 bytes, validate each shift */
257 | int pos_idx = i % (sizeof(pos)/sizeof(pos[0]));
258 | prepare_test_buf(buf, pos, sizeof(pos)/sizeof(pos[0]), pos_idx);
259 | memcpy(buf+1024, neg[i].data, neg[i].len);
260 | buf_len = 1024 + neg[i].len;
261 | for (int j = 0; j < 16; ++j) {
262 | ret = ftab->func(buf, buf_len, buf16, &len16);
263 | _ret = utf8_to16_iconv(buf, buf_len, _buf16, &_len16);
264 | if (ret != _ret || len16 != _len16 || \
265 | memcmp(buf16, _buf16, len16)) {
266 | printf("FAILED negative test(%d:%d, %lu:%lu): ",
267 | ret, _ret, len16, _len16);
268 | print_test(buf, buf_len);
269 | return -1;
270 | }
271 | len16 = _len16 = LEN16;
272 | for (int k = buf_len; k >= 1; --k)
273 | buf[k] = buf[k-1];
274 | buf[0] = '\x66';
275 | ++buf_len;
276 | }
277 | }
278 |
279 | return 0;
280 | }
281 |
282 | static void test(const unsigned char *buf8, size_t len8,
283 | unsigned short *buf16, size_t len16, const struct ftab *ftab)
284 | {
285 | /* Use iconv as the reference answer */
286 | if (strcmp(ftab->name, "iconv") == 0)
287 | return;
288 |
289 | printf("%s\n", ftab->name);
290 |
291 | /* Test file or buffer */
292 | size_t _len16 = len16;
293 | unsigned short *_buf16 = (unsigned short *)malloc(_len16);
294 | if (utf8_to16_iconv(buf8, len8, _buf16, &_len16)) {
295 | printf("Invalid test file or buffer!\n");
296 | exit(1);
297 | }
298 | printf("standard test: ");
299 | if (ftab->func(buf8, len8, buf16, &len16) || len16 != _len16 || \
300 | memcmp(buf16, _buf16, len16) != 0)
301 | printf("FAIL\n");
302 | else
303 | printf("pass\n");
304 | free(_buf16);
305 |
306 | /* Manual cases */
307 | unsigned short *mbuf8 = (unsigned short *)malloc(LEN16);
308 | unsigned short *mbuf16 = (unsigned short *)malloc(LEN16);
309 | printf("manual test: %s\n",
310 | test_manual(ftab, mbuf8, mbuf16) ? "FAIL" : "pass");
311 | free(mbuf8);
312 | free(mbuf16);
313 | printf("\n");
314 | }
315 |
316 | static void bench(const unsigned char *buf8, size_t len8,
317 | unsigned short *buf16, size_t len16, const struct ftab *ftab)
318 | {
319 | const int loops = 1024*1024*1024/len8;
320 | int ret = 0;
321 | double time, size;
322 | struct timeval tv1, tv2;
323 |
324 | fprintf(stderr, "bench %s... ", ftab->name);
325 | gettimeofday(&tv1, 0);
326 | for (int i = 0; i < loops; ++i)
327 | ret |= ftab->func(buf8, len8, buf16, &len16);
328 | gettimeofday(&tv2, 0);
329 | printf("%s\n", ret?"FAIL":"pass");
330 |
331 | time = tv2.tv_usec - tv1.tv_usec;
332 | time = time / 1000000 + tv2.tv_sec - tv1.tv_sec;
333 | size = ((double)len8 * loops) / (1024*1024);
334 | printf("time: %.4f s\n", time);
335 | printf("data: %.0f MB\n", size);
336 | printf("BW: %.2f MB/s\n", size / time);
337 | printf("\n");
338 | }
339 |
340 | static void usage(const char *bin)
341 | {
342 | printf("Usage:\n");
343 | printf("%s test [alg] ==> test all or one algorithm\n", bin);
344 | printf("%s bench [alg] ==> benchmark all or one algorithm\n", bin);
345 | printf("%s bench size NUM ==> benchmark with specific buffer size\n", bin);
346 | printf("alg = ");
347 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i)
348 | printf("%s ", ftab[i].name);
349 | printf("\nNUM = buffer size in bytes, 1 ~ 67108864(64M)\n");
350 | }
351 |
352 | int main(int argc, char *argv[])
353 | {
354 | int len8 = 0, len16;
355 | unsigned char *buf8;
356 | unsigned short *buf16;
357 | const char *alg = NULL;
358 | void (*tb)(const unsigned char *buf8, size_t len8,
359 | unsigned short *buf16, size_t len16, const struct ftab *ftab);
360 |
361 | tb = NULL;
362 | if (argc >= 2) {
363 | if (strcmp(argv[1], "test") == 0)
364 | tb = test;
365 | else if (strcmp(argv[1], "bench") == 0)
366 | tb = bench;
367 | if (argc >= 3) {
368 | alg = argv[2];
369 | if (strcmp(alg, "size") == 0) {
370 | if (argc < 4) {
371 | tb = NULL;
372 | } else {
373 | alg = NULL;
374 | len8 = atoi(argv[3]);
375 | if (len8 <= 0 || len8 > 67108864) {
376 | printf("Buffer size error!\n\n");
377 | tb = NULL;
378 | }
379 | }
380 | }
381 | }
382 | }
383 |
384 | if (tb == NULL) {
385 | usage(argv[0]);
386 | return 1;
387 | }
388 |
389 | /* Load UTF8 test buffer */
390 | if (len8)
391 | buf8 = load_test_buf(len8);
392 | else
393 | buf8 = load_test_file(&len8);
394 |
395 | /* Prepare UTF16 buffer large enough */
396 | len16 = len8 * 2;
397 | buf16 = (unsigned short *)malloc(len16);
398 |
399 | if (tb == bench)
400 | printf("============== Bench UTF8 (%d bytes) ==============\n", len8);
401 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) {
402 | if (alg && strcmp(alg, ftab[i].name) != 0)
403 | continue;
404 | tb((const unsigned char *)buf8, len8, buf16, len16, &ftab[i]);
405 | }
406 |
407 | #if 0
408 | if (tb == bench) {
409 | printf("==================== Bench ASCII ====================\n");
410 | /* Change test buffer to ascii */
411 | for (int i = 0; i < len; i++)
412 | data[i] &= 0x7F;
413 |
414 | for (int i = 0; i < sizeof(ftab)/sizeof(ftab[0]); ++i) {
415 | if (alg && strcmp(alg, ftab[i].name) != 0)
416 | continue;
417 | tb((const unsigned char *)data, len, &ftab[i]);
418 | printf("\n");
419 | }
420 | }
421 | #endif
422 |
423 | return 0;
424 | }
425 |
--------------------------------------------------------------------------------
/utf8_to_utf16/naive.c:
--------------------------------------------------------------------------------
1 | #include <stddef.h>
2 |
3 | /*
4 | * UTF-8 to UTF-16
5 | * Table from https://woboq.com/blog/utf-8-processing-using-simd.html
6 | *
7 | * +-------------------------------------+-------------------+
8 | * | UTF-8 | UTF-16LE (HI LO) |
9 | * +-------------------------------------+-------------------+
10 | * | 0aaaaaaa | 00000000 0aaaaaaa |
11 | * +-------------------------------------+-------------------+
12 | * | 110bbbbb 10aaaaaa | 00000bbb bbaaaaaa |
13 | * +-------------------------------------+-------------------+
14 | * | 1110cccc 10bbbbbb 10aaaaaa | ccccbbbb bbaaaaaa |
15 | * +-------------------------------------+-------------------+
16 | * | 11110ddd 10ddcccc 10bbbbbb 10aaaaaa | 110110uu uuccccbb |
17 | * + uuuu = ddddd - 1 | 110111bb bbaaaaaa |
18 | * +-------------------------------------+-------------------+
19 | */
20 |
21 | /*
22 | * Parameters:
23 | * - buf8, len8: input utf-8 string
24 | * - buf16: buffer to store decoded utf-16 string
25 | * - *len16: on entry - utf-16 buffer length in bytes
26 | * on exit - length in bytes of valid decoded utf-16 string
27 | * Returns:
28 | * - 0: success
29 | * - >0: error position of input utf-8 string
30 | * - -1: utf-16 buffer overflow
31 | * LE/BE depends on host
32 | */
33 | int utf8_to16_naive(const unsigned char *buf8, size_t len8,
34 | unsigned short *buf16, size_t *len16)
35 | {
36 | int err_pos = 1;
37 | size_t len16_left = *len16;
38 |
39 | *len16 = 0;
40 |
41 | while (len8) {
42 | unsigned char b0, b1, b2, b3;
43 | unsigned int u;
44 |
45 | /* Output buffer full */
46 | if (len16_left < 2)
47 | return -1;
48 |
49 | /* 1st byte */
50 | b0 = buf8[0];
51 |
52 | if ((b0 & 0x80) == 0) {
53 | /* 0aaaaaaa -> 00000000 0aaaaaaa */
54 | *buf16++ = b0;
55 | ++buf8;
56 | --len8;
57 | ++err_pos;
58 | *len16 += 2;
59 | len16_left -= 2;
60 | continue;
61 | }
62 |
63 | /* Character length */
64 | size_t clen = b0 & 0xF0;
65 | clen >>= 4; /* 10xx, 110x, 1110, 1111 */
66 | clen -= 12; /* -4~-1, 0/1, 2, 3 */
67 | clen += !clen; /* -4~-1, 1, 2, 3 */
68 |
69 | /* String too short or invalid 1st byte (10xxxxxx) */
70 | if (len8 <= clen)
71 | return err_pos;
72 |
73 | /* Trailing bytes must be within 0x80 ~ 0xBF */
74 | b1 = buf8[1];
75 | if ((signed char)b1 >= (signed char)0xC0)
76 | return err_pos;
77 | b1 &= 0x3F;
78 |
79 | ++clen;
80 | if (clen == 2) {
81 | u = b0 & 0x1F;
82 | u <<= 6;
83 | u |= b1;
84 | if (u <= 0x7F)
85 | return err_pos;
86 | *buf16++ = u;
87 | } else {
88 | b2 = buf8[2];
89 | if ((signed char)b2 >= (signed char)0xC0)
90 | return err_pos;
91 | b2 &= 0x3F;
92 | if (clen == 3) {
93 | u = b0 & 0x0F;
94 | u <<= 6;
95 | u |= b1;
96 | u <<= 6;
97 | u |= b2;
98 | if (u <= 0x7FF || (u >= 0xD800 && u <= 0xDFFF))
99 | return err_pos;
100 | *buf16++ = u;
101 | } else {
102 | /* clen == 4 */
103 | if (len16_left < 4)
104 | return -1; /* Output buffer full */
105 | b3 = buf8[3];
106 | if ((signed char)b3 >= (signed char)0xC0)
107 | return err_pos;
108 | u = b0 & 0x07;
109 | u <<= 6;
110 | u |= b1;
111 | u <<= 6;
112 | u |= b2;
113 | u <<= 6;
114 | u |= (b3 & 0x3F);
115 | if (u <= 0xFFFF || u > 0x10FFFF)
116 | return err_pos;
117 | u -= 0x10000;
118 | *buf16++ = (((u >> 10) & 0x3FF) | 0xD800);
119 | *buf16++ = ((u & 0x3FF) | 0xDC00);
120 | *len16 += 2;
121 | len16_left -= 2;
122 | }
123 | }
124 |
125 | buf8 += clen;
126 | len8 -= clen;
127 | err_pos += clen;
128 | *len16 += 2;
129 | len16_left -= 2;
130 | }
131 |
132 | return 0;
133 | }
134 |
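The 4-byte branch above turns a code point beyond U+FFFF into a UTF-16 surrogate pair. As a small worked example of that arithmetic, the sketch below decodes the 4-byte sequence used by the test buffer (`F0 90 BF 80`) and splits it into high and low surrogates. The helper names `decode4_sketch` and `to_surrogates_sketch` are hypothetical, and the decoder assumes its input is already known to be a well-formed 4-byte sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one well-formed 4-byte UTF-8 sequence (no validation here). */
uint32_t decode4_sketch(const unsigned char b[4])
{
    return ((uint32_t)(b[0] & 0x07) << 18) |
           ((uint32_t)(b[1] & 0x3F) << 12) |
           ((uint32_t)(b[2] & 0x3F) << 6) |
            (uint32_t)(b[3] & 0x3F);
}

/* Split a code point in 0x10000~0x10FFFF into a surrogate pair. */
void to_surrogates_sketch(uint32_t cp, uint16_t out[2])
{
    const uint32_t u = cp - 0x10000;           /* 20 significant bits */

    out[0] = (uint16_t)(0xD800 | (u >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (u & 0x3FF)); /* low surrogate: low 10 bits */
}

void surrogate_selfcheck(void)
{
    const unsigned char s[4] = { 0xF0, 0x90, 0xBF, 0x80 };
    uint16_t out[2];

    /* F0 90 BF 80 decodes to U+10FC0, which encodes as D803 DFC0 */
    assert(decode4_sketch(s) == 0x10FC0);
    to_surrogates_sketch(decode4_sketch(s), out);
    assert(out[0] == 0xD803 && out[1] == 0xDFC0);
}
```

Subtracting 0x10000 first is what keeps the high surrogate inside D800~DBFF; without it the largest code point, U+10FFFF, would overflow the 10-bit field.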
--------------------------------------------------------------------------------