├── README.md ├── LICENSE └── utf8.lua /README.md: -------------------------------------------------------------------------------- 1 | # Lua 5.1 UTF-8 2 | Requires a global "bit" library, such as LuaJIT 2.0.3's. This is only tested under LuaJIT 2.0.3. 3 | 4 | All functionality is documented under Lua 5.3's documentation for the "utf8" library with the exception of utf8.force, which replaces all invalid UTF-8 sequences with the Unicode "Replacement Character" (U+FFFD). 5 | 6 | http://www.lua.org/manual/5.3/manual.html 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | CC0 1.0 Universal 2 | 3 | Statement of Purpose 4 | 5 | The laws of most jurisdictions throughout the world automatically confer 6 | exclusive Copyright and Related Rights (defined below) upon the creator and 7 | subsequent owner(s) (each and all, an "owner") of an original work of 8 | authorship and/or a database (each, a "Work"). 9 | 10 | Certain owners wish to permanently relinquish those rights to a Work for the 11 | purpose of contributing to a commons of creative, cultural and scientific 12 | works ("Commons") that the public can reliably and without fear of later 13 | claims of infringement build upon, modify, incorporate in other works, reuse 14 | and redistribute as freely as possible in any form whatsoever and for any 15 | purposes, including without limitation commercial purposes. These owners may 16 | contribute to the Commons to promote the ideal of a free culture and the 17 | further production of creative, cultural and scientific works, or to gain 18 | reputation or greater distribution for their Work in part through the use and 19 | efforts of others. 
20 | 21 | For these and/or other purposes and motivations, and without any expectation 22 | of additional consideration or compensation, the person associating CC0 with a 23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work 25 | and publicly distribute the Work under its terms, with knowledge of his or her 26 | Copyright and Related Rights in the Work and the meaning and intended legal 27 | effect of CC0 on those rights. 28 | 29 | 1. Copyright and Related Rights. A Work made available under CC0 may be 30 | protected by copyright and related or neighboring rights ("Copyright and 31 | Related Rights"). Copyright and Related Rights include, but are not limited 32 | to, the following: 33 | 34 | i. the right to reproduce, adapt, distribute, perform, display, communicate, 35 | and translate a Work; 36 | 37 | ii. moral rights retained by the original author(s) and/or performer(s); 38 | 39 | iii. publicity and privacy rights pertaining to a person's image or likeness 40 | depicted in a Work; 41 | 42 | iv. rights protecting against unfair competition in regards to a Work, 43 | subject to the limitations in paragraph 4(a), below; 44 | 45 | v. rights protecting the extraction, dissemination, use and reuse of data in 46 | a Work; 47 | 48 | vi. database rights (such as those arising under Directive 96/9/EC of the 49 | European Parliament and of the Council of 11 March 1996 on the legal 50 | protection of databases, and under any national implementation thereof, 51 | including any amended or successor version of such directive); and 52 | 53 | vii. other similar, equivalent or corresponding rights throughout the world 54 | based on applicable law or treaty, and any national implementations thereof. 55 | 56 | 2. Waiver. 
To the greatest extent permitted by, but not in contravention of, 57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and 58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright 59 | and Related Rights and associated claims and causes of action, whether now 60 | known or unknown (including existing as well as future claims and causes of 61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum 62 | duration provided by applicable law or treaty (including future time 63 | extensions), (iii) in any current or future medium and for any number of 64 | copies, and (iv) for any purpose whatsoever, including without limitation 65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 66 | the Waiver for the benefit of each member of the public at large and to the 67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 68 | shall not be subject to revocation, rescission, cancellation, termination, or 69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 70 | by the public as contemplated by Affirmer's express Statement of Purpose. 71 | 72 | 3. Public License Fallback. Should any part of the Waiver for any reason be 73 | judged legally invalid or ineffective under applicable law, then the Waiver 74 | shall be preserved to the maximum extent permitted taking into account 75 | Affirmer's express Statement of Purpose. 
In addition, to the extent the Waiver 76 | is so judged Affirmer hereby grants to each affected person a royalty-free, 77 | non transferable, non sublicensable, non exclusive, irrevocable and 78 | unconditional license to exercise Affirmer's Copyright and Related Rights in 79 | the Work (i) in all territories worldwide, (ii) for the maximum duration 80 | provided by applicable law or treaty (including future time extensions), (iii) 81 | in any current or future medium and for any number of copies, and (iv) for any 82 | purpose whatsoever, including without limitation commercial, advertising or 83 | promotional purposes (the "License"). The License shall be deemed effective as 84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the 85 | License for any reason be judged legally invalid or ineffective under 86 | applicable law, such partial invalidity or ineffectiveness shall not 87 | invalidate the remainder of the License, and in such case Affirmer hereby 88 | affirms that he or she will not (i) exercise any of his or her remaining 89 | Copyright and Related Rights in the Work or (ii) assert any associated claims 90 | and causes of action with respect to the Work, in either case contrary to 91 | Affirmer's express Statement of Purpose. 92 | 93 | 4. Limitations and Disclaimers. 94 | 95 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 96 | surrendered, licensed or otherwise affected by this document. 97 | 98 | b. Affirmer offers the Work as-is and makes no representations or warranties 99 | of any kind concerning the Work, express, implied, statutory or otherwise, 100 | including without limitation warranties of title, merchantability, fitness 101 | for a particular purpose, non infringement, or the absence of latent or 102 | other defects, accuracy, or the present or absence of errors, whether or not 103 | discoverable, all to the greatest extent permissible under applicable law. 104 | 105 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 106 | that may apply to the Work or any use thereof, including without limitation 107 | any person's Copyright and Related Rights in the Work. Further, Affirmer 108 | disclaims responsibility for obtaining any necessary consents, permissions 109 | or other rights required for any use of the Work. 110 | 111 | d. Affirmer understands and acknowledges that Creative Commons is not a 112 | party to this document and has no duty or obligation with respect to this 113 | CC0 or use of the Work. 114 | 115 | For more information, please see 116 | 117 | 118 | -------------------------------------------------------------------------------- /utf8.lua: -------------------------------------------------------------------------------- 1 | local bit = bit 2 | local error = error 3 | local ipairs = ipairs 4 | local string = string 5 | local table = table 6 | local unpack = unpack 7 | 8 | module( "utf8" ) 9 | 10 | -- 11 | -- Pattern that can be used with the string library to match a single UTF-8 byte-sequence. 12 | -- This expects the string to contain valid UTF-8 data. 13 | -- 14 | charpattern = "[%z\x01-\x7F\xC2-\xF4][\x80-\xBF]*" 15 | 16 | -- 17 | -- Transforms indexes of a string to be positive. 18 | -- Negative indices will wrap around like the string library's functions. 19 | -- 20 | local function strRelToAbs( str, ... ) 21 | 22 | local args = { ... 
} 23 | 24 | for k, v in ipairs( args ) do 25 | v = v > 0 and v or #str + v + 1 26 | 27 | if v < 1 or v > #str then 28 | error( "bad index to string (out of range)", 3 ) 29 | end 30 | 31 | args[ k ] = v 32 | end 33 | 34 | return unpack( args ) 35 | 36 | end 37 | 38 | -- Validates a single UTF-8 byte-sequence in a string, starting at startPos 39 | -- Returns the byte-index of the first and last byte of the sequence, or nil if it is invalid 40 | -- 41 | local function decode( str, startPos ) 42 | 43 | startPos = strRelToAbs( str, startPos or 1 ) 44 | 45 | local b1 = str:byte( startPos, startPos ) 46 | 47 | -- Single-byte sequence 48 | if b1 < 0x80 then 49 | return startPos, startPos 50 | end 51 | 52 | -- Validate first byte of multi-byte sequence 53 | if b1 > 0xF4 or b1 < 0xC2 then 54 | return nil 55 | end 56 | 57 | -- Get 'supposed' amount of continuation bytes from primary byte 58 | local contByteCount = b1 >= 0xF0 and 3 or 59 | b1 >= 0xE0 and 2 or 60 | b1 >= 0xC0 and 1 61 | 62 | local endPos = startPos + contByteCount 63 | if endPos > #str then return nil end -- Reject sequences truncated by the end of the string 64 | -- Validate our continuation bytes 65 | for _, bX in ipairs { str:byte( startPos + 1, endPos ) } do 66 | 67 | if bit.band( bX, 0xC0 ) ~= 0x80 then 68 | return nil 69 | end 70 | 71 | end 72 | 73 | return startPos, endPos 74 | 75 | end 76 | 77 | -- 78 | -- Takes zero or more integers and returns a string containing the UTF-8 representation of each 79 | -- 80 | function char( ... ) 81 | 82 | local buf = {} 83 | 84 | for k, v in ipairs { ... } do 85 | 86 | if v < 0 or v > 0x10FFFF then 87 | error( "bad argument #" .. k .. 
" to char (out of range)", 2 ) 88 | end 89 | 90 | local b1, b2, b3, b4 = nil, nil, nil, nil 91 | 92 | if v < 0x80 then -- Single-byte sequence 93 | 94 | table.insert( buf, string.char( v ) ) 95 | 96 | elseif v < 0x800 then -- Two-byte sequence 97 | 98 | b1 = bit.bor( 0xC0, bit.band( bit.rshift( v, 6 ), 0x1F ) ) 99 | b2 = bit.bor( 0x80, bit.band( v, 0x3F ) ) 100 | 101 | table.insert( buf, string.char( b1, b2 ) ) 102 | 103 | elseif v < 0x10000 then -- Three-byte sequence 104 | 105 | b1 = bit.bor( 0xE0, bit.band( bit.rshift( v, 12 ), 0x0F ) ) 106 | b2 = bit.bor( 0x80, bit.band( bit.rshift( v, 6 ), 0x3F ) ) 107 | b3 = bit.bor( 0x80, bit.band( v, 0x3F ) ) 108 | 109 | table.insert( buf, string.char( b1, b2, b3 ) ) 110 | 111 | else -- Four-byte sequence 112 | 113 | b1 = bit.bor( 0xF0, bit.band( bit.rshift( v, 18 ), 0x07 ) ) 114 | b2 = bit.bor( 0x80, bit.band( bit.rshift( v, 12 ), 0x3F ) ) 115 | b3 = bit.bor( 0x80, bit.band( bit.rshift( v, 6 ), 0x3F ) ) 116 | b4 = bit.bor( 0x80, bit.band( v, 0x3F ) ) 117 | 118 | table.insert( buf, string.char( b1, b2, b3, b4 ) ) 119 | 120 | end 121 | 122 | end 123 | 124 | return table.concat( buf, "" ) 125 | 126 | end 127 | 128 | -- 129 | -- Iterates over a UTF-8 string similarly to pairs 130 | -- k = index of sequence, v = string value of sequence 131 | -- 132 | function codes( str ) 133 | 134 | local i = 1 135 | 136 | return function() 137 | 138 | -- Have we hit the end of the iteration set? 
139 | if i > #str then 140 | return nil 141 | end 142 | 143 | local startPos, endPos = decode( str, i ) 144 | 145 | if not startPos then 146 | error( "invalid UTF-8 code", 2 ) 147 | end 148 | 149 | i = endPos + 1 150 | 151 | return startPos, str:sub( startPos, endPos ) 152 | 153 | end 154 | 155 | end 156 | 157 | -- 158 | -- Returns an integer-representation of the UTF-8 sequence(s) in a string 159 | -- startPos defaults to 1, endPos defaults to startPos 160 | -- 161 | function codepoint( str, startPos, endPos ) 162 | 163 | startPos, endPos = strRelToAbs( str, startPos or 1, endPos or startPos or 1 ) 164 | 165 | local ret = {} 166 | 167 | repeat 168 | local seqStartPos, seqEndPos = decode( str, startPos ) 169 | 170 | if not seqStartPos then 171 | error( "invalid UTF-8 code", 2 ) 172 | end 173 | 174 | -- Increment current string index 175 | startPos = seqEndPos + 1 176 | 177 | -- Amount of bytes making up our sequence 178 | local len = seqEndPos - seqStartPos + 1 179 | 180 | if len == 1 then -- Single-byte codepoint 181 | 182 | table.insert( ret, str:byte( seqStartPos ) ) 183 | 184 | else -- Multi-byte codepoint 185 | 186 | local b1 = str:byte( seqStartPos ) 187 | local cp = 0 188 | 189 | for i = seqStartPos + 1, seqEndPos do 190 | 191 | local bX = str:byte( i ) 192 | 193 | cp = bit.bor( bit.lshift( cp, 6 ), bit.band( bX, 0x3F ) ) 194 | b1 = bit.lshift( b1, 1 ) 195 | 196 | end 197 | 198 | cp = bit.bor( cp, bit.lshift( bit.band( b1, 0x7F ), ( len - 1 ) * 5 ) ) 199 | 200 | table.insert( ret, cp ) 201 | 202 | end 203 | until seqEndPos >= endPos 204 | 205 | return unpack( ret ) 206 | 207 | end 208 | 209 | -- 210 | -- Returns the length of a UTF-8 string. 
false, index is returned if an invalid sequence is hit 211 | -- startPos defaults to 1, endPos defaults to -1 212 | -- 213 | function len( str, startPos, endPos ) 214 | 215 | startPos, endPos = strRelToAbs( str, startPos or 1, endPos or -1 ) 216 | 217 | local len = 0 218 | 219 | repeat 220 | local seqStartPos, seqEndPos = decode( str, startPos ) 221 | 222 | -- Hit an invalid sequence? 223 | if not seqStartPos then 224 | return false, startPos 225 | end 226 | 227 | -- Increment current string pointer 228 | startPos = seqEndPos + 1 229 | 230 | -- Increment length 231 | len = len + 1 232 | until seqEndPos >= endPos 233 | 234 | return len 235 | 236 | end 237 | 238 | -- 239 | -- Returns the byte-index of the n'th UTF-8-character after the given byte-index (nil if none) 240 | -- startPos defaults to 1 when n is non-negative and -1 when n is negative 241 | -- If n is zero, this function instead returns the byte-index of the UTF-8-character startPos lies within. 242 | -- 243 | function offset( str, n, startPos ) 244 | 245 | startPos = strRelToAbs( str, startPos or ( n >= 0 and 1 ) or #str ) 246 | 247 | -- Find the beginning of the sequence over startPos 248 | if n == 0 then 249 | 250 | for i = startPos, 1, -1 do 251 | local seqStartPos, seqEndPos = decode( str, i ) 252 | 253 | if seqStartPos then 254 | return seqStartPos 255 | end 256 | end 257 | 258 | return nil 259 | 260 | end 261 | 262 | if not decode( str, startPos ) then 263 | error( "initial position is not beginning of a valid sequence", 2 ) 264 | end 265 | 266 | local itStart, itEnd, itStep = nil, nil, nil 267 | 268 | if n > 0 then -- Find the beginning of the n'th sequence forwards 269 | 270 | itStart = startPos 271 | itEnd = #str 272 | itStep = 1 273 | 274 | else -- Find the beginning of the n'th sequence backwards 275 | 276 | n = -n 277 | itStart = startPos 278 | itEnd = 1 279 | itStep = -1 280 | 281 | end 282 | 283 | for i = itStart, itEnd, itStep do 284 | local seqStartPos, seqEndPos = decode( str, i ) 285 | 286 | 
if seqStartPos then 287 | 288 | n = n - 1 289 | 290 | if n == 0 then 291 | return seqStartPos 292 | end 293 | 294 | end 295 | end 296 | 297 | return nil 298 | 299 | end 300 | 301 | -- 302 | -- Forces a string to contain only valid UTF-8 data. 303 | -- Each byte that does not begin a valid sequence is replaced with U+FFFD. 304 | -- 305 | function force( str ) 306 | 307 | local buf = {} 308 | 309 | local curPos, endPos = 1, #str 310 | 311 | while curPos <= endPos do -- A while loop, so an empty string yields an empty result instead of an index error 312 | local seqStartPos, seqEndPos = decode( str, curPos ) 313 | 314 | if not seqStartPos then 315 | 316 | table.insert( buf, char( 0xFFFD ) ) 317 | curPos = curPos + 1 318 | 319 | else 320 | 321 | table.insert( buf, str:sub( seqStartPos, seqEndPos ) ) 322 | curPos = seqEndPos + 1 323 | 324 | end 325 | end 326 | 327 | return table.concat( buf, "" ) 328 | 329 | end 330 | --------------------------------------------------------------------------------
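The validation steps that `decode` in utf8.lua walks through (lead-byte range, number of continuation bytes, continuation-byte range) can be sketched as a standalone snippet. This is only an illustration under stated assumptions: `seqLength` and `isValidAt` are hypothetical names that do not exist in utf8.lua, the sketch uses plain comparisons instead of the "bit" library so it should run on stock Lua 5.1 as well as LuaJIT, and it includes an explicit check for sequences cut off by the end of the string.

```lua
-- Hypothetical helper (not part of utf8.lua): given a lead byte, returns the
-- total number of bytes the sequence should occupy, or nil if the byte
-- cannot start a UTF-8 sequence. Uses the same ranges as decode().
local function seqLength( b1 )
	if b1 < 0x80 then return 1 end                   -- ASCII, single byte
	if b1 < 0xC2 or b1 > 0xF4 then return nil end    -- continuation or invalid lead byte
	if b1 >= 0xF0 then return 4 end
	if b1 >= 0xE0 then return 3 end
	return 2
end

-- Hypothetical helper (not part of utf8.lua): true if a structurally valid
-- sequence starts at byte index pos. Like decode(), it does not reject
-- overlong encodings or surrogates.
local function isValidAt( str, pos )
	local b1 = str:byte( pos )
	if not b1 then return false end                  -- pos is outside the string

	local n = seqLength( b1 )
	if not n then return false end

	-- The sequence must fit inside the string
	if pos + n - 1 > #str then return false end

	-- Every continuation byte must lie in 0x80..0xBF
	for i = pos + 1, pos + n - 1 do
		local bX = str:byte( i )
		if bX < 0x80 or bX > 0xBF then return false end
	end

	return true
end
```

Decimal escapes such as `"\195\169"` (the bytes C3 A9, i.e. U+00E9) are a convenient way to build test input, since stock Lua 5.1 does not accept the `\x` hex escapes that LuaJIT does.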