├── .gitignore
├── LICENSE
├── Main
    └── main.c
├── README.md
├── source
    ├── utf-8.c
    └── utf-8.h
└── utf-8.mmake


/.gitignore:
--------------------------------------------------------------------------------
*.o
*.d
*.so
*.exe
*.ini
*.tags*

*.code-workspace
*.sublime-project
*.sublime-workspace

.vscode

build/
notes/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2016 Adrian Guerrero Vera

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/Main/main.c:
--------------------------------------------------------------------------------
#include "utf-8.h"

#include <stdio.h>

int main() {

    const char* String = "Hello World, Γεια σου κόσμο, こんにちは世界, привет мир.";
    const char* Character;

    utf8_iter ITER;

    utf8_init(&ITER, String);

    printf("\nString = %s\n\n", ITER.ptr);

    while (utf8_next(&ITER)) {

        Character = utf8_getchar(&ITER);

        printf("Character = %s\t Codepoint = %u\t\t BYTES: ", Character, ITER.codepoint);

        // Print each byte of the current UTF-8 character.
        for (int i = 0; i < 8; i++) {
            if (Character[i] == 0) break;
            printf("%u ", (unsigned char)Character[i]);
        }

        printf("\n");
    }

    printf("\n");
    printf("Character Count = %u\n", ITER.count);
    printf("Length in BYTES = %u\n", ITER.length);

    return 0;
}

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# UTF8 Iterator

This library iterates over UTF8 strings and converts characters from UTF8 to Unicode and vice versa.

### How does this library work inside?

I have written a document in Spanish that explains how this library works internally. You can read it online or download it as a PDF from this link: [Document in Google Doc](https://docs.google.com/document/d/1sqiEnZnchDRCWZffTAnKsU5Pyc28m3Lvg0UT2o4aClU/edit?usp=sharing)

## How to use the library?

Using UTF8 Iterator is easy: it is built around one structure and a few functions.

```c
#include "utf-8.h"
#include <stdio.h>

int main() {

    const char* String = "Hello World, こんにちは世界, привет мир.";

    utf8_iter ITER;
    utf8_init(&ITER, String);

    while (utf8_next(&ITER)) {

        printf("Character: %s \t Codepoint: %u\n", utf8_getchar(&ITER), ITER.codepoint);

    }
    return 0;
}
```

**`utf8_iter`** is the structure. It contains the following fields:

* **`ptr`** is the original pointer to the character string; it is assigned by `utf8_init()`.
* **`codepoint`** is the Unicode code point of the current character.
* **`size`** is the size in bytes of the current character.
* **`position`** is the byte offset of the current character in the string.
* **`next`** is the byte offset of the next character in the string.
* **`count`** is the number of characters iterated so far.
* **`length`** is the length of the string in bytes, as returned by `strlen()`.

**`utf8_init(iter, string)`** starts or restarts the iterator. The first argument is a pointer to the iterator, and the second argument is the character string.

**`utf8_initEx(iter, string, length)`** works the same as `utf8_init`, but lets the user set a maximum length in bytes for the string.

**`utf8_next(iter)`** advances to the **next** character: it determines the character's size in bytes and converts it to Unicode. `Return: 1 -> Continue, 0 -> End or Error.`

**`utf8_previous(iter)`** steps back to the **previous** character: it determines the character's size in bytes and converts it to Unicode. `Return: 1 -> Continue, 0 -> End or Error.`

**`utf8_getchar(iter)`** returns the character at the iterator position as a UTF8 string `(const char*)`. The returned pointer refers to a static buffer that is overwritten by the next call.

### Other functions

These functions do not require the use of the iterator:

* **`utf8_strlen(string)`** returns the number of **Unicode** characters in `string`. It is different from `strlen()`, which counts bytes.
* **`utf8_strnlen(string, end)`** returns the number of **Unicode** characters in `string` up to byte offset `end`. It is different from `strnlen()`, which counts bytes.
* **`utf8_to_unicode(char*)`** returns the **Unicode** code point of the given UTF8 character.
* **`unicode_to_utf8(codepoint)`** returns a pointer to a string containing the character in **UTF8**.

For internal use or advanced users:

* **`utf8_charsize(char*)`** returns the size in bytes of the provided character.
* **`unicode_charsize(codepoint)`** returns the size in bytes that a Unicode character occupies in a UTF8 string.
* **`utf8_converter(char*, size)`** converts a UTF8 character to Unicode. It does not check the size itself; the caller must provide the character size.
* **`unicode_converter(codepoint, size)`** converts a Unicode character to UTF8. Like `utf8_converter(...)`, it requires the caller to provide the size of the character.

## Compile Example

To compile with GCC, run the following commands in the library folder:

```
mkdir build
gcc -Isource/ -Wall Main/main.c source/utf-8.c -o build/utf-8

In Windows: build\utf-8.exe
In Mac and Linux: ./build/utf-8
```

Tested with GCC, MinGW, Xcode and Visual Studio 2017.

## Issue Report

You can report a problem in English or Spanish.

> Link to GitHub:

## License

**UTF8 Iterator** is distributed under the MIT License. See LICENSE for more information.
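
### Sketch: how the size and conversion helpers fit together

The descriptions of `utf8_charsize`, `utf8_converter`, `unicode_charsize` and `unicode_converter` above can be pictured with the following standalone sketch. It is not the library's code: the names `sketch_charsize`, `sketch_decode` and `sketch_encode` are hypothetical, and the sketch assumes modern UTF8 with at most 4 bytes per character (the library itself also accepts the older 5 and 6 byte forms). It does not reject overlong encodings or surrogate code points.

```c
#include <stdint.h>

/* Hypothetical helper: size in bytes of the UTF8 sequence that starts at s,
   judged from the lead byte only. Returns 0 for NUL or an invalid lead byte. */
uint8_t sketch_charsize(const char *s) {
    unsigned char lead = (unsigned char)s[0];
    if (lead == 0) return 0;
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* continuation byte or invalid lead */
}

/* Hypothetical helper: decode one sequence of `size` bytes into a code point.
   Returns 0 if the size is out of range or a continuation byte is malformed. */
uint32_t sketch_decode(const char *s, uint8_t size) {
    static const uint8_t lead_mask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    if (size == 0 || size > 4) return 0;
    uint32_t cp = (unsigned char)s[0] & lead_mask[size];
    for (uint8_t i = 1; i < size; i++) {
        unsigned char c = (unsigned char)s[i];
        if ((c & 0xC0) != 0x80) return 0; /* must look like 10xxxxxx */
        cp = (cp << 6) | (c & 0x3F);      /* append 6 payload bits */
    }
    return cp;
}

/* Hypothetical helper: encode cp into out (caller provides at least 5 bytes).
   Returns the number of bytes written before the terminator, 0 if cp is out of range. */
uint8_t sketch_encode(uint32_t cp, char out[5]) {
    static const uint8_t lead_bits[] = {0, 0x00, 0xC0, 0xE0, 0xF0};
    uint8_t size;
    if (cp < 0x80) size = 1;
    else if (cp < 0x800) size = 2;
    else if (cp < 0x10000) size = 3;
    else if (cp < 0x110000) size = 4;
    else return 0;
    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (uint8_t i = (uint8_t)(size - 1); i > 0; i--) {
        out[i] = (char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (char)(lead_bits[size] | cp); /* remaining bits go in the lead byte */
    out[size] = '\0';
    return size;
}
```

Unlike the library functions, the encoder in this sketch writes into a caller-provided buffer instead of returning a pointer to a static one, so two results can be held at the same time.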

## Screenshots

###### UTF8 Iterator in Mac and Ubuntu:

![Terminal in Mac](https://image.ibb.co/kAJKpp/Terminal_en_Mac.png)

![Terminal in Ubuntu](https://image.ibb.co/fqnMV8/Terminal_en_Ubuntu.png)

###### UTF8 Iterator in Windows, UTF8 is not supported in CMD :(
![CMD](https://image.ibb.co/jBNoA8/Terminal_en_Windows.png)

###### UTF8 Iterator in Windows with Sublime Text:
![Console Sublime Text](https://image.ibb.co/eHOvq8/Console_Sublime_Text.png)

--------------------------------------------------------------------------------
/source/utf-8.c:
--------------------------------------------------------------------------------
/*
    UTF-8 Iterator. Version 0.1.3

    Original code by Adrian Guerrero Vera (adrianwk94@gmail.com)
    MIT License
    Copyright (c) 2016 Adrian Guerrero Vera

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
*/

#include "utf-8.h"

#include <string.h>

//utf8_iter

void utf8_init(utf8_iter* iter, const char* ptr) {
    if (iter) {
        iter->ptr       = ptr;
        iter->codepoint = 0;
        iter->size      = 0; // so utf8_getchar() is safe before the first utf8_next()
        iter->position  = 0;
        iter->next      = 0;
        iter->count     = 0;
        iter->length    = ptr == NULL ? 0 : (uint32_t)strlen(ptr);
    }
}

void utf8_initEx(utf8_iter* iter, const char* ptr, uint32_t length) {
    if (iter) {
        iter->ptr       = ptr;
        iter->codepoint = 0;
        iter->size      = 0;
        iter->position  = 0;
        iter->next      = 0;
        iter->count     = 0;
        iter->length    = length;
    }
}

uint8_t utf8_next(utf8_iter* iter) {

    if (iter == NULL) return 0;
    if (iter->ptr == NULL) return 0;

    const char* pointer;

    if (iter->next < iter->length) {

        iter->position = iter->next;

        pointer = iter->ptr + iter->next; //Set Current Pointer
        iter->size = utf8_charsize(pointer);

        if (iter->size == 0) return 0;

        // Truncated sequence: stop instead of reading past the end of the string.
        if (iter->size > iter->length - iter->next) {
            iter->size = 0;
            return 0;
        }

        iter->next = iter->next + iter->size;
        iter->codepoint = utf8_converter(pointer, iter->size);

        if (iter->codepoint == 0) return 0;

        iter->count++;

        return 1;
    }
    else {
        iter->position = iter->next;
        return 0;
    }
}

uint8_t utf8_previous(utf8_iter* iter) {

    if (iter == NULL) return 0;
    if (iter->ptr == NULL) return 0;

    // A freshly initialized iterator starts from the end of the string.
    if (iter->length != 0) {
        if (iter->position == 0 && iter->next == 0) {
            iter->position = iter->length;
            iter->count = utf8_strnlen(iter->ptr, iter->length);
        }
    }

    const char* pointer;

    if (iter->position > 0) {

        iter->next = iter->position;
        iter->position--;

        if ((iter->ptr[iter->position] & 0x80) == 0) {
            iter->size = 1;
        }
        else {
            iter->size = 1;
            // Walk back over continuation bytes (10xxxxxx) without going below offset 0.
            while (iter->position > 0 && (iter->ptr[iter->position] & 0xC0) == 0x80 && iter->size < 6) {
                iter->position--;
                iter->size++;
            }
        }

        pointer = iter->ptr + iter->position;

        iter->codepoint = utf8_converter(pointer, iter->size);

        if (iter->codepoint == 0) return 0;

        iter->count--;

        return 1;
    }
    else {
        iter->next = 0;
        return 0;
    }
}

const char* utf8_getchar(utf8_iter* iter) {

    static char str[10]; // shared buffer: overwritten by every call

    str[0] = '\0';

    if (iter == NULL) return str;
    if (iter->ptr == NULL) return str;
    if (iter->size == 0) return str;

    if (iter->size == 1) {
        str[0] = iter->ptr[ iter->position ];
        str[1] = '\0';
        return str;
    }

    const char* pointer = iter->ptr + iter->position;

    for (uint8_t i = 0; i < iter->size; i++) {
        str[i] = pointer[i];
    }

    str[iter->size] = '\0';

    return str;
}

//Utilities

uint32_t utf8_strlen(const char* string) {

    if (string == NULL) return 0;

    uint32_t length = 0;
    uint32_t position = 0;

    while (string[position]) {
        uint8_t size = utf8_charsize(string + position);
        if (size == 0) size = 1; // invalid lead byte: skip one byte so the loop always advances
        position = position + size;
        length++;
    }

    return length;
}

uint32_t utf8_strnlen(const char* string, uint32_t end) {

    if (string == NULL) return 0;

    uint32_t length = 0;
    uint32_t position = 0;

    while (string[position] && position < end) {
        uint8_t size = utf8_charsize(string + position);
        if (size == 0) size = 1; // invalid lead byte: skip one byte so the loop always advances
        position = position + size;
        length++;
    }

    return length;
}

uint32_t utf8_to_unicode(const char* character) {

    if (character == NULL) return 0;
    if (character[0] == 0) return 0;

    uint8_t size = utf8_charsize(character);

    if (size == 0) return 0;

    return utf8_converter(character, size);
}

const char* unicode_to_utf8(uint32_t codepoint) {
    return unicode_converter(codepoint, unicode_charsize(codepoint));
}

//Internal use / Advanced use.

////// UTF8 to Unicode

// The lead byte encodes the sequence length in its high bits.
uint8_t utf8_charsize(const char* character) {

    if (character == NULL) return 0;
    if (character[0] == 0) return 0;

    if ((character[0] & 0x80) == 0) return 1;
    else if ((character[0] & 0xE0) == 0xC0) return 2;
    else if ((character[0] & 0xF0) == 0xE0) return 3;
    else if ((character[0] & 0xF8) == 0xF0) return 4;
    else if ((character[0] & 0xFC) == 0xF8) return 5;
    else if ((character[0] & 0xFE) == 0xFC) return 6;

    return 0;
}

// Payload mask of the lead byte, indexed by sequence size.
static const uint8_t table_unicode[] = {0, 0, 0x1F, 0xF, 0x7, 0x3, 0x1};

uint32_t utf8_converter(const char* character, uint8_t size) {

    if (size == 0 || size > 6) return 0; // table_unicode only covers sizes up to 6
    if (character == NULL) return 0;
    if (character[0] == 0) return 0;

    uint32_t codepoint = 0; // plain local: a static here would make the function non-reentrant

    if (size == 1) {
        return (uint8_t)character[0];
    }

    codepoint = table_unicode[size] & character[0];

    // Each continuation byte contributes its low 6 bits.
    for (uint8_t i = 1; i < size; i++) {
        codepoint = codepoint << 6;
        codepoint = codepoint | (character[i] & 0x3F);
    }

    return codepoint;
}

////// Unicode to UTF8

uint8_t unicode_charsize(uint32_t codepoint) {

    if (codepoint == 0) return 0;

    if (codepoint < 0x80) return 1;
    else if (codepoint < 0x800) return 2;
    else if (codepoint < 0x10000) return 3;
    else if (codepoint < 0x200000) return 4;
    else if (codepoint < 0x4000000) return 5;
    else if (codepoint <= 0x7FFFFFFF) return 6;

    return 0;
}

// Marker bits of the lead byte, indexed by sequence size.
static const uint8_t table_utf8[] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};

const char* unicode_converter(uint32_t codepoint, uint8_t size) {

    static char str[10]; // shared buffer: overwritten by every call

    if (size > 6) size = 0; // invalid size: return an empty string instead of writing out of bounds

    str[size] = '\0';

    if (size == 0) return str;

    if (size == 1) {
        str[0] = codepoint;
        return str;
    }

    // Fill continuation bytes from the end, 6 bits at a time.
    for (uint8_t i = size - 1; i > 0; i--) {
        str[i] = 0x80 | (codepoint & 0x3F);
        codepoint = codepoint >> 6;
    }

    str[0] = table_utf8[size] | codepoint;

    return str;
}

--------------------------------------------------------------------------------
/source/utf-8.h:
--------------------------------------------------------------------------------
/*
    UTF-8 Iterator. Version 0.1.3

    Original code by Adrian Guerrero Vera (adrianwk94@gmail.com)
    MIT License
    Copyright (c) 2016 Adrian Guerrero Vera

    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
*/

#ifndef utf8_iter_H
#define utf8_iter_H

#include <stdint.h>

typedef struct utf8_iter {

    const char* ptr;
    uint32_t    codepoint;

    uint8_t     size;      // character size in bytes
    uint32_t    position;  // current character position (byte offset)
    uint32_t    next;      // next character position (byte offset)
    uint32_t    count;     // number of characters counted so far
    uint32_t    length;    // strlen()

} utf8_iter;

void        utf8_init        (utf8_iter* iter, const char* ptr);                  // sets all values to 0 and sets ptr.
void        utf8_initEx      (utf8_iter* iter, const char* ptr, uint32_t length); // allows you to set a custom length.

uint8_t     utf8_next        (utf8_iter* iter); // returns 1 if there is a character in the next position. If there is not, returns 0.
uint8_t     utf8_previous    (utf8_iter* iter); // returns 1 if there is a character in the previous position. If there is not, returns 0.

const char* utf8_getchar     (utf8_iter* iter); // returns the current character as a UTF8 string (not the same as iter.codepoint, which is the Unicode code point)

// Utilities
uint32_t    utf8_strlen      (const char* string);
uint32_t    utf8_strnlen     (const char* string, uint32_t end);
uint32_t    utf8_to_unicode  (const char* character); // UTF8 to Unicode.
const char* unicode_to_utf8  (uint32_t codepoint);    // Unicode to UTF8.

// Internal use / Advanced use.
uint8_t     utf8_charsize    (const char* character); // calculates the number of bytes a UTF8 character occupies in a string.
uint8_t     unicode_charsize (uint32_t codepoint);    // calculates the number of bytes occupied by a Unicode character in UTF8.

uint32_t    utf8_converter   (const char* character, uint8_t size);
const char* unicode_converter (uint32_t codepoint, uint8_t size);

#endif

--------------------------------------------------------------------------------
/utf-8.mmake:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------