├── doc
    ├── ReleaseNotes
    └── utf8cpp.html
└── source
    ├── utf8.h
    └── utf8
        ├── checked.h
        ├── core.h
        └── unchecked.h

/doc/ReleaseNotes:
--------------------------------------------------------------------------------
1 | utf8 cpp library
2 | Release 2.1
3 |
4 | This is a minor feature release - added the function peek_next.
5 |
6 | Changes from version 2.0
7 | - Implemented feature request [ 1770746 ] "Provide a const version of next() (some sort of a peek() )"
8 |
9 | Files included in the release: utf8.h, core.h, checked.h, unchecked.h, utf8cpp.html, ReleaseNotes
10 |
--------------------------------------------------------------------------------
/doc/utf8cpp.html:
--------------------------------------------------------------------------------
1 | 2 | 3 |
4 | 6 | 8 | 9 | 10 |48 | The Sourceforge project page 49 |
Many C++ developers miss an easy and portable way of handling Unicode encoded strings. The C++ Standard is currently Unicode agnostic, and while some work is being done to introduce Unicode to the next incarnation called C++0x, for the moment nothing of the sort is available. In the meantime, developers use third-party libraries like ICU, OS-specific capabilities, or simply roll their own solutions.

In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small generic library. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the license at the beginning of the utf8.h file. If you run into bugs or performance issues, please let me know and I'll do my best to address them.

The purpose of this article is not to offer an introduction to Unicode in general, and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the Unicode Home Page or some other source of information for Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.
115 |
To illustrate the use of this utf8 library, we shall open a file containing UTF-8 encoded text, check whether it starts with a byte order mark, read each line into a std::string, check it for validity, convert the text to UTF-16, and back to UTF-8:
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"
using namespace std;

int main(int argc, char** argv)
{
    if (argc != 2) {
        cout << "\nUsage: docsample filename\n";
        return 0;
    }
    const char* test_file_path = argv[1];
    // Open the test file (must be UTF-8 encoded)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 0;
    }
    // Read the first line of the file
    unsigned line_count = 1;
    string line;
    if (!getline(fs8, line))
        return 0;
    // Look for utf-8 byte-order mark at the beginning
    if (line.size() > 2) {
        if (utf8::is_bom(line.c_str()))
            cout << "There is a byte order mark at the beginning of the file\n";
    }
    // Play with all the lines in the file
    do {
        // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
        }
        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count << " is " << length << "\n";
        // Convert it to utf-16
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
        // And back to utf-8
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
        // Confirm that the conversion went OK:
        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";
        line_count++;
        // Testing getline directly also processes a last line that has no trailing newline
    } while (getline(fs8, line));
    return 0;
}
In the previous code sample, we have seen the use of the following functions from the utf8 namespace: first we used the is_bom function to detect a UTF-8 byte order mark at the beginning of the file; then for each line we detected invalid UTF-8 sequences with find_invalid; the number of characters (more precisely, the number of Unicode code points) in each line was determined with utf8::distance; finally, we converted each line to UTF-16 encoding with utf8to16 and back to UTF-8 with utf16to8.
199 |
210 | Available in version 1.0 and later. 211 |
212 |213 | Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence 214 | to a UTF-8 string. 215 |
216 |217 | template <typename octet_iterator> 219 | octet_iterator append(uint32_t cp, octet_iterator result); 220 | 221 |222 |
223 | cp
: A 32 bit integer representing a code point to append to the
224 | sequence.
225 | result
: An output iterator to the place in the sequence where to
226 | append the code point.
227 | Return value: An iterator pointing to the place
228 | after the newly appended sequence.
229 |
231 | Example of use: 232 |
233 |234 | unsigned char u[5] = {0,0,0,0,0}; 237 | unsigned char* end = append(0x0448, u); 239 | assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0); 245 |246 |
247 | Note that append
does not allocate any memory - it is the burden of
248 | the caller to make sure there is enough memory allocated for the operation. To make
249 | things more interesting, append
can add anywhere between 1 and 4
250 | octets to the sequence. In practice, you would most often want to use
251 | std::back_inserter
to ensure that the necessary memory is allocated.
252 |
254 | In case of an invalid code point, a utf8::invalid_code_point
exception
255 | is thrown.
256 |
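To make the "between 1 and 4 octets" remark concrete, here is a minimal self-contained sketch of the encoding step, written through an output iterator so that std::back_inserter can grow the container. The names sketch_append and encode_one are hypothetical, and std::invalid_argument stands in for the library's own exception; this is not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the library): encodes one code point as
// UTF-8 and writes the resulting octets through an output iterator.
template <typename OutputIt>
OutputIt sketch_append(std::uint32_t cp, OutputIt out)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        throw std::invalid_argument("invalid code point");

    if (cp < 0x80) {                 // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xc0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xe0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        *out++ = static_cast<char>(0x80 | (cp & 0x3f));
    } else {                         // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xf0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        *out++ = static_cast<char>(0x80 | (cp & 0x3f));
    }
    return out;
}

// Usage: the string grows as octets are written, so no buffer size
// has to be guessed in advance.
inline std::string encode_one(std::uint32_t cp)
{
    std::string out;
    sketch_append(cp, std::back_inserter(out));
    return out;
}
```

With the real library the call has the same shape: pass std::back_inserter(container) as the result argument instead of a raw pointer into a fixed-size buffer.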
261 | Available in version 1.0 and later. 262 |
263 |264 | Given the iterator to the beginning of the UTF-8 sequence, it returns the code 265 | point and moves the iterator to the next position. 266 |
267 |268 | template <typename octet_iterator> 270 | uint32_t next(octet_iterator& it, octet_iterator end); 271 | 272 |273 |
it : a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
end : end of the UTF-8 sequence to be processed. If it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-8 code point.
282 |
284 | Example of use: 285 |
286 |287 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 289 | char* w = twochars; 290 | int cp = next(w, twochars + 6); 291 | assert (cp == 0x65e5); 292 | assert (w == twochars + 3); 293 |294 |
295 | This function is typically used to iterate through a UTF-8 encoded string. 296 |
297 |
In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.
300 |
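The decoding step described above can be sketched in a few lines. The following is a simplified, self-contained illustration and not the library's code: sketch_next is a hypothetical name, standard exceptions stand in for utf8::not_enough_room and utf8::invalid_utf8, and checks for overlong forms, surrogates and out-of-range values are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: decodes the code point starting at `it` and moves
// `it` to the beginning of the next code point.
template <typename It>
std::uint32_t sketch_next(It& it, It end)
{
    if (it == end)
        throw std::out_of_range("not enough room");

    const std::uint32_t lead = static_cast<unsigned char>(*it);
    int trailing = 0;        // number of 10xxxxxx octets that must follow
    std::uint32_t cp = 0;    // payload bits collected so far

    if (lead < 0x80)              { trailing = 0; cp = lead; }
    else if ((lead >> 5) == 0x06) { trailing = 1; cp = lead & 0x1f; }
    else if ((lead >> 4) == 0x0e) { trailing = 2; cp = lead & 0x0f; }
    else if ((lead >> 3) == 0x1e) { trailing = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid lead octet");

    It cur = it;
    ++cur;
    for (int i = 0; i < trailing; ++i, ++cur) {
        if (cur == end)
            throw std::out_of_range("not enough room");
        const std::uint32_t octet = static_cast<unsigned char>(*cur);
        if ((octet >> 6) != 0x02)
            throw std::runtime_error("invalid trail octet");
        cp = (cp << 6) | (octet & 0x3f);
    }

    it = cur;   // the caller's iterator moves only after a successful decode
    return cp;
}
```

Calling such a function in a loop until the iterator equals end is the forward iteration pattern that next is typically used for.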
305 | Available in version 2.1 and later. 306 |
307 |308 | Given the iterator to the beginning of the UTF-8 sequence, it returns the code 309 | point for the following sequence without changing the value of the iterator. 310 |
311 |312 | template <typename octet_iterator> 314 | uint32_t peek_next(octet_iterator it, octet_iterator end); 315 | 316 |317 |
it : an iterator pointing to the beginning of a UTF-8 encoded code point.
end : end of the UTF-8 sequence to be processed. If it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-8 code point.
325 |
327 | Example of use: 328 |
329 |330 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 332 | char* w = twochars; 333 | int cp = peek_next(w, twochars + 6); 334 | assert (cp == 0x65e5); 335 | assert (w == twochars); 336 |337 |
In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.
340 |
345 | Available in version 1.02 and later. 346 |
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.
352 |353 | template <typename octet_iterator> 355 | uint32_t prior(octet_iterator& it, octet_iterator start); 356 | 357 |358 |
359 | it
: a reference pointing to an octet within a UTF-8 encoded string.
360 | After the function returns, it is decremented to point to the beginning of the
361 | previous code point.
362 | start
: an iterator to the beginning of the sequence where the search
363 | for the beginning of a code point is performed. It is a
364 | safety measure to prevent passing the beginning of the string in the search for a
365 | UTF-8 lead octet.
366 | Return value: the 32 bit representation of the
367 | previous code point.
368 |
370 | Example of use: 371 |
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;   // same pointer type as twochars
int cp = prior (w, twochars);
assert (cp == 0x65e5);
assert (w == twochars);
This function has two purposes: one is to iterate backwards through a UTF-8 encoded string. Note that it is usually a better idea to iterate forward instead, since utf8::next is faster. The second purpose is to find the beginning of a UTF-8 sequence if we have a random position within a string.
386 |
388 | it
will typically point to the beginning of
389 | a code point, and start
will point to the
390 | beginning of the string to ensure we don't go backwards too far. it
is
391 | decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
392 | beginning with that octet is decoded to a 32 bit representation and returned.
393 |
In case start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.
398 |
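The backward search for a lead octet can be illustrated with a short self-contained sketch. It only shows the stepping logic and the role of start as a safety bound; it does not decode or validate the sequence it finds. The names is_trail_octet and sketch_find_prior_lead are hypothetical, and standard exceptions are used in place of the library's.

```cpp
#include <cassert>
#include <stdexcept>

// True for continuation octets, which have the bit pattern 10xxxxxx.
inline bool is_trail_octet(char c)
{
    return (static_cast<unsigned char>(c) >> 6) == 0x02;
}

// Hypothetical helper: starting just after (or inside) a code point, steps
// backwards until a non-continuation octet is found and returns its position.
// `start` bounds the search so it never runs past the beginning of the string.
template <typename It>
It sketch_find_prior_lead(It it, It start)
{
    if (it == start)
        throw std::out_of_range("already at the start of the sequence");

    while (is_trail_octet(*--it)) {
        // A continuation octet at the very start means no lead octet exists.
        if (it == start)
            throw std::runtime_error("no lead octet found");
    }
    return it;
}
```

Because the search stops at the first non-continuation octet, the same function works whether the starting position is the beginning of a code point or somewhere in the middle of one.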
403 | Deprecated in version 1.02 and later. 404 |
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.
410 |411 | template <typename octet_iterator> 413 | uint32_t previous(octet_iterator& it, octet_iterator pass_start); 414 | 415 |416 |
417 | it
: a reference pointing to an octet within a UTF-8 encoded string.
418 | After the function returns, it is decremented to point to the beginning of the
419 | previous code point.
420 | pass_start
: an iterator to the point in the sequence where the search
421 | for the beginning of a code point is aborted if no result was reached. It is a
422 | safety measure to prevent passing the beginning of the string in the search for a
423 | UTF-8 lead octet.
424 | Return value: the 32 bit representation of the
425 | previous code point.
426 |
428 | Example of use: 429 |
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;   // same pointer type as twochars
int cp = previous (w, twochars - 1);
assert (cp == 0x65e5);
assert (w == twochars);
441 | utf8::previous
is deprecated, and utf8::prior
should
442 | be used instead, although the existing code can continue using this function.
443 | The problem is the parameter pass_start
that points to the position
444 | just before the beginning of the sequence. Standard containers don't have the
445 | concept of "pass start" and the function can not be used with their iterators.
446 |
448 | it
will typically point to the beginning of
449 | a code point, and pass_start
will point to the octet just before the
450 | beginning of the string to ensure we don't go backwards too far. it
is
451 | decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence
452 | beginning with that octet is decoded to a 32 bit representation and returned.
453 |
In case pass_start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.
458 |
463 | Available in version 1.0 and later. 464 |
465 |466 | Advances an iterator by the specified number of code points within an UTF-8 467 | sequence. 468 |
469 |470 | template <typename octet_iterator, typename distance_type> 472 | void advance (octet_iterator& it, distance_type n, octet_iterator end); 474 | 475 |476 |
it : a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
n : a positive integer that shows how many code points we want to advance.
end : end of the UTF-8 sequence to be processed. If it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
485 |
487 | Example of use: 488 |
char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;   // same pointer type as twochars
advance (w, 2, twochars + 6);
assert (w == twochars + 5);
497 | This function works only "forward". In case of a negative n
, there is
498 | no effect.
499 |
501 | In case of an invalid code point, a utf8::invalid_code_point
exception
502 | is thrown.
503 |
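The forward-only behaviour can be seen in a small self-contained sketch that skips n code points by stepping over one lead octet and any continuation octets that follow it. The name sketch_advance is hypothetical, std::out_of_range stands in for utf8::not_enough_room, and no validation of the octets is attempted, unlike in the checked library function.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: moves `it` forward by `n` code points, assuming the
// input is well-formed UTF-8. A zero or negative `n` leaves `it` unchanged.
template <typename It>
void sketch_advance(It& it, std::ptrdiff_t n, It end)
{
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (it == end)
            throw std::out_of_range("not enough room");
        ++it;   // step over the lead octet
        // then over every continuation octet (10xxxxxx) that belongs to it
        while (it != end && (static_cast<unsigned char>(*it) & 0xc0) == 0x80)
            ++it;
    }
}
```

The loop simply does not execute for a negative n, which mirrors the "no effect" behaviour described above.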
508 | Available in version 1.0 and later. 509 |
Given the iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.
514 |515 | template <typename octet_iterator> 517 | typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last); 519 | 520 |521 |
first : an iterator to the beginning of a UTF-8 encoded code point.
last : an iterator to the "post-end" of the last UTF-8 encoded code point in the sequence whose length we are trying to determine. It can be the beginning of a new code point, or not.
Return value: the distance between the iterators, in code points.
528 |
530 | Example of use: 531 |
532 |533 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 535 | size_t dist = utf8::distance(twochars, twochars + 5); 536 | assert (dist == 2); 537 |538 |
This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called distance, rather than, say, length, is mainly that developers are used to length being an O(1) operation. Computing the length of a UTF-8 string is a linear operation, and it looked better to model it after the std::distance algorithm.
544 |
In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If last does not point past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
549 |
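To see why the length computation is linear, consider this minimal self-contained sketch. It assumes the input is already known to be valid UTF-8 and counts every octet that is not a continuation octet; the library function instead decodes each code point and reports errors. The name sketch_distance is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts code points in [first, last) by counting the
// octets that start a code point. Every octet is visited once, so the cost
// grows linearly with the length of the string in octets.
template <typename It>
std::ptrdiff_t sketch_distance(It first, It last)
{
    std::ptrdiff_t count = 0;
    for (; first != last; ++first) {
        const unsigned char octet = static_cast<unsigned char>(*first);
        if ((octet & 0xc0) != 0x80)   // anything except 10xxxxxx starts a code point
            ++count;
    }
    return count;
}
```

For the five-octet string used in the example above, the sketch visits all five octets and finds two that start a code point.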
554 | Available in version 1.0 and later. 555 |
556 |557 | Converts a UTF-16 encoded string to UTF-8. 558 |
559 |560 | template <typename u16bit_iterator, typename octet_iterator> 563 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result); 564 | 565 |566 |
567 | start
: an iterator pointing to the beginning of the UTF-16 encoded
568 | string to convert.
569 | end
: an iterator pointing to pass-the-end of the UTF-16 encoded
570 | string to convert.
571 | result
: an output iterator to the place in the UTF-8 string where to
572 | append the result of conversion.
573 | Return value: An iterator pointing to the place
574 | after the appended UTF-8 string.
575 |
577 | Example of use: 578 |
579 |580 | unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e}; 584 | vector<unsigned char> utf8result; 585 | utf16to8(utf16string, utf16string + 5, back_inserter(utf8result)); 587 | assert (utf8result.size() == 10); 588 |589 |
590 | In case of invalid UTF-16 sequence, a utf8::invalid_utf16
exception is
591 | thrown.
592 |
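The example input ends with the pair 0xd834, 0xdd1e, which is a UTF-16 surrogate pair. As a sketch of the arithmetic any UTF-16 decoder has to perform before the UTF-8 encoding step, the following self-contained functions combine such a pair into a single code point. The function names are hypothetical and the code is not taken from the library.

```cpp
#include <cassert>
#include <cstdint>

// Lead (high) surrogates occupy 0xD800..0xDBFF, trail (low) ones 0xDC00..0xDFFF.
inline bool is_lead_surrogate(std::uint16_t u)  { return u >= 0xd800 && u <= 0xdbff; }
inline bool is_trail_surrogate(std::uint16_t u) { return u >= 0xdc00 && u <= 0xdfff; }

// Hypothetical helper: combines a surrogate pair into a code point in the
// range U+10000..U+10FFFF. The caller is expected to have checked both units.
inline std::uint32_t combine_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xd800u) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xdc00u);
}
```

The pair from the example combines to U+1D11E, which needs four UTF-8 octets. Together with one octet for 0x41, two for 0x0448 and three for 0x65e5, that accounts for the ten octets the example asserts.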
597 | Available in version 1.0 and later. 598 |
Converts a UTF-8 encoded string to UTF-16.
602 |603 | template <typename u16bit_iterator, typename octet_iterator> 605 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result); 606 | 607 |608 |
start : an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end : an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result : an output iterator to the place in the UTF-16 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-16 string.
616 |
618 | Example of use: 619 |
620 |621 | char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e"; 623 | vector <unsigned short> utf16result; 624 | utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result)); 626 | assert (utf16result.size() == 4); 627 | assert (utf16result[2] == 0xd834); 629 | assert (utf16result[3] == 0xdd1e); 631 |632 |
In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
636 |
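The last two asserts in the example check for a surrogate pair, because the final code point in the input lies above U+FFFF and cannot fit in one 16 bit unit. The following self-contained sketch shows the splitting arithmetic under that assumption; split_to_surrogates is a hypothetical name and the code is not the library's.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a code point in U+10000..U+10FFFF into a
// (lead, trail) surrogate pair. Smaller code points need no splitting and
// must not be passed in.
inline std::pair<std::uint16_t, std::uint16_t> split_to_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000u;   // 20 significant bits remain
    const std::uint16_t lead  = static_cast<std::uint16_t>(0xd800u + (v >> 10));
    const std::uint16_t trail = static_cast<std::uint16_t>(0xdc00u + (v & 0x3ffu));
    return std::make_pair(lead, trail);
}
```

For U+1D11E, the code point encoded by the last four octets of the example string, this yields 0xd834 and 0xdd1e.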
641 | Available in version 1.0 and later. 642 |
643 |644 | Converts a UTF-32 encoded string to UTF-8. 645 |
646 |647 | template <typename octet_iterator, typename u32bit_iterator> 649 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result); 650 | 651 |652 |
653 | start
: an iterator pointing to the beginning of the UTF-32 encoded
654 | string to convert.
655 | end
: an iterator pointing to pass-the-end of the UTF-32 encoded
656 | string to convert.
657 | result
: an output iterator to the place in the UTF-8 string where to
658 | append the result of conversion.
659 | Return value: An iterator pointing to the place
660 | after the appended UTF-8 string.
661 |
663 | Example of use: 664 |
665 |666 | int utf32string[] = {0x448, 0x65E5, 0x10346, 0}; 669 | vector<unsigned char> utf8result; 670 | utf32to8(utf32string, utf32string + 3, back_inserter(utf8result)); 672 | assert (utf8result.size() == 9); 673 |674 |
675 | In case of invalid UTF-32 string, a utf8::invalid_code_point
exception
676 | is thrown.
677 |
682 | Available in version 1.0 and later. 683 |
684 |685 | Converts a UTF-8 encoded string to UTF-32. 686 |
687 |688 | template <typename octet_iterator, typename u32bit_iterator> 691 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result); 692 | 693 |694 |
695 | start
: an iterator pointing to the beginning of the UTF-8 encoded
696 | string to convert.
697 | end
: an iterator pointing to pass-the-end of the UTF-8 encoded string
698 | to convert.
699 | result
: an output iterator to the place in the UTF-32 string where to
700 | append the result of conversion.
701 | Return value: An iterator pointing to the place
702 | after the appended UTF-32 string.
703 |
705 | Example of use: 706 |
707 |708 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 710 | vector<int> utf32result; 711 | utf8to32(twochars, twochars + 5, back_inserter(utf32result)); 713 | assert (utf32result.size() == 2); 714 |715 |
In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
719 |
724 | Available in version 1.0 and later. 725 |
726 |727 | Detects an invalid sequence within a UTF-8 string. 728 |
729 |730 | template <typename octet_iterator> 732 | octet_iterator find_invalid(octet_iterator start, octet_iterator end); 733 |734 |
735 | start
: an iterator pointing to the beginning of the UTF-8 string to
736 | test for validity.
737 | end
: an iterator pointing to pass-the-end of the UTF-8 string to test
738 | for validity.
739 | Return value: an iterator pointing to the first
740 | invalid octet in the UTF-8 string. In case none were found, equals
741 | end
.
742 |
744 | Example of use: 745 |
746 |747 | char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa"; 749 | char* invalid = find_invalid(utf_invalid, utf_invalid + 6); 752 | assert (invalid == utf_invalid + 5); 753 |754 |
This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it before doing any of the unchecked operations on it.
759 |763 | Available in version 1.0 and later. 764 |
765 |766 | Checks whether a sequence of octets is a valid UTF-8 string. 767 |
768 |769 | template <typename octet_iterator> 771 | bool is_valid(octet_iterator start, octet_iterator end); 772 | 773 |774 |
775 | start
: an iterator pointing to the beginning of the UTF-8 string to
776 | test for validity.
777 | end
: an iterator pointing to pass-the-end of the UTF-8 string to test
778 | for validity.
779 | Return value: true
if the sequence
780 | is a valid UTF-8 string; false
if not.
781 |
784 | char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa"; 786 | bool bvalid = is_valid(utf_invalid, utf_invalid + 6); 788 | assert (bvalid == false); 789 |790 |
is_valid is a shorthand for find_invalid(start, end) == end. You may want to use it to make sure that a byte sequence is a valid UTF-8 string without the need to know where it fails if it is not valid.
794 |
799 | Available in version 2.0 and later. 800 |
801 |802 | Replaces all invalid UTF-8 sequences within a string with a replacement marker. 803 |
804 |805 | template <typename octet_iterator, typename output_iterator> 808 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement); 809 | template <typename octet_iterator, typename output_iterator> 812 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out); 813 | 814 |815 |
816 | start
: an iterator pointing to the beginning of the UTF-8 string to
817 | look for invalid UTF-8 sequences.
818 | end
: an iterator pointing to pass-the-end of the UTF-8 string to look
819 | for invalid UTF-8 sequences.
820 | out
: An output iterator to the range where the result of replacement
821 | is stored.
822 | replacement
: A Unicode code point for the replacement marker. The
823 | version without this parameter assumes the value 0xfffd
824 | Return value: An iterator pointing to the place
825 | after the UTF-8 string with replaced invalid sequences.
826 |
828 | Example of use: 829 |
char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
vector<char> replace_invalid_result;
replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
// bvalid is declared here so that the snippet stands on its own
bool bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
assert (bvalid);
char* fixed_invalid_sequence = "a????z";
assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
replace_invalid does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, out must not be in the [start, end) range.
847 |
849 | If end
does not point to the past-of-end of a UTF-8 sequence, a
850 | utf8::not_enough_room
exception is thrown.
851 |
856 | Available in version 1.0 and later. 857 |
858 |859 | Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM) 860 |
861 |862 | template <typename octet_iterator> 864 | bool is_bom (octet_iterator it); 865 |866 |
867 | it
: beginning of the 3-octet sequence to check
868 | Return value: true
if the sequence
869 | is UTF-8 byte order mark; false
if not.
870 |
872 | Example of use: 873 |
874 |875 | unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf}; 878 | bool bbom = is_bom(byte_order_mark); 879 | assert (bbom == true); 880 |881 |
882 | The typical use of this function is to check the first three bytes of a file. If 883 | they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 884 | encoded text. 885 |
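Skipping the mark can be sketched without the library as follows. The helper bom_length is a hypothetical name; it performs the same three-octet comparison that is_bom is described as doing, plus a length check so that short lines are safe to test.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns 3 if `line` starts with the UTF-8 byte order
// mark (EF BB BF) and 0 otherwise, i.e. the number of octets to skip.
inline std::size_t bom_length(const std::string& line)
{
    const bool has_bom = line.size() >= 3
        && static_cast<unsigned char>(line[0]) == 0xef
        && static_cast<unsigned char>(line[1]) == 0xbb
        && static_cast<unsigned char>(line[2]) == 0xbf;
    return has_bom ? 3 : 0;
}

// Usage: drop the mark from the first line of a file before processing it.
inline std::string without_bom(const std::string& first_line)
{
    return first_line.substr(bom_length(first_line));
}
```

Only the first line of a file needs this treatment; a BOM is not expected anywhere else in the text.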
886 |893 | Available in version 2.0 and later. 894 |
895 |896 | Adapts the underlying octet iterator to iterate over the sequence of code points, 897 | rather than raw octets. 898 |
899 |900 | template <typename octet_iterator> 901 | class iterator; 902 |903 | 904 |
iterator(); : the default constructor; the underlying octet_iterator is constructed with its default constructor.
explicit iterator (const octet_iterator& octet_it, const octet_iterator& range_start, const octet_iterator& range_end); : a constructor that initializes the underlying octet_iterator with octet_it and sets the range in which the iterator is considered valid.
octet_iterator base () const; : returns the underlying octet_iterator.
uint32_t operator * () const; : decodes the UTF-8 sequence the underlying octet_iterator is pointing to and returns the code point.
bool operator == (const iterator& rhs) const;
bool operator != (const iterator& rhs) const;
iterator& operator ++ ();
iterator operator ++ (int);
iterator& operator -- ();
iterator operator -- (int);
933 | Example of use: 934 |
935 |936 | char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88"; 937 | utf8::iterator<char*> it(threechars, threechars, threechars + 9); 938 | utf8::iterator<char*> it2 = it; 939 | assert (it2 == it); 940 | assert (*it == 0x10346); 941 | assert (*(++it) == 0x65e5); 942 | assert ((*it++) == 0x65e5); 943 | assert (*it == 0x0448); 944 | assert (it != it2); 945 | utf8::iterator<char*> endit (threechars + 9, threechars, threechars + 9); 946 | assert (++it == endit); 947 | assert (*(--it) == 0x0448); 948 | assert ((*it--) == 0x0448); 949 | assert (*it == 0x65e5); 950 | assert (--it == utf8::iterator<char*>(threechars, threechars, threechars + 9)); 951 | assert (*it == 0x10346); 952 |953 |
954 | The purpose of utf8::iterator
adapter is to enable easy iteration as well as the use of STL
955 | algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of
956 | utf8::next()
and utf8::prior()
functions.
957 |
Note that the utf8::iterator adapter is a checked iterator. It operates on the range specified in the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators require both iterator objects to be constructed against the same range - otherwise an exception is thrown. Typically, the range will be determined by the sequence container functions begin and end, i.e.:
963 |
std::string s = "example";
utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());
967 |
968 | 975 | Available in version 1.0 and later. 976 |
977 |978 | Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence 979 | to a UTF-8 string. 980 |
981 |982 | template <typename octet_iterator> 984 | octet_iterator append(uint32_t cp, octet_iterator result); 985 | 986 |987 |
988 | cp
: A 32 bit integer representing a code point to append to the
989 | sequence.
990 | result
: An output iterator to the place in the sequence where to
991 | append the code point.
992 | Return value: An iterator pointing to the place
993 | after the newly appended sequence.
994 |
996 | Example of use: 997 |
998 |999 | unsigned char u[5] = {0,0,0,0,0}; 1002 | unsigned char* end = unchecked::append(0x0448, u); 1004 | assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0); 1010 |1011 |
1012 | This is a faster but less safe version of utf8::append
. It does not
1013 | check for validity of the supplied code point, and may produce an invalid UTF-8
1014 | sequence.
1015 |
1020 | Available in version 1.0 and later. 1021 |
1022 |1023 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point 1024 | and moves the iterator to the next position. 1025 |
1026 |1027 | template <typename octet_iterator> 1029 | uint32_t next(octet_iterator& it); 1030 | 1031 |1032 |
1033 | it
: a reference to an iterator pointing to the beginning of an UTF-8
1034 | encoded code point. After the function returns, it is incremented to point to the
1035 | beginning of the next code point.
1036 | Return value: the 32 bit representation of the
1037 | processed UTF-8 code point.
1038 |
1040 | Example of use: 1041 |
1042 |1043 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 1045 | char* w = twochars; 1046 | int cp = unchecked::next(w); 1047 | assert (cp == 0x65e5); 1048 | assert (w == twochars + 3); 1049 |1050 |
1051 | This is a faster but less safe version of utf8::next
. It does not
1052 | check for validity of the supplied UTF-8 sequence.
1053 |
1058 | Available in version 2.1 and later. 1059 |
1060 |1061 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point. 1062 |
1063 |1064 | template <typename octet_iterator> 1066 | uint32_t peek_next(octet_iterator it); 1067 | 1068 |1069 |
1070 | it
: an iterator pointing to the beginning of an UTF-8
1071 | encoded code point.
1072 | Return value: the 32 bit representation of the
1073 | processed UTF-8 code point.
1074 |
1076 | Example of use: 1077 |
1078 |1079 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 1081 | char* w = twochars; 1082 | int cp = unchecked::peek_next(w); 1083 | assert (cp == 0x65e5); 1084 | assert (w == twochars); 1085 |1086 |
1087 | This is a faster but less safe version of utf8::peek_next
. It does not
1088 | check for validity of the supplied UTF-8 sequence.
1089 |
1094 | Available in version 1.02 and later. 1095 |
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.
1101 |1102 | template <typename octet_iterator> 1104 | uint32_t prior(octet_iterator& it); 1105 | 1106 |1107 |
1108 | it
: a reference pointing to an octet within a UTF-8 encoded string.
1109 | After the function returns, it is decremented to point to the beginning of the
1110 | previous code point.
1111 | Return value: the 32 bit representation of the
1112 | previous code point.
1113 |
1115 | Example of use: 1116 |
1117 |1118 | char* twochars = "\xe6\x97\xa5\xd1\x88"; 1120 | char* w = twochars + 3; 1121 | int cp = unchecked::prior (w); 1122 | assert (cp == 0x65e5); 1123 | assert (w == twochars); 1124 |1125 |
1126 | This is a faster but less safe version of utf8::prior
. It does not
1127 | check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1128 |
1133 | Deprecated in version 1.02 and later. 1134 |
Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.
1140 |1141 | template <typename octet_iterator> 1143 | uint32_t previous(octet_iterator& it); 1144 | 1145 |1146 |
1147 | it
: a reference pointing to an octet within a UTF-8 encoded string.
1148 | After the function returns, it is decremented to point to the beginning of the
1149 | previous code point.
1150 | Return value: the 32 bit representation of the
1151 | previous code point.
1152 |
1154 | Example of use: 1155 |
1156 |
1157 | const char* twochars = "\xe6\x97\xa5\xd1\x88";
1159 | const char* w = twochars + 3;
1160 | int cp = unchecked::previous (w);
1161 | assert (cp == 0x65e5);
1162 | assert (w == twochars);
1163 |
1164 |
1165 | This function is deprecated only for consistency with the "checked"
1166 | versions, where prior
should be used instead of previous
.
1167 | In fact, unchecked::previous
behaves exactly the same as
1168 | unchecked::prior.
1169 |
1171 | This is a faster but less safe version of utf8::previous
. It does not
1172 | check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1173 |
1178 | Available in version 1.0 and later. 1179 |
1180 |1181 | Advances an iterator by the specified number of code points within a UTF-8 1182 | sequence. 1183 |
1184 |
1185 | template <typename octet_iterator, typename distance_type>
1187 | void advance (octet_iterator& it, distance_type n);
1188 |
1189 |
1190 |
1191 | it
: a reference to an iterator pointing to the beginning of a UTF-8
1192 | encoded code point. After the function returns, it is incremented to point to the
1193 | nth following code point.
1194 | n
: a positive integer that shows how many code points we want to
1195 | advance.
1196 |
1198 | Example of use: 1199 |
1200 |
1201 | const char* twochars = "\xe6\x97\xa5\xd1\x88";
1203 | const char* w = twochars;
1204 | unchecked::advance (w, 2);
1205 | assert (w == twochars + 5);
1206 |
1207 |
1208 | This function works only "forward". In case of a negative n
, there is
1209 | no effect.
1210 |
1212 | This is a faster but less safe version of utf8::advance
. It does not
1213 | check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1214 |
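As an illustration of why an unchecked advance can be fast, the sketch below skips whole sequences by looking only at each lead octet. It assumes well-formed input and a `const char*` iterator; `sketch_sequence_length` and `sketch_advance` are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of octets in the sequence that starts with `lead`.
// Assumes `lead` really is a valid lead octet; nothing is verified.
inline std::size_t sketch_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0e) return 3;
    return 4;
}

// Hypothetical stand-in for an unchecked "advance". The loop body never runs
// for n <= 0, so a negative n has no effect, matching the note above.
inline void sketch_advance(const char*& it, long n) {
    for (long i = 0; i < n; ++i)
        it += sketch_sequence_length(static_cast<unsigned char>(*it));
}

// Mirrors the example of use above.
inline void sketch_advance_demo() {
    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    const char* w = twochars;
    sketch_advance(w, 2);
    assert(w == twochars + 5);
}
```

Since there is no end iterator, asking for more code points than the string holds walks past the buffer; that is the missing boundary check.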
1219 | Available in version 1.0 and later. 1220 |
1221 |1222 | Given the iterators to two UTF-8 encoded code points in a sequence, returns the 1223 | number of code points between them. 1224 |
1225 |
1226 | template <typename octet_iterator>
1228 | typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
1230 |
1231 |
1232 | first
: an iterator to the beginning of a UTF-8 encoded code point.
1233 | last
: an iterator one past the end of the last UTF-8 encoded code
1234 | point in the sequence whose length we want to determine. It can be the
1235 | beginning of a new code point, or not.
1236 | Return value: the distance between the iterators,
1237 | in code points.
1238 |
1240 | Example of use: 1241 |
1242 |
1243 | const char* twochars = "\xe6\x97\xa5\xd1\x88";
1245 | size_t dist = utf8::unchecked::distance(twochars, twochars + 5);
1247 | assert (dist == 2);
1248 |
1249 |
1250 | This is a faster but less safe version of utf8::distance
. It does not
1251 | check for validity of the supplied UTF-8 sequence.
1252 |
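For well-formed input, the number of code points in a range equals the number of octets that are not trail octets. The sketch below uses that counting technique, which is a swapped-in illustration and not necessarily how the library computes the distance; `sketch_distance` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an unchecked "distance": counts every octet in
// [first, last) whose top two bits are not 10, i.e. every lead octet.
// Assumes well-formed UTF-8; malformed input simply yields a meaningless count.
inline std::ptrdiff_t sketch_distance(const char* first, const char* last) {
    std::ptrdiff_t count = 0;
    for (; first < last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xc0) != 0x80)
            ++count;
    }
    return count;
}

// Mirrors the example of use above.
inline void sketch_distance_demo() {
    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    assert(sketch_distance(twochars, twochars + 5) == 2);
}
```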
1257 | Available in version 1.0 and later. 1258 |
1259 |1260 | Converts a UTF-16 encoded string to UTF-8. 1261 |
1262 |
1263 | template <typename u16bit_iterator, typename octet_iterator>
1266 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
1267 |
1268 |
1269 |
1270 | start
: an iterator pointing to the beginning of the UTF-16 encoded
1271 | string to convert.
1272 | end
: an iterator pointing one past the end of the UTF-16 encoded
1273 | string to convert.
1274 | result
: an output iterator to the place in the UTF-8 string where to
1275 | append the result of conversion.
1276 | Return value: An iterator pointing to the place
1277 | after the appended UTF-8 string.
1278 |
1280 | Example of use: 1281 |
1282 |
1283 | unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
1287 | vector<unsigned char> utf8result;
1288 | unchecked::utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
1290 | assert (utf8result.size() == 10);
1291 |
1292 |
1293 | This is a faster but less safe version of utf8::utf16to8
. It does not
1294 | check for validity of the supplied UTF-16 sequence.
1295 |
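The expected size of 10 in the example above follows from the per-code-point UTF-8 lengths (1 + 2 + 3 + 4), with the last two UTF-16 units forming one surrogate pair. The following sketch works through that arithmetic; all function names are hypothetical, and the code assumes that every lead surrogate is followed by a trail surrogate, which is exactly what an unchecked conversion takes on trust.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: combines a surrogate pair into one code point.
inline std::uint32_t sketch_combine_surrogates(std::uint16_t lead, std::uint16_t trail) {
    return 0x10000u + ((static_cast<std::uint32_t>(lead) - 0xd800u) << 10)
                    + (static_cast<std::uint32_t>(trail) - 0xdc00u);
}

// Hypothetical helper: number of octets needed to encode `cp` in UTF-8.
inline std::size_t sketch_utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Counts the octets a UTF-16 to UTF-8 conversion would produce.
// No validation: a lead surrogate at the very end would read past `end`.
inline std::size_t sketch_utf16to8_size(const std::uint16_t* start, const std::uint16_t* end) {
    std::size_t octets = 0;
    while (start != end) {
        std::uint32_t cp = *start++;
        if (cp >= 0xd800 && cp <= 0xdbff)  // lead surrogate: consume the trail unit too
            cp = sketch_combine_surrogates(static_cast<std::uint16_t>(cp), *start++);
        octets += sketch_utf8_length(cp);
    }
    return octets;
}

// Mirrors the example of use above.
inline void sketch_utf16to8_demo() {
    const std::uint16_t utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    assert(sketch_combine_surrogates(0xd834, 0xdd1e) == 0x1d11e);
    assert(sketch_utf16to8_size(utf16string, utf16string + 5) == 10);
}
```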
1300 | Available in version 1.0 and later. 1301 |
1302 |1303 | Converts a UTF-8 encoded string to UTF-16. 1304 |
1305 |
1306 | template <typename u16bit_iterator, typename octet_iterator>
1308 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
1309 |
1310 |
1311 |
1312 | start
: an iterator pointing to the beginning of the UTF-8 encoded
1313 | string to convert.
end
: an iterator pointing to
1314 | one past the end of the UTF-8 encoded string to convert.
1315 | result
: an output iterator to the place in the UTF-16 string where to
1316 | append the result of conversion.
1317 | Return value: An iterator pointing to the place
1318 | after the appended UTF-16 string.
1319 |
1321 | Example of use: 1322 |
1323 |
1324 | char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
1326 | vector <unsigned short> utf16result;
1327 | unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
1329 | assert (utf16result.size() == 4);
1330 | assert (utf16result[2] == 0xd834);
1332 | assert (utf16result[3] == 0xdd1e);
1334 |
1335 |
1336 | This is a faster but less safe version of utf8::utf8to16
. It does not
1337 | check for validity of the supplied UTF-8 sequence.
1338 |
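The last two assertions in the example above check a surrogate pair: the four-octet sequence `\xf0\x9d\x84\x9e` decodes to the code point 0x1D11E, which does not fit in one 16-bit unit. Here is a minimal sketch of that splitting step, assuming the code point has already been decoded; `sketch_append_utf16` is a hypothetical name and performs no range checks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends one code point to a UTF-16 buffer.
// Code points above 0xFFFF are split into a lead and a trail surrogate.
// No validation: the caller is trusted to pass a legal code point.
inline void sketch_append_utf16(std::uint32_t cp, std::vector<std::uint16_t>& out) {
    if (cp > 0xffff) {
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xd800 + (cp >> 10)));    // lead surrogate
        out.push_back(static_cast<std::uint16_t>(0xdc00 + (cp & 0x3ff)));  // trail surrogate
    } else {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
}

// Uses the three code points encoded in the example string above.
inline void sketch_utf8to16_demo() {
    std::vector<std::uint16_t> utf16result;
    sketch_append_utf16(0x65e5, utf16result);
    sketch_append_utf16(0x0448, utf16result);
    sketch_append_utf16(0x1d11e, utf16result);
    assert(utf16result.size() == 4);
    assert(utf16result[2] == 0xd834);
    assert(utf16result[3] == 0xdd1e);
}
```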
1343 | Available in version 1.0 and later. 1344 |
1345 |1346 | Converts a UTF-32 encoded string to UTF-8. 1347 |
1348 |
1349 | template <typename octet_iterator, typename u32bit_iterator>
1352 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
1353 |
1354 |
1355 |
1356 | start
: an iterator pointing to the beginning of the UTF-32 encoded
1357 | string to convert.
1358 | end
: an iterator pointing one past the end of the UTF-32 encoded
1359 | string to convert.
1360 | result
: an output iterator to the place in the UTF-8 string where to
1361 | append the result of conversion.
1362 | Return value: An iterator pointing to the place
1363 | after the appended UTF-8 string.
1364 |
1366 | Example of use: 1367 |
1368 |
1369 | int utf32string[] = {0x448, 0x65e5, 0x10346, 0};
1372 | vector<unsigned char> utf8result;
1373 | unchecked::utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
1375 | assert (utf8result.size() == 9);
1376 |
1377 |
1378 | This is a faster but less safe version of utf8::utf32to8
. It does not
1379 | check for validity of the supplied UTF-32 sequence.
1380 |
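To show where the size of 9 in the example above comes from, here is a sketch of encoding a single code point as UTF-8 without any validity checks. `sketch_append_utf8` is a hypothetical name; the bit layout is the standard UTF-8 one, but the code is only an illustration and not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the UTF-8 encoding of `cp` to `out`.
// No validation: surrogates or values above 0x10FFFF are encoded blindly.
inline void sketch_append_utf8(std::uint32_t cp, std::vector<unsigned char>& out) {
    if (cp < 0x80) {                                  // 1 octet
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {                          // 2 octets
        out.push_back(static_cast<unsigned char>((cp >> 6) | 0xc0));
        out.push_back(static_cast<unsigned char>((cp & 0x3f) | 0x80));
    } else if (cp < 0x10000) {                        // 3 octets
        out.push_back(static_cast<unsigned char>((cp >> 12) | 0xe0));
        out.push_back(static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80));
        out.push_back(static_cast<unsigned char>((cp & 0x3f) | 0x80));
    } else {                                          // 4 octets
        out.push_back(static_cast<unsigned char>((cp >> 18) | 0xf0));
        out.push_back(static_cast<unsigned char>(((cp >> 12) & 0x3f) | 0x80));
        out.push_back(static_cast<unsigned char>(((cp >> 6) & 0x3f) | 0x80));
        out.push_back(static_cast<unsigned char>((cp & 0x3f) | 0x80));
    }
}

// Mirrors the example of use above: 2 + 3 + 4 octets.
inline void sketch_utf32to8_demo() {
    const std::uint32_t utf32string[] = {0x448, 0x65e5, 0x10346, 0};
    std::vector<unsigned char> utf8result;
    for (int i = 0; i < 3; ++i)
        sketch_append_utf8(utf32string[i], utf8result);
    assert(utf8result.size() == 9);
}
```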
1385 | Available in version 1.0 and later. 1386 |
1387 |1388 | Converts a UTF-8 encoded string to UTF-32. 1389 |
1390 |
1391 | template <typename octet_iterator, typename u32bit_iterator>
1393 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
1394 |
1395 |
1396 |
1397 | start
: an iterator pointing to the beginning of the UTF-8 encoded
1398 | string to convert.
1399 | end
: an iterator pointing one past the end of the UTF-8 encoded string
1400 | to convert.
1401 | result
: an output iterator to the place in the UTF-32 string where to
1402 | append the result of conversion.
1403 | Return value: An iterator pointing to the place
1404 | after the appended UTF-32 string.
1405 |
1407 | Example of use: 1408 |
1409 |
1410 | const char* twochars = "\xe6\x97\xa5\xd1\x88";
1412 | vector<int> utf32result;
1413 | unchecked::utf8to32(twochars, twochars + 5, back_inserter(utf32result));
1415 | assert (utf32result.size() == 2);
1416 |
1417 |
1418 | This is a faster but less safe version of utf8::utf8to32
. It does not
1419 | check for validity of the supplied UTF-8 sequence.
1420 |
1428 | Available in version 2.0 and later. 1429 |
1430 |1431 | Adapts the underlying octet iterator to iterate over the sequence of code points, 1432 | rather than raw octets. 1433 |
1434 |
1435 | template <typename octet_iterator>
1436 | class iterator;
1437 |
1438 |
1439 |
iterator();
octet_iterator
is
1442 | constructed with its default constructor.
1443 | explicit iterator (const octet_iterator& octet_it);
1444 |
octet_iterator
with octet_it
1446 | octet_iterator base () const;
octet_iterator
.
1448 | uint32_t operator * () const;
octet_iterator
is pointing to and returns the code point.
1450 | bool operator == (const iterator& rhs)
1451 | const;
bool operator != (const iterator& rhs)
1454 | const;
iterator& operator ++ ();
iterator operator ++ (int);
iterator& operator -- ();
iterator operator -- (int);
1466 | Example of use: 1467 |
1468 |
1469 | const char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
1470 | utf8::unchecked::iterator<const char*> un_it(threechars);
1471 | utf8::unchecked::iterator<const char*> un_it2 = un_it;
1472 | assert (un_it2 == un_it);
1473 | assert (*un_it == 0x10346);
1474 | assert (*(++un_it) == 0x65e5);
1475 | assert ((*un_it++) == 0x65e5);
1476 | assert (*un_it == 0x0448);
1477 | assert (un_it != un_it2);
1478 | utf8::unchecked::iterator<const char*> un_endit (threechars + 9);
1479 | assert (++un_it == un_endit);
1480 | assert (*(--un_it) == 0x0448);
1481 | assert ((*un_it--) == 0x0448);
1482 | assert (*un_it == 0x65e5);
1483 | assert (--un_it == utf8::unchecked::iterator<const char*>(threechars));
1484 | assert (*un_it == 0x10346);
1485 |
1486 |
1487 | This is an unchecked version of utf8::iterator
. It is faster in many cases, but offers
1488 | no validity or range checks.
1489 |
1497 | The library was designed to be: 1498 |
1499 |1523 | In case you want to look into other means of working with UTF-8 strings from C++, 1524 | here is the list of solutions I am aware of: 1525 |
1526 |std::string
. If you prefer to have yet another string class in your
1538 | code, it may be worth a look. Be aware of the licensing issues, though.
1539 | 1551 | Until Unicode becomes officially recognized by the C++ Standard Library, we need to 1552 | use other means to work with UTF-8 strings. Template functions I describe in this 1553 | article may be a good step in this direction. 1554 |
1555 |