├── doc
│   ├── ReleaseNotes
│   └── utf8cpp.html
└── source
    ├── utf8.h
    └── utf8
        ├── checked.h
        ├── core.h
        └── unchecked.h

/doc/ReleaseNotes:
--------------------------------------------------------------------------------
utf8 cpp library
Release 2.1

This is a minor feature release - added the function peek_next.

Changes from version 2.0
- Implemented feature request [ 1770746 ] "Provide a const version of next() (some sort of a peek() )"

Files included in the release: utf8.h, core.h, checked.h, unchecked.h, utf8cpp.html, ReleaseNotes

--------------------------------------------------------------------------------
/doc/utf8cpp.html:
--------------------------------------------------------------------------------

UTF8-CPP: UTF-8 with C++ in a Portable Way

The Sourceforge project page

Introduction

Many C++ developers miss an easy and portable way of handling Unicode encoded strings. The C++ Standard is currently Unicode agnostic, and while some work is being done to introduce Unicode to the next incarnation called C++0x, for the moment nothing of the sort is available. In the meantime, developers use third-party libraries like ICU, OS-specific capabilities, or simply roll out their own solutions.

In order to easily handle UTF-8 encoded Unicode strings, I have come up with a small generic library. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the license at the beginning of the utf8.h file. If you run into bugs or performance issues, please let me know and I'll do my best to address them.

The purpose of this article is not to offer an introduction to Unicode in general, and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the Unicode Home Page or some other source of information for Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.

Examples of use

To illustrate the use of this utf8 library, we shall open a file containing UTF-8 encoded text, check whether it starts with a byte order mark, read each line into a std::string, check it for validity, convert the text to UTF-16, and back to UTF-8:

    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>
    #include "utf8.h"
    using namespace std;

    int main(int argc, char** argv)
    {
        if (argc != 2) {
            cout << "\nUsage: docsample filename\n";
            return 0;
        }
        const char* test_file_path = argv[1];
        // Open the test file (must be UTF-8 encoded)
        ifstream fs8(test_file_path);
        if (!fs8.is_open()) {
            cout << "Could not open " << test_file_path << endl;
            return 0;
        }
        // Read the first line of the file
        unsigned line_count = 1;
        string line;
        if (!getline(fs8, line))
            return 0;
        // Look for utf-8 byte-order mark at the beginning
        if (line.size() > 2) {
            if (utf8::is_bom(line.c_str()))
                cout << "There is a byte order mark at the beginning of the file\n";
        }
        // Play with all the lines in the file
        do {
            // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
            string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
            if (end_it != line.end()) {
                cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
                cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
            }
            // Get the line length (at least for the valid part)
            int length = utf8::distance(line.begin(), end_it);
            cout << "Length of line " << line_count << " is " << length << "\n";
            // Convert it to utf-16
            vector<unsigned short> utf16line;
            utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
            // And back to utf-8
            string utf8line;
            utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
            // Confirm that the conversion went OK:
            if (utf8line != string(line.begin(), end_it))
                cout << "Error in UTF-16 conversion at line: " << line_count << "\n";
            line_count++;
        } while (getline(fs8, line)); // read the next line; stop when none is left
        return 0;
    }

In the previous code sample, we have seen the use of the following functions from the utf8 namespace: first we used the is_bom function to detect a UTF-8 byte order mark at the beginning of the file; then for each line we detected invalid UTF-8 sequences with find_invalid; the number of characters (more precisely - the number of Unicode code points) in each line was determined with utf8::distance; finally, we converted each line to UTF-16 encoding with utf8to16 and back to UTF-8 with utf16to8.

Reference

Functions From utf8 Namespace

utf8::append

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

    template <typename octet_iterator>
    octet_iterator append(uint32_t cp, octet_iterator result);

cp: A 32 bit integer representing a code point to append to the sequence.
result: An output iterator to the place in the sequence where to append the code point.
Return value: An iterator pointing to the place after the newly appended sequence.

Example of use:

    unsigned char u[5] = {0,0,0,0,0};
    unsigned char* end = append(0x0448, u);
    assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);

Note that append does not allocate any memory - it is the caller's responsibility to make sure there is enough memory allocated for the operation. To make things more interesting, append can add anywhere between 1 and 4 octets to the sequence. In practice, you would most often want to use std::back_inserter to ensure that the necessary memory is allocated.
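To make the "between 1 and 4 octets" remark concrete, here is a minimal sketch of how a code point maps onto UTF-8 octets when written through an output iterator. The names encode_utf8_sketch and encode_to_string_sketch are hypothetical and this is not the library's implementation; it only assumes the standard UTF-8 bit layout and a valid code point as input.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Illustrative sketch only (hypothetical name, not part of the utf8 library).
// Writes the UTF-8 encoding of cp through out and returns the advanced iterator.
template <typename OutputIt>
OutputIt encode_utf8_sketch(std::uint32_t cp, OutputIt out)
{
    // Assumption of this sketch: cp is at most U+10FFFF and not a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                 // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convenience wrapper: std::back_inserter grows the string as octets are written,
// so the caller does not have to reserve 1 to 4 octets up front.
inline std::string encode_to_string_sketch(std::uint32_t cp)
{
    std::string s;
    encode_utf8_sketch(cp, std::back_inserter(s));
    return s;
}
```

With a raw array as the destination, the caller would have to guarantee room for up to four octets; with std::back_inserter the container handles that.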

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.

utf8::next

Available in version 1.0 and later.

Given the iterator to the beginning of the UTF-8 sequence, it returns the code point and moves the iterator to the next position.

    template <typename octet_iterator>
    uint32_t next(octet_iterator& it, octet_iterator end);

it: a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
end: end of the UTF-8 sequence to be processed. If it reaches end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-8 code point.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    const char* w = twochars;
    int cp = next(w, twochars + 6);
    assert (cp == 0x65e5);
    assert (w == twochars + 3);

This function is typically used to iterate through a UTF-8 encoded string.

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.
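The decode-and-advance pattern that next provides can be sketched by hand as follows. The names decode_next_sketch and code_points_sketch are hypothetical, and the sketch is deliberately weaker than the checked function described above: it checks bounds and continuation octets, but does not reject overlong forms or surrogates, and it throws standard exceptions instead of the library's own.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative sketch only (hypothetical name, not the library's code).
// Decodes the code point starting at it and moves it to the next lead octet.
inline std::uint32_t decode_next_sketch(std::string::const_iterator& it,
                                        std::string::const_iterator end)
{
    if (it == end) throw std::out_of_range("empty range");
    const unsigned char lead = static_cast<unsigned char>(*it);
    std::uint32_t cp = 0;
    int trailing = 0;
    if (lead < 0x80)                { cp = lead;        trailing = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
    else throw std::invalid_argument("not a lead octet");
    ++it;
    for (int i = 0; i < trailing; ++i, ++it) {
        if (it == end) throw std::out_of_range("truncated sequence");
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) throw std::invalid_argument("expected continuation octet");
        cp = (cp << 6) | (c & 0x3F);   // append the 6 payload bits
    }
    return cp;
}

// Typical iteration loop: keep decoding until the end of the string is reached.
inline std::vector<std::uint32_t> code_points_sketch(const std::string& s)
{
    std::vector<std::uint32_t> result;
    std::string::const_iterator it = s.begin();
    while (it != s.end())
        result.push_back(decode_next_sketch(it, s.end()));
    return result;
}
```

The loop in code_points_sketch has the same shape as a loop written with utf8::next: the iterator is passed by reference and the end of the range is supplied on every call.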

utf8::peek_next

Available in version 2.1 and later.

Given the iterator to the beginning of the UTF-8 sequence, it returns the code point for the following sequence without changing the value of the iterator.

    template <typename octet_iterator>
    uint32_t peek_next(octet_iterator it, octet_iterator end);

it: an iterator pointing to the beginning of a UTF-8 encoded code point.
end: end of the UTF-8 sequence to be processed. If it reaches end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-8 code point.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    const char* w = twochars;
    int cp = peek_next(w, twochars + 6);
    assert (cp == 0x65e5);
    assert (w == twochars);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

utf8::prior

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

    template <typename octet_iterator>
    uint32_t prior(octet_iterator& it, octet_iterator start);

it: a reference pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
start: an iterator to the beginning of the sequence where the search for the beginning of a code point is performed. It is a safety measure to prevent passing the beginning of the string in the search for a UTF-8 lead octet.
Return value: the 32 bit representation of the previous code point.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    const char* w = twochars + 3;
    int cp = prior (w, twochars);
    assert (cp == 0x65e5);
    assert (w == twochars);

This function has two purposes: one is to iterate backwards through a UTF-8 encoded string. Note that it is usually a better idea to iterate forward instead, since utf8::next is faster. The second purpose is to find the beginning of a UTF-8 sequence if we have a random position within a string.

it will typically point to the beginning of a code point, and start will point to the beginning of the string to ensure we don't go backwards too far. it is decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence beginning with that octet is decoded to a 32 bit representation and returned.

In case start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.
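The backward search for a lead octet described above relies on the fact that UTF-8 continuation octets always have the bit pattern 10xxxxxx. A minimal sketch of that search, with hypothetical names (is_trail_sketch, step_back_sketch) and standard exceptions standing in for the library's own, might look like this. It only locates the lead octet; it does not decode or validate the sequence.

```cpp
#include <stdexcept>
#include <string>

// Illustrative sketch only (hypothetical names, not the library's code).
// True for octets of the form 10xxxxxx, which never start a UTF-8 sequence.
inline bool is_trail_sketch(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Returns the position of the nearest lead octet strictly before it,
// never moving before start.
inline std::string::const_iterator step_back_sketch(std::string::const_iterator it,
                                                    std::string::const_iterator start)
{
    if (it == start) throw std::out_of_range("already at the start");
    --it;
    while (is_trail_sketch(*it)) {
        if (it == start) throw std::invalid_argument("no lead octet found");
        --it;
    }
    return it;
}
```

Called with a position in the middle of a multi-octet sequence, the same function finds the start of that sequence, which corresponds to the second purpose mentioned above.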

utf8::previous

Deprecated in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

    template <typename octet_iterator>
    uint32_t previous(octet_iterator& it, octet_iterator pass_start);

it: a reference pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
pass_start: an iterator to the point in the sequence where the search for the beginning of a code point is aborted if no result was reached. It is a safety measure to prevent passing the beginning of the string in the search for a UTF-8 lead octet.
Return value: the 32 bit representation of the previous code point.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    const char* w = twochars + 3;
    int cp = previous (w, twochars - 1);
    assert (cp == 0x65e5);
    assert (w == twochars);

utf8::previous is deprecated, and utf8::prior should be used instead, although existing code can continue using this function. The problem is the parameter pass_start, which points to the position just before the beginning of the sequence. Standard containers don't have the concept of "pass start", so the function cannot be used with their iterators.

it will typically point to the beginning of a code point, and pass_start will point to the octet just before the beginning of the string to ensure we don't go backwards too far. it is decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence beginning with that octet is decoded to a 32 bit representation and returned.

In case pass_start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.

utf8::advance

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within a UTF-8 sequence.

    template <typename octet_iterator, typename distance_type>
    void advance (octet_iterator& it, distance_type n, octet_iterator end);

it: a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
n: a positive integer that shows how many code points we want to advance.
end: end of the UTF-8 sequence to be processed. If it reaches end during the extraction of a code point, a utf8::not_enough_room exception is thrown.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    const char* w = twochars;
    advance (w, 2, twochars + 6);
    assert (w == twochars + 5);

This function works only "forward". In case of a negative n, there is no effect.

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.
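As a rough illustration of the forward-only behaviour, the following sketch moves an iterator by n code points by stepping over one lead octet and its continuation octets at a time. The name advance_sketch is hypothetical; the sketch assumes well-formed input and uses std::out_of_range in place of the library's exception.

```cpp
#include <stdexcept>
#include <string>

// Illustrative sketch only (hypothetical name, not the library's code).
// Moves it forward by n code points, assuming well-formed UTF-8.
// A zero or negative n leaves it unchanged, because the loop body never runs.
inline void advance_sketch(std::string::const_iterator& it, long n,
                           std::string::const_iterator end)
{
    for (long i = 0; i < n; ++i) {
        if (it == end) throw std::out_of_range("not enough room");
        ++it;   // step over the lead octet
        while (it != end && (static_cast<unsigned char>(*it) & 0xC0) == 0x80)
            ++it;   // step over continuation octets (10xxxxxx)
    }
}
```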

utf8::distance

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.

    template <typename octet_iterator>
    typename std::iterator_traits<octet_iterator>::difference_type
    distance (octet_iterator first, octet_iterator last);

first: an iterator to the beginning of a UTF-8 encoded code point.
last: an iterator to a "post-end" of the last UTF-8 encoded code point in the sequence whose length we are trying to determine. It can be the beginning of a new code point, or not.
Return value: the distance between the iterators, in code points.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    size_t dist = utf8::distance(twochars, twochars + 5);
    assert (dist == 2);

This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called distance, rather than, say, length is mainly because developers are used to length being an O(1) operation. Computing the length of a UTF-8 string is a linear operation, and it looked better to model it after the std::distance algorithm.

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If last does not point one past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
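The linear cost mentioned above is easy to see in a hand-written count: every octet has to be inspected once. The sketch below (hypothetical name count_code_points_sketch) counts code points by counting octets that are not continuation octets. Unlike utf8::distance it performs no validation, so it is only meaningful for input already known to be well-formed.

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch only (hypothetical name, not the library's code).
// Counts code points in well-formed UTF-8 by counting every octet that is
// not a continuation octet (10xxxxxx). One pass over the data, hence O(n).
inline std::size_t count_code_points_sketch(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```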

utf8::utf16to8

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

    template <typename u16bit_iterator, typename octet_iterator>
    octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);

start: an iterator pointing to the beginning of the UTF-16 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-16 encoded string to convert.
result: an output iterator to the place in the UTF-8 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-8 string.

Example of use:

    unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    vector<unsigned char> utf8result;
    utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
    assert (utf8result.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.
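The example above yields 10 octets because the last two UTF-16 units, 0xd834 and 0xdd1e, form one surrogate pair that stands for a single code point outside the Basic Multilingual Plane, which takes 4 octets in UTF-8 (1 + 2 + 3 + 4 = 10). The arithmetic for combining a pair can be sketched as follows; the function names are hypothetical and this is not taken from the library's source.

```cpp
#include <cstdint>

// Illustrative sketch only (hypothetical names, not the library's code).
inline bool is_lead_surrogate_sketch(std::uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail_surrogate_sketch(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a lead/trail surrogate pair into the code point it represents.
// Assumes the caller has already checked both units with the predicates above.
inline std::uint32_t combine_surrogates_sketch(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00u);
}
```

A lead surrogate that is not followed by a trail surrogate, or a trail surrogate on its own, is the kind of input for which the checked function reports utf8::invalid_utf16.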

utf8::utf8to16

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-16.

    template <typename u16bit_iterator, typename octet_iterator>
    u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);

start: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result: an output iterator to the place in the UTF-16 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-16 string.

Example of use:

    char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
    vector <unsigned short> utf16result;
    utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
    assert (utf16result.size() == 4);
    assert (utf16result[2] == 0xd834);
    assert (utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point one past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
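In the example above, the 4-octet sequence f0 9d 84 9e decodes to U+1D11E, which does not fit in one 16-bit unit and is therefore written as the surrogate pair 0xd834, 0xdd1e. A minimal sketch of that split, under the assumption that the input is a code point in the range U+10000 to U+10FFFF (the function name is hypothetical and not part of the library):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Illustrative sketch only (hypothetical name, not the library's code).
// Splits a supplementary-plane code point into {lead, trail} UTF-16 surrogates.
inline std::pair<std::uint16_t, std::uint16_t> split_into_surrogates_sketch(std::uint32_t cp)
{
    // Assumption of this sketch: code points below 0x10000 need no surrogates
    // and are not handled here.
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;   // 20 significant bits remain
    return std::make_pair(static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // top 10 bits
                          static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low 10 bits
}
```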

utf8::utf32to8

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

    template <typename octet_iterator, typename u32bit_iterator>
    octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);

start: an iterator pointing to the beginning of the UTF-32 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-32 encoded string to convert.
result: an output iterator to the place in the UTF-8 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-8 string.

Example of use:

    int utf32string[] = {0x448, 0x65E5, 0x10346, 0};
    vector<unsigned char> utf8result;
    utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
    assert (utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

utf8::utf8to32

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

    template <typename octet_iterator, typename u32bit_iterator>
    u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);

start: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result: an output iterator to the place in the UTF-32 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-32 string.

Example of use:

    const char* twochars = "\xe6\x97\xa5\xd1\x88";
    vector<int> utf32result;
    utf8to32(twochars, twochars + 5, back_inserter(utf32result));
    assert (utf32result.size() == 2);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point one past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

utf8::find_invalid

Available in version 1.0 and later.

Detects an invalid sequence within a UTF-8 string.

    template <typename octet_iterator>
    octet_iterator find_invalid(octet_iterator start, octet_iterator end);

start: an iterator pointing to the beginning of the UTF-8 string to test for validity.
end: an iterator pointing to past-the-end of the UTF-8 string to test for validity.
Return value: an iterator pointing to the first invalid octet in the UTF-8 string. In case none were found, equals end.

Example of use:

    char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
    char* invalid = find_invalid(utf_invalid, utf_invalid + 6);
    assert (invalid == utf_invalid + 5);

This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it before doing any of the unchecked operations on the string.

utf8::is_valid

Available in version 1.0 and later.

Checks whether a sequence of octets is a valid UTF-8 string.

    template <typename octet_iterator>
    bool is_valid(octet_iterator start, octet_iterator end);

start: an iterator pointing to the beginning of the UTF-8 string to test for validity.
end: an iterator pointing to past-the-end of the UTF-8 string to test for validity.
Return value: true if the sequence is a valid UTF-8 string; false if not.

Example of use:

    char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
    bool bvalid = is_valid(utf_invalid, utf_invalid + 6);
    assert (bvalid == false);

is_valid is a shorthand for find_invalid(start, end) == end;. You may want to use it to make sure that a byte sequence is a valid UTF-8 string without the need to know where it fails if it is not valid.

utf8::replace_invalid

Available in version 2.0 and later.

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

    template <typename octet_iterator, typename output_iterator>
    output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
    template <typename octet_iterator, typename output_iterator>
    output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);

start: an iterator pointing to the beginning of the UTF-8 string to look for invalid UTF-8 sequences.
end: an iterator pointing to past-the-end of the UTF-8 string to look for invalid UTF-8 sequences.
out: An output iterator to the range where the result of replacement is stored.
replacement: A Unicode code point for the replacement marker. The version without this parameter assumes the value 0xfffd.
Return value: An iterator pointing to the place after the UTF-8 string with replaced invalid sequences.

Example of use:

    char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
    vector<char> replace_invalid_result;
    replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
    bool bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
    assert (bvalid);
    const char* fixed_invalid_sequence = "a????z";
    assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));

replace_invalid does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, out must not be in the [start, end] range.

If end does not point one past the end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

utf8::is_bom

Available in version 1.0 and later.

Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM).

    template <typename octet_iterator>
    bool is_bom (octet_iterator it);

it: beginning of the 3-octet sequence to check.
Return value: true if the sequence is a UTF-8 byte order mark; false if not.

Example of use:

    unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf};
    bool bbom = is_bom(byte_order_mark);
    assert (bbom == true);

The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.
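The "check, then skip" step can be sketched with the standard library alone. The helpers below (bom_length_sketch and strip_bom_sketch, both hypothetical names) compare the first three octets against EF BB BF, the same octets used in the example above, and return the text without them. Note that the size check comes first, since reading three octets from a shorter string would be out of bounds.

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch only (hypothetical names, not the library's code).
// Returns 3 if s starts with the UTF-8 byte order mark EF BB BF, otherwise 0.
inline std::size_t bom_length_sketch(const std::string& s)
{
    const bool has_bom = s.size() >= 3 && s.compare(0, 3, "\xEF\xBB\xBF") == 0;
    return has_bom ? 3 : 0;
}

// Returns a copy of s with a leading byte order mark removed, if there is one.
inline std::string strip_bom_sketch(const std::string& s)
{
    return s.substr(bom_length_sketch(s));
}
```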

Types From utf8 Namespace

utf8::iterator

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.

    template <typename octet_iterator>
    class iterator;

Member functions

iterator();
    the default constructor; the underlying octet_iterator is constructed with its default constructor.
explicit iterator (const octet_iterator& octet_it, const octet_iterator& range_start, const octet_iterator& range_end);
    a constructor that initializes the underlying octet_iterator with octet_it and sets the range in which the iterator is considered valid.
octet_iterator base () const;
    returns the underlying octet_iterator.
uint32_t operator * () const;
    decodes the UTF-8 sequence the underlying octet_iterator is pointing to and returns the code point.
bool operator == (const iterator& rhs) const;
    returns true if the two underlying iterators are equal.
bool operator != (const iterator& rhs) const;
    returns true if the two underlying iterators are not equal.
iterator& operator ++ ();
    the prefix increment - moves the iterator to the next UTF-8 encoded code point.
iterator operator ++ (int);
    the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
iterator& operator -- ();
    the prefix decrement - moves the iterator to the previous UTF-8 encoded code point.
iterator operator -- (int);
    the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.

Example of use:

    const char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
    utf8::iterator<const char*> it(threechars, threechars, threechars + 9);
    utf8::iterator<const char*> it2 = it;
    assert (it2 == it);
    assert (*it == 0x10346);
    assert (*(++it) == 0x65e5);
    assert ((*it++) == 0x65e5);
    assert (*it == 0x0448);
    assert (it != it2);
    utf8::iterator<const char*> endit (threechars + 9, threechars, threechars + 9);
    assert (++it == endit);
    assert (*(--it) == 0x0448);
    assert ((*it--) == 0x0448);
    assert (*it == 0x65e5);
    assert (--it == utf8::iterator<const char*>(threechars, threechars, threechars + 9));
    assert (*it == 0x10346);

The purpose of the utf8::iterator adapter is to enable easy iteration as well as the use of STL algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of the utf8::next() and utf8::prior() functions.

Note that the utf8::iterator adapter is a checked iterator. It operates on the range specified in the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators require both iterator objects to be constructed against the same range - otherwise an exception is thrown. Typically, the range will be determined by the sequence container functions begin and end, i.e.:

    std::string s = "example";
    utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());

Functions From utf8::unchecked Namespace

utf8::unchecked::append

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

    template <typename octet_iterator>
    octet_iterator append(uint32_t cp, octet_iterator result);

cp: A 32 bit integer representing a code point to append to the sequence.
result: An output iterator to the place in the sequence where to append the code point.
Return value: An iterator pointing to the place after the newly appended sequence.

Example of use:

    unsigned char u[5] = {0,0,0,0,0};
    unsigned char* end = unchecked::append(0x0448, u);
    assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);

This is a faster but less safe version of utf8::append. It does not check for validity of the supplied code point, and may produce an invalid UTF-8 sequence.

1016 |

1017 | utf8::unchecked::next 1018 |

1019 |

1020 | Available in version 1.0 and later. 1021 |

1022 |

1023 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point 1024 | and moves the iterator to the next position. 1025 |

1026 |
1027 | template <typename octet_iterator>
1029 | uint32_t next(octet_iterator& it);
1030 |    
1031 | 
1032 |

1033 | it: a reference to an iterator pointing to the beginning of an UTF-8 1034 | encoded code point. After the function returns, it is incremented to point to the 1035 | beginning of the next code point.
1036 | Return value: the 32 bit representation of the 1037 | processed UTF-8 code point. 1038 |

1039 |

1040 | Example of use: 1041 |

1042 |
1043 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1045 | char* w = twochars;
1046 | int cp = unchecked::next(w);
1047 | assert (cp == 0x65e5);
1048 | assert (w == twochars + 3);
1049 | 
1050 |

1051 | This is a faster but less safe version of utf8::next. It does not 1052 | check for validity of the supplied UTF-8 sequence. 1053 |
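As an illustration of what decoding without validation involves, the following standalone sketch reads the lead octet, works out how many trail octets follow, and accumulates six bits from each. The name sketch_next is hypothetical; the code assumes valid input and is not the library's implementation.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the library: decodes the code point that
// starts at it and advances it past the sequence. The input is assumed to be
// valid UTF-8; nothing is checked, including the end of the buffer.
template <typename octet_iterator>
std::uint32_t sketch_next(octet_iterator& it)
{
    std::uint32_t cp = static_cast<unsigned char>(*it);
    int trail = 0;                                       // trail octets still to read
    if (cp < 0x80)             { trail = 0; }            // 0xxxxxxx
    else if ((cp >> 5) == 0x6) { trail = 1; cp &= 0x1f; } // 110xxxxx
    else if ((cp >> 4) == 0xe) { trail = 2; cp &= 0x0f; } // 1110xxxx
    else                       { trail = 3; cp &= 0x07; } // 11110xxx (assumed)
    for (int i = 0; i < trail; ++i) {
        ++it;
        // each trail octet contributes its low six bits
        cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3f);
    }
    ++it;                                                // step past the last octet
    return cp;
}
```

Because the last branch treats every other lead octet as the start of a four-octet sequence, malformed input is silently decoded into a meaningless value and the iterator may run past the end of the buffer.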

1054 |

1055 | utf8::unchecked::peek_next 1056 |

1057 |

1058 | Available in version 2.1 and later. 1059 |

1060 |

1061 | Given the iterator to the beginning of a UTF-8 sequence, it returns the code point. 1062 |

1063 |
1064 | template <typename octet_iterator>
1066 | uint32_t peek_next(octet_iterator it);
1067 |    
1068 | 
1069 |

it: an iterator pointing to the beginning of a UTF-8 encoded code point.
1072 | Return value: the 32 bit representation of the 1073 | processed UTF-8 code point. 1074 |

1075 |

1076 | Example of use: 1077 |

1078 |
1079 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1081 | char* w = twochars;
1082 | int cp = unchecked::peek_next(w);
1083 | assert (cp == 0x65e5);
1084 | assert (w == twochars);
1085 | 
1086 |

1087 | This is a faster but less safe version of utf8::peek_next. It does not 1088 | check for validity of the supplied UTF-8 sequence. 1089 |

1090 |

1091 | utf8::unchecked::prior 1092 |

1093 |

1094 | Available in version 1.02 and later. 1095 |

1096 |

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

1101 |
1102 | template <typename octet_iterator>
1104 | uint32_t prior(octet_iterator& it);
1105 |    
1106 | 
1107 |

it: a reference to an iterator pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
1111 | Return value: the 32 bit representation of the 1112 | previous code point. 1113 |

1114 |

1115 | Example of use: 1116 |

1117 |
1118 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1120 | char* w = twochars + 3;
1121 | int cp = unchecked::prior (w);
1122 | assert (cp == 0x65e5);
1123 | assert (w == twochars);
1124 | 
1125 |

1126 | This is a faster but less safe version of utf8::prior. It does not 1127 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. 1128 |
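The backward step can be sketched on its own: trail octets always have the bit pattern 10xxxxxx, so moving back until an octet does not match that pattern lands on a lead octet. The name sketch_step_back is hypothetical, and the sketch only repositions the iterator without decoding anything.

```cpp
// Hypothetical helper, not part of the library: moves it back to the lead
// octet of the previous code point. No boundary check is made, so the caller
// must guarantee that a lead octet exists before it.
template <typename octet_iterator>
void sketch_step_back(octet_iterator& it)
{
    do {
        --it;
    } while ((static_cast<unsigned char>(*it) >> 6) == 0x2); // still on a trail octet
}
```

If the buffer begins with trail octets, a loop like this walks off the front of it, which is the kind of failure the missing boundary check allows.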

1129 |

1130 | utf8::unchecked::previous (deprecated, see utf8::unchecked::prior) 1131 |

1132 |

1133 | Deprecated in version 1.02 and later. 1134 |

1135 |

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

1140 |
1141 | template <typename octet_iterator>
1143 | uint32_t previous(octet_iterator& it);
1144 |    
1145 | 
1146 |

it: a reference to an iterator pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
1150 | Return value: the 32 bit representation of the 1151 | previous code point. 1152 |

1153 |

1154 | Example of use: 1155 |

1156 |
1157 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1159 | char* w = twochars + 3;
1160 | int cp = unchecked::previous (w);
1161 | assert (cp == 0x65e5);
1162 | assert (w == twochars);
1163 | 
1164 |

This function is deprecated only for consistency with the "checked" versions, where prior should be used instead of previous. In fact, unchecked::previous behaves exactly the same as unchecked::prior.

1170 |

1171 | This is a faster but less safe version of utf8::previous. It does not 1172 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. 1173 |

1174 |

1175 | utf8::unchecked::advance 1176 |

1177 |

1178 | Available in version 1.0 and later. 1179 |

1180 |

Advances an iterator by the specified number of code points within a UTF-8 sequence.

1184 |
1185 | template <typename octet_iterator, typename distance_type>
1187 | void advance (octet_iterator& it, distance_type n);
1188 |    
1189 | 
1190 |

it: a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
1194 | n: a positive integer that shows how many code points we want to 1195 | advance.
1196 |

1197 |

1198 | Example of use: 1199 |

1200 |
1201 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1203 | char* w = twochars;
1204 | unchecked::advance (w, 2);
1205 | assert (w == twochars + 5);
1206 | 
1207 |

1208 | This function works only "forward". In case of a negative n, there is 1209 | no effect. 1210 |

1211 |

1212 | This is a faster but less safe version of utf8::advance. It does not 1213 | check for validity of the supplied UTF-8 sequence and offers no boundary checking. 1214 |

1215 |

1216 | utf8::unchecked::distance 1217 |

1218 |

1219 | Available in version 1.0 and later. 1220 |

1221 |

Given iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.

1225 |
1226 | template <typename octet_iterator>
1228 | typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
1230 | 
1231 |

1232 | first: an iterator to a beginning of a UTF-8 encoded code point.
last: an iterator to the position just past the last UTF-8 encoded code point of the sequence whose length is being determined. It can point to the beginning of another code point, or not.
Return value: the distance between the iterators, in code points.

1239 |

1240 | Example of use: 1241 |

1242 |
1243 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1245 | size_t dist = utf8::unchecked::distance(twochars, twochars + 5);
1247 | assert (dist == 2);
1248 | 
1249 |

1250 | This is a faster but less safe version of utf8::distance. It does not 1251 | check for validity of the supplied UTF-8 sequence. 1252 |
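For valid input, the number of code points equals the number of octets that are not trail octets. The sketch below uses that observation; it is a different technique from decoding each code point in turn, the name sketch_distance is hypothetical, and it is not the library's implementation.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the library: counts the code points in
// [first, last) by counting every octet that is not a trail octet (10xxxxxx).
// Assumes the range holds valid UTF-8.
template <typename octet_iterator>
std::size_t sketch_distance(octet_iterator first, octet_iterator last)
{
    std::size_t count = 0;
    for (; first != last; ++first)
        if ((static_cast<unsigned char>(*first) >> 6) != 0x2)
            ++count;
    return count;
}
```

If last falls in the middle of a code point, the sketch still counts that partial code point, because its lead octet lies inside the range.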

1253 |

1254 | utf8::unchecked::utf16to8 1255 |

1256 |

1257 | Available in version 1.0 and later. 1258 |

1259 |

1260 | Converts a UTF-16 encoded string to UTF-8. 1261 |

1262 |
1263 | template <typename u16bit_iterator, typename octet_iterator>
1266 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
1267 |    
1268 | 
1269 |

1270 | start: an iterator pointing to the beginning of the UTF-16 encoded 1271 | string to convert.
end: an iterator pointing to past-the-end of the UTF-16 encoded string to convert.
1274 | result: an output iterator to the place in the UTF-8 string where to 1275 | append the result of conversion.
1276 | Return value: An iterator pointing to the place 1277 | after the appended UTF-8 string. 1278 |

1279 |

1280 | Example of use: 1281 |

1282 |
1283 | unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
1287 | vector<unsigned char> utf8result;
1288 | unchecked::utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
1290 | assert (utf8result.size() == 10);    
1291 | 
1292 |

1293 | This is a faster but less safe version of utf8::utf16to8. It does not 1294 | check for validity of the supplied UTF-16 sequence. 1295 |
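The only non-trivial step in going from UTF-16 to UTF-8 is turning a surrogate pair into a single code point before encoding it. A minimal sketch of that arithmetic follows; the name sketch_combine_surrogates is hypothetical and the inputs are assumed to be a well-formed lead and trail surrogate.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the library: combines a lead (high)
// surrogate in 0xd800-0xdbff and a trail (low) surrogate in 0xdc00-0xdfff
// into a code point. No check is made that the inputs are in those ranges.
inline std::uint32_t sketch_combine_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead) - 0xd800u) << 10)  // high ten bits
         + (static_cast<std::uint32_t>(trail) - 0xdc00u);        // low ten bits
}
```

With the pair 0xd834, 0xdd1e from the example above, the result is the code point 0x1d11e, which takes four octets in UTF-8.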

1296 |

1297 | utf8::unchecked::utf8to16 1298 |

1299 |

1300 | Available in version 1.0 and later. 1301 |

1302 |

Converts a UTF-8 encoded string to UTF-16.

1305 |
1306 | template <typename u16bit_iterator, typename octet_iterator>
1308 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
1309 |    
1310 | 
1311 |

start: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
1315 | result: an output iterator to the place in the UTF-16 string where to 1316 | append the result of conversion.
1317 | Return value: An iterator pointing to the place 1318 | after the appended UTF-16 string. 1319 |

1320 |

1321 | Example of use: 1322 |

1323 |
1324 | char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
1326 | vector <unsigned short> utf16result;
1327 | unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
1329 | assert (utf16result.size() == 4);
1330 | assert (utf16result[2] == 0xd834);
1332 | assert (utf16result[3] == 0xdd1e);
1334 | 
1335 |

1336 | This is a faster but less safe version of utf8::utf8to16. It does not 1337 | check for validity of the supplied UTF-8 sequence. 1338 |
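In the other direction, a code point above 0xffff has to be split into a surrogate pair. The following sketch shows that split in isolation; the name sketch_split_to_surrogates is hypothetical and the argument is assumed to lie between 0x10000 and 0x10ffff.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper, not part of the library: splits a code point in the
// range 0x10000-0x10ffff into a (lead, trail) surrogate pair. The range is
// assumed, not checked.
inline std::pair<std::uint16_t, std::uint16_t>
sketch_split_to_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000u;   // twenty significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xd800u + (v >> 10)),      // high ten bits
        static_cast<std::uint16_t>(0xdc00u + (v & 0x3ffu)));  // low ten bits
}
```

For the code point 0x1d11e this yields 0xd834 and 0xdd1e, the two values asserted at indices 2 and 3 in the example above.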

1339 |

1340 | utf8::unchecked::utf32to8 1341 |

1342 |

1343 | Available in version 1.0 and later. 1344 |

1345 |

1346 | Converts a UTF-32 encoded string to UTF-8. 1347 |

1348 |
1349 | template <typename octet_iterator, typename u32bit_iterator>
1352 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
1353 |    
1354 | 
1355 |

1356 | start: an iterator pointing to the beginning of the UTF-32 encoded 1357 | string to convert.
end: an iterator pointing to past-the-end of the UTF-32 encoded string to convert.
1360 | result: an output iterator to the place in the UTF-8 string where to 1361 | append the result of conversion.
1362 | Return value: An iterator pointing to the place 1363 | after the appended UTF-8 string. 1364 |

1365 |

1366 | Example of use: 1367 |

1368 |
int utf32string[] = {0x448, 0x65e5, 0x10346, 0};
vector<unsigned char> utf8result;
unchecked::utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
assert (utf8result.size() == 9);
1376 | 
1377 |

1378 | This is a faster but less safe version of utf8::utf32to8. It does not 1379 | check for validity of the supplied UTF-32 sequence. 1380 |

1381 |

1382 | utf8::unchecked::utf8to32 1383 |

1384 |

1385 | Available in version 1.0 and later. 1386 |

1387 |

1388 | Converts a UTF-8 encoded string to UTF-32. 1389 |

1390 |
1391 | template <typename octet_iterator, typename u32bit_iterator>
1393 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
1394 |    
1395 | 
1396 |

1397 | start: an iterator pointing to the beginning of the UTF-8 encoded 1398 | string to convert.
end: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
1401 | result: an output iterator to the place in the UTF-32 string where to 1402 | append the result of conversion.
1403 | Return value: An iterator pointing to the place 1404 | after the appended UTF-32 string. 1405 |

1406 |

1407 | Example of use: 1408 |

1409 |
1410 | char* twochars = "\xe6\x97\xa5\xd1\x88";
1412 | vector<int> utf32result;
1413 | unchecked::utf8to32(twochars, twochars + 5, back_inserter(utf32result));
1415 | assert (utf32result.size() == 2);
1416 | 
1417 |

1418 | This is a faster but less safe version of utf8::utf8to32. It does not 1419 | check for validity of the supplied UTF-8 sequence. 1420 |

1421 |

1422 | Types From utf8::unchecked Namespace 1423 |

1424 |

utf8::unchecked::iterator

1427 |

1428 | Available in version 2.0 and later. 1429 |

1430 |

1431 | Adapts the underlying octet iterator to iterate over the sequence of code points, 1432 | rather than raw octets. 1433 |

1434 |
1435 | template <typename octet_iterator>
1436 | class iterator;
1437 | 
1438 | 1439 |
Member functions
1440 |
1441 |
iterator();
the default constructor; the underlying octet_iterator is constructed with its default constructor.
explicit iterator (const octet_iterator& octet_it);
a constructor that initializes the underlying octet_iterator with octet_it.
octet_iterator base () const;
returns the underlying octet_iterator.
uint32_t operator * () const;
decodes the UTF-8 sequence the underlying octet_iterator is pointing to and returns the code point.
bool operator == (const iterator& rhs) const;
returns true if the two underlying iterators are equal.
bool operator != (const iterator& rhs) const;
returns true if the two underlying iterators are not equal.
iterator& operator ++ ();
the prefix increment - moves the iterator to the next UTF-8 encoded code point.
iterator operator ++ (int);
the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
iterator& operator -- ();
the prefix decrement - moves the iterator to the previous UTF-8 encoded code point.
iterator operator -- (int);
the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
1465 |

1466 | Example of use: 1467 |

1468 |
char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::unchecked::iterator<char*> un_it(threechars);
utf8::unchecked::iterator<char*> un_it2 = un_it;
assert (un_it2 == un_it);
assert (*un_it == 0x10346);
assert (*(++un_it) == 0x65e5);
assert ((*un_it++) == 0x65e5);
assert (*un_it == 0x0448);
assert (un_it != un_it2);
utf8::unchecked::iterator<char*> un_endit (threechars + 9);
assert (++un_it == un_endit);
assert (*(--un_it) == 0x0448);
assert ((*un_it--) == 0x0448);
assert (*un_it == 0x65e5);
assert (--un_it == utf8::unchecked::iterator<char*>(threechars));
assert (*un_it == 0x10346);
1485 | 
1486 |

1487 | This is an unchecked version of utf8::iterator. It is faster in many cases, but offers 1488 | no validity or range checks. 1489 |
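To show the idea of adapting an octet sequence into a sequence of code points, here is a small standalone sketch of such an adaptor over a plain character pointer. The class name sketch_code_point_iterator is hypothetical; it assumes valid UTF-8, performs no range checks, and is not the library's class.

```cpp
#include <cstdint>

// Hypothetical adaptor, not the library's class: walks a buffer of valid
// UTF-8 one code point at a time. No validity or range checks are made.
class sketch_code_point_iterator {
    const unsigned char* p;

    // Sequence length implied by the lead octet (valid input assumed).
    static int length(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xe) return 3;
        return 4;
    }
public:
    explicit sketch_code_point_iterator(const char* s)
        : p(reinterpret_cast<const unsigned char*>(s)) {}

    // Decodes the code point at the current position without moving.
    std::uint32_t operator*() const
    {
        static const unsigned char lead_mask[] = {0x7f, 0x1f, 0x0f, 0x07};
        const int n = length(*p);
        std::uint32_t cp = *p & lead_mask[n - 1];   // payload bits of the lead octet
        for (int i = 1; i < n; ++i)
            cp = (cp << 6) | (p[i] & 0x3f);         // six bits per trail octet
        return cp;
    }
    // Moves forward by one whole code point.
    sketch_code_point_iterator& operator++() { p += length(*p); return *this; }
    // Moves back over trail octets (10xxxxxx) to the previous lead octet.
    sketch_code_point_iterator& operator--()
    {
        do { --p; } while ((*p >> 6) == 0x2);
        return *this;
    }
    bool operator==(const sketch_code_point_iterator& rhs) const { return p == rhs.p; }
    bool operator!=(const sketch_code_point_iterator& rhs) const { return p != rhs.p; }
};
```

The sketch omits the postfix operators and the iterator traits that a real adaptor would need in order to work with standard algorithms.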

1490 |

1491 | Points of interest 1492 |

1493 |

1494 | Design goals and decisions 1495 |

1496 |

1497 | The library was designed to be: 1498 |

1499 |
  1. Generic: for better or worse, there are many C++ string classes out there, and the library should work with as many of them as possible.
  2. Portable: the library should be portable across both platforms and compilers. The only non-portable code is a small section that declares unsigned integers of different sizes: three typedefs. They can be changed by the users of the library if they don't match their platform. The default setting should work for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives.
  3. Lightweight: follow the "pay only for what you use" guideline.
  4. Unintrusive: avoid forcing any particular design or even programming style on the user. This is a library, not a framework.

1520 | Alternatives 1521 |

1522 |

1523 | In case you want to look into other means of working with UTF-8 strings from C++, 1524 | here is the list of solutions I am aware of: 1525 |

1526 |
  1. ICU Library. It is very powerful, complete, feature-rich, mature, and widely used. It is also big, intrusive, non-generic, and doesn't play well with the Standard Library. I definitely recommend looking at ICU even if you don't plan to use it.
  2. Glib::ustring. A class specifically made to work with UTF-8 strings, and also feel like std::string. If you prefer to have yet another string class in your code, it may be worth a look. Be aware of the licensing issues, though.
  3. Platform dependent solutions: Windows and POSIX have functions to convert strings from one encoding to another. That is only a subset of what my library offers, but if that is all you need it may be good enough, especially given the fact that these functions are mature and tested in production.

1548 | Conclusion 1549 |

1550 |

Until Unicode becomes officially recognized by the C++ Standard Library, we need to use other means to work with UTF-8 strings. The template functions described in this article may be a good step in this direction.

1555 | 1558 |
  1. The Unicode Consortium.
  2. ICU Library.
  3. UTF-8 at Wikipedia.
  4. UTF-8 and Unicode FAQ for Unix/Linux.
1573 | 1574 | 1575 | -------------------------------------------------------------------------------- /source/utf8.h: -------------------------------------------------------------------------------- 1 | // Copyright 2006 Nemanja Trifunovic 2 | 3 | /* 4 | Permission is hereby granted, free of charge, to any person or organization 5 | obtaining a copy of the software and accompanying documentation covered by 6 | this license (the "Software") to use, reproduce, display, distribute, 7 | execute, and transmit the Software, and to prepare derivative works of the 8 | Software, and to permit third-parties to whom the Software is furnished to 9 | do so, all subject to the following: 10 | 11 | The copyright notices in the Software and this entire statement, including 12 | the above license grant, this restriction and the following disclaimer, 13 | must be included in all copies of the Software, in whole or in part, and 14 | all derivative works of the Software, unless such copies or derivative 15 | works are solely in the form of machine-executable object code generated by 16 | a source language processor. 17 | 18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 19 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 20 | FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT 21 | SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE 22 | FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, 23 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 24 | DEALINGS IN THE SOFTWARE. 
25 | */ 26 | 27 | 28 | #ifndef UTF8_FOR_CPP_2675DCD0_9480_4c0c_B92A_CC14C027B731 29 | #define UTF8_FOR_CPP_2675DCD0_9480_4c0c_B92A_CC14C027B731 30 | 31 | #include "utf8/checked.h" 32 | #include "utf8/unchecked.h" 33 | 34 | #endif // header guard 35 | -------------------------------------------------------------------------------- /source/utf8/checked.h: -------------------------------------------------------------------------------- 1 | // Copyright 2006 Nemanja Trifunovic 2 | 3 | /* 4 | Permission is hereby granted, free of charge, to any person or organization 5 | obtaining a copy of the software and accompanying documentation covered by 6 | this license (the "Software") to use, reproduce, display, distribute, 7 | execute, and transmit the Software, and to prepare derivative works of the 8 | Software, and to permit third-parties to whom the Software is furnished to 9 | do so, all subject to the following: 10 | 11 | The copyright notices in the Software and this entire statement, including 12 | the above license grant, this restriction and the following disclaimer, 13 | must be included in all copies of the Software, in whole or in part, and 14 | all derivative works of the Software, unless such copies or derivative 15 | works are solely in the form of machine-executable object code generated by 16 | a source language processor. 17 | 18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 19 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 20 | FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT 21 | SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE 22 | FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, 23 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 24 | DEALINGS IN THE SOFTWARE. 
25 | */ 26 | 27 | 28 | #ifndef UTF8_FOR_CPP_CHECKED_H_2675DCD0_9480_4c0c_B92A_CC14C027B731 29 | #define UTF8_FOR_CPP_CHECKED_H_2675DCD0_9480_4c0c_B92A_CC14C027B731 30 | 31 | #include "core.h" 32 | #include 33 | 34 | namespace utf8 35 | { 36 | template 37 | octet_iterator append(uint32_t cp, octet_iterator result); 38 | 39 | // Exceptions that may be thrown from the library functions. 40 | class invalid_code_point : public std::exception { 41 | uint32_t cp; 42 | public: 43 | invalid_code_point(uint32_t _cp) : cp(_cp) {} 44 | virtual const char* what() const throw() { return "Invalid code point"; } 45 | uint32_t code_point() const {return cp;} 46 | }; 47 | 48 | class invalid_utf8 : public std::exception { 49 | uint8_t u8; 50 | public: 51 | invalid_utf8 (uint8_t u) : u8(u) {} 52 | virtual const char* what() const throw() { return "Invalid UTF-8"; } 53 | uint8_t utf8_octet() const {return u8;} 54 | }; 55 | 56 | class invalid_utf16 : public std::exception { 57 | uint16_t u16; 58 | public: 59 | invalid_utf16 (uint16_t u) : u16(u) {} 60 | virtual const char* what() const throw() { return "Invalid UTF-16"; } 61 | uint16_t utf16_word() const {return u16;} 62 | }; 63 | 64 | class not_enough_room : public std::exception { 65 | public: 66 | virtual const char* what() const throw() { return "Not enough space"; } 67 | }; 68 | 69 | /// The library API - functions intended to be called by the users 70 | 71 | template 72 | output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement) 73 | { 74 | while (start != end) { 75 | octet_iterator sequence_start = start; 76 | internal::utf_error err_code = internal::validate_next(start, end); 77 | switch (err_code) { 78 | case internal::OK : 79 | for (octet_iterator it = sequence_start; it != start; ++it) 80 | *out++ = *it; 81 | break; 82 | case internal::NOT_ENOUGH_ROOM: 83 | throw not_enough_room(); 84 | case internal::INVALID_LEAD: 85 | append (replacement, out); 86 | ++start; 87 | 
break; 88 | case internal::INCOMPLETE_SEQUENCE: 89 | case internal::OVERLONG_SEQUENCE: 90 | case internal::INVALID_CODE_POINT: 91 | append (replacement, out); 92 | ++start; 93 | // just one replacement mark for the sequence 94 | while (internal::is_trail(*start) && start != end) 95 | ++start; 96 | break; 97 | } 98 | } 99 | return out; 100 | } 101 | 102 | template 103 | inline output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out) 104 | { 105 | static const uint32_t replacement_marker = internal::mask16(0xfffd); 106 | return replace_invalid(start, end, out, replacement_marker); 107 | } 108 | 109 | template 110 | octet_iterator append(uint32_t cp, octet_iterator result) 111 | { 112 | if (!internal::is_code_point_valid(cp)) 113 | throw invalid_code_point(cp); 114 | 115 | if (cp < 0x80) // one octet 116 | *(result++) = static_cast(cp); 117 | else if (cp < 0x800) { // two octets 118 | *(result++) = static_cast((cp >> 6) | 0xc0); 119 | *(result++) = static_cast((cp & 0x3f) | 0x80); 120 | } 121 | else if (cp < 0x10000) { // three octets 122 | *(result++) = static_cast((cp >> 12) | 0xe0); 123 | *(result++) = static_cast(((cp >> 6) & 0x3f) | 0x80); 124 | *(result++) = static_cast((cp & 0x3f) | 0x80); 125 | } 126 | else if (cp <= internal::CODE_POINT_MAX) { // four octets 127 | *(result++) = static_cast((cp >> 18) | 0xf0); 128 | *(result++) = static_cast(((cp >> 12)& 0x3f) | 0x80); 129 | *(result++) = static_cast(((cp >> 6) & 0x3f) | 0x80); 130 | *(result++) = static_cast((cp & 0x3f) | 0x80); 131 | } 132 | else 133 | throw invalid_code_point(cp); 134 | 135 | return result; 136 | } 137 | 138 | template 139 | uint32_t next(octet_iterator& it, octet_iterator end) 140 | { 141 | uint32_t cp = 0; 142 | internal::utf_error err_code = internal::validate_next(it, end, &cp); 143 | switch (err_code) { 144 | case internal::OK : 145 | break; 146 | case internal::NOT_ENOUGH_ROOM : 147 | throw not_enough_room(); 148 | case internal::INVALID_LEAD : 
149 | case internal::INCOMPLETE_SEQUENCE : 150 | case internal::OVERLONG_SEQUENCE : 151 | throw invalid_utf8(*it); 152 | case internal::INVALID_CODE_POINT : 153 | throw invalid_code_point(cp); 154 | } 155 | return cp; 156 | } 157 | 158 | template 159 | uint32_t peek_next(octet_iterator it, octet_iterator end) 160 | { 161 | return next(it, end); 162 | } 163 | 164 | template 165 | uint32_t prior(octet_iterator& it, octet_iterator start) 166 | { 167 | octet_iterator end = it; 168 | while (internal::is_trail(*(--it))) 169 | if (it < start) 170 | throw invalid_utf8(*it); // error - no lead byte in the sequence 171 | octet_iterator temp = it; 172 | return next(temp, end); 173 | } 174 | 175 | /// Deprecated in versions that include "prior" 176 | template 177 | uint32_t previous(octet_iterator& it, octet_iterator pass_start) 178 | { 179 | octet_iterator end = it; 180 | while (internal::is_trail(*(--it))) 181 | if (it == pass_start) 182 | throw invalid_utf8(*it); // error - no lead byte in the sequence 183 | octet_iterator temp = it; 184 | return next(temp, end); 185 | } 186 | 187 | template 188 | void advance (octet_iterator& it, distance_type n, octet_iterator end) 189 | { 190 | for (distance_type i = 0; i < n; ++i) 191 | next(it, end); 192 | } 193 | 194 | template 195 | typename std::iterator_traits::difference_type 196 | distance (octet_iterator first, octet_iterator last) 197 | { 198 | typename std::iterator_traits::difference_type dist; 199 | for (dist = 0; first < last; ++dist) 200 | next(first, last); 201 | return dist; 202 | } 203 | 204 | template 205 | octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result) 206 | { 207 | while (start != end) { 208 | uint32_t cp = internal::mask16(*start++); 209 | // Take care of surrogate pairs first 210 | if (internal::is_surrogate(cp)) { 211 | if (start != end) { 212 | uint32_t trail_surrogate = internal::mask16(*start++); 213 | if (trail_surrogate >= internal::TRAIL_SURROGATE_MIN && 
trail_surrogate <= internal::TRAIL_SURROGATE_MAX) 214 | cp = (cp << 10) + trail_surrogate + internal::SURROGATE_OFFSET; 215 | else 216 | throw invalid_utf16(static_cast(trail_surrogate)); 217 | } 218 | else 219 | throw invalid_utf16(static_cast(*start)); 220 | 221 | } 222 | result = append(cp, result); 223 | } 224 | return result; 225 | } 226 | 227 | template 228 | u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result) 229 | { 230 | while (start != end) { 231 | uint32_t cp = next(start, end); 232 | if (cp > 0xffff) { //make a surrogate pair 233 | *result++ = static_cast((cp >> 10) + internal::LEAD_OFFSET); 234 | *result++ = static_cast((cp & 0x3ff) + internal::TRAIL_SURROGATE_MIN); 235 | } 236 | else 237 | *result++ = static_cast(cp); 238 | } 239 | return result; 240 | } 241 | 242 | template 243 | octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result) 244 | { 245 | while (start != end) 246 | result = append(*(start++), result); 247 | 248 | return result; 249 | } 250 | 251 | template 252 | u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result) 253 | { 254 | while (start < end) 255 | (*result++) = next(start, end); 256 | 257 | return result; 258 | } 259 | 260 | // The iterator class 261 | template 262 | class iterator : public std::iterator { 263 | octet_iterator it; 264 | octet_iterator range_start; 265 | octet_iterator range_end; 266 | public: 267 | iterator () {} 268 | explicit iterator (const octet_iterator& octet_it, 269 | const octet_iterator& _range_start, 270 | const octet_iterator& _range_end) : 271 | it(octet_it), range_start(_range_start), range_end(_range_end) 272 | { 273 | if (it < range_start || it > range_end) 274 | throw std::out_of_range("Invalid utf-8 iterator position"); 275 | } 276 | // the default "big three" are OK 277 | octet_iterator base () const { return it; } 278 | uint32_t operator * () const 279 | { 280 | octet_iterator temp = 
it; 281 | return next(temp, range_end); 282 | } 283 | bool operator == (const iterator& rhs) const 284 | { 285 | if (range_start != rhs.range_start || range_end != rhs.range_end) 286 | throw std::logic_error("Comparing utf-8 iterators defined with different ranges"); 287 | return (it == rhs.it); 288 | } 289 | bool operator != (const iterator& rhs) const 290 | { 291 | return !(operator == (rhs)); 292 | } 293 | iterator& operator ++ () 294 | { 295 | next(it, range_end); 296 | return *this; 297 | } 298 | iterator operator ++ (int) 299 | { 300 | iterator temp = *this; 301 | next(it, range_end); 302 | return temp; 303 | } 304 | iterator& operator -- () 305 | { 306 | prior(it, range_start); 307 | return *this; 308 | } 309 | iterator operator -- (int) 310 | { 311 | iterator temp = *this; 312 | prior(it, range_start); 313 | return temp; 314 | } 315 | }; // class iterator 316 | 317 | } // namespace utf8 318 | 319 | #endif //header guard 320 | 321 | 322 | -------------------------------------------------------------------------------- /source/utf8/core.h: -------------------------------------------------------------------------------- 1 | // Copyright 2006 Nemanja Trifunovic 2 | 3 | /* 4 | Permission is hereby granted, free of charge, to any person or organization 5 | obtaining a copy of the software and accompanying documentation covered by 6 | this license (the "Software") to use, reproduce, display, distribute, 7 | execute, and transmit the Software, and to prepare derivative works of the 8 | Software, and to permit third-parties to whom the Software is furnished to 9 | do so, all subject to the following: 10 | 11 | The copyright notices in the Software and this entire statement, including 12 | the above license grant, this restriction and the following disclaimer, 13 | must be included in all copies of the Software, in whole or in part, and 14 | all derivative works of the Software, unless such copies or derivative 15 | works are solely in the form of machine-executable 
object code generated by 16 | a source language processor. 17 | 18 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 19 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 20 | FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT 21 | SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE 22 | FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, 23 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 24 | DEALINGS IN THE SOFTWARE. 25 | */ 26 | 27 | 28 | #ifndef UTF8_FOR_CPP_CORE_H_2675DCD0_9480_4c0c_B92A_CC14C027B731 29 | #define UTF8_FOR_CPP_CORE_H_2675DCD0_9480_4c0c_B92A_CC14C027B731 30 | 31 | #include 32 | 33 | namespace utf8 34 | { 35 | // The typedefs for 8-bit, 16-bit and 32-bit unsigned integers 36 | // You may need to change them to match your system. 37 | // These typedefs have the same names as ones from cstdint, or boost/cstdint 38 | typedef unsigned char uint8_t; 39 | typedef unsigned short uint16_t; 40 | typedef unsigned int uint32_t; 41 | 42 | // Helper code - not intended to be directly called by the library users. 
May be changed at any time 43 | namespace internal 44 | { 45 | // Unicode constants 46 | // Leading (high) surrogates: 0xd800 - 0xdbff 47 | // Trailing (low) surrogates: 0xdc00 - 0xdfff 48 | const uint16_t LEAD_SURROGATE_MIN = 0xd800u; 49 | const uint16_t LEAD_SURROGATE_MAX = 0xdbffu; 50 | const uint16_t TRAIL_SURROGATE_MIN = 0xdc00u; 51 | const uint16_t TRAIL_SURROGATE_MAX = 0xdfffu; 52 | const uint16_t LEAD_OFFSET = LEAD_SURROGATE_MIN - (0x10000 >> 10); 53 | const uint32_t SURROGATE_OFFSET = 0x10000u - (LEAD_SURROGATE_MIN << 10) - TRAIL_SURROGATE_MIN; 54 | 55 | // Maximum valid value for a Unicode code point 56 | const uint32_t CODE_POINT_MAX = 0x0010ffffu; 57 | 58 | template 59 | inline uint8_t mask8(octet_type oc) 60 | { 61 | return static_cast(0xff & oc); 62 | } 63 | template 64 | inline uint16_t mask16(u16_type oc) 65 | { 66 | return static_cast(0xffff & oc); 67 | } 68 | template 69 | inline bool is_trail(octet_type oc) 70 | { 71 | return ((mask8(oc) >> 6) == 0x2); 72 | } 73 | 74 | template 75 | inline bool is_surrogate(u16 cp) 76 | { 77 | return (cp >= LEAD_SURROGATE_MIN && cp <= TRAIL_SURROGATE_MAX); 78 | } 79 | 80 | template 81 | inline bool is_code_point_valid(u32 cp) 82 | { 83 | return (cp <= CODE_POINT_MAX && !is_surrogate(cp) && cp != 0xfffe && cp != 0xffff); 84 | } 85 | 86 | template 87 | inline typename std::iterator_traits::difference_type 88 | sequence_length(octet_iterator lead_it) 89 | { 90 | uint8_t lead = mask8(*lead_it); 91 | if (lead < 0x80) 92 | return 1; 93 | else if ((lead >> 5) == 0x6) 94 | return 2; 95 | else if ((lead >> 4) == 0xe) 96 | return 3; 97 | else if ((lead >> 3) == 0x1e) 98 | return 4; 99 | else 100 | return 0; 101 | } 102 | 103 | enum utf_error {OK, NOT_ENOUGH_ROOM, INVALID_LEAD, INCOMPLETE_SEQUENCE, OVERLONG_SEQUENCE, INVALID_CODE_POINT}; 104 | 105 | template 106 | utf_error validate_next(octet_iterator& it, octet_iterator end, uint32_t* code_point) 107 | { 108 | uint32_t cp = mask8(*it); 109 | // Check the lead octet 110 | 
        typedef typename std::iterator_traits<octet_iterator>::difference_type octet_difference_type;
        octet_difference_type length = sequence_length(it);

        // "Shortcut" for ASCII characters
        if (length == 1) {
            if (end - it > 0) {
                if (code_point)
                    *code_point = cp;
                ++it;
                return OK;
            }
            else
                return NOT_ENOUGH_ROOM;
        }

        // Do we have enough memory?
        if (std::distance(it, end) < length)
            return NOT_ENOUGH_ROOM;

        // Check trail octets and calculate the code point
        switch (length) {
            case 0:
                return INVALID_LEAD;
                break;
            case 2:
                if (is_trail(*(++it))) {
                    cp = ((cp << 6) & 0x7ff) + ((*it) & 0x3f);
                }
                else {
                    --it;
                    return INCOMPLETE_SEQUENCE;
                }
                break;
            case 3:
                if (is_trail(*(++it))) {
                    cp = ((cp << 12) & 0xffff) + ((mask8(*it) << 6) & 0xfff);
                    if (is_trail(*(++it))) {
                        cp += (*it) & 0x3f;
                    }
                    else {
                        std::advance(it, -2);
                        return INCOMPLETE_SEQUENCE;
                    }
                }
                else {
                    --it;
                    return INCOMPLETE_SEQUENCE;
                }
                break;
            case 4:
                if (is_trail(*(++it))) {
                    cp = ((cp << 18) & 0x1fffff) + ((mask8(*it) << 12) & 0x3ffff);
                    if (is_trail(*(++it))) {
                        cp += (mask8(*it) << 6) & 0xfff;
                        if (is_trail(*(++it))) {
                            cp += (*it) & 0x3f;
                        }
                        else {
                            std::advance(it, -3);
                            return INCOMPLETE_SEQUENCE;
                        }
                    }
                    else {
                        std::advance(it, -2);
                        return INCOMPLETE_SEQUENCE;
                    }
                }
                else {
                    --it;
                    return INCOMPLETE_SEQUENCE;
                }
                break;
        }
        // Is the code point valid?
        if (!is_code_point_valid(cp)) {
            for (octet_difference_type i = 0; i < length - 1; ++i)
                --it;
            return INVALID_CODE_POINT;
        }

        if (code_point)
            *code_point = cp;

        if (cp < 0x80) {
            if (length != 1) {
                std::advance(it, -(length-1));
                return OVERLONG_SEQUENCE;
            }
        }
        else if (cp < 0x800) {
            if (length != 2) {
                std::advance(it, -(length-1));
                return OVERLONG_SEQUENCE;
            }
        }
        else if (cp < 0x10000) {
            if (length != 3) {
                std::advance(it, -(length-1));
                return OVERLONG_SEQUENCE;
            }
        }

        ++it;
        return OK;
    }

    template <typename octet_iterator>
    inline utf_error validate_next(octet_iterator& it, octet_iterator end) {
        return validate_next(it, end, 0);
    }

} // namespace internal

    /// The library API - functions intended to be called by the users

    // Byte order mark
    const uint8_t bom[] = {0xef, 0xbb, 0xbf};

    template <typename octet_iterator>
    octet_iterator find_invalid(octet_iterator start, octet_iterator end)
    {
        octet_iterator result = start;
        while (result != end) {
            internal::utf_error err_code = internal::validate_next(result, end);
            if (err_code != internal::OK)
                return result;
        }
        return result;
    }

    template <typename octet_iterator>
    inline bool is_valid(octet_iterator start, octet_iterator end)
    {
        return (find_invalid(start, end) == end);
    }

    template <typename octet_iterator>
    inline bool is_bom (octet_iterator it)
    {
        return (
            (internal::mask8(*it++)) == bom[0] &&
            (internal::mask8(*it++)) == bom[1] &&
            (internal::mask8(*it))   == bom[2]
           );
    }
} // namespace utf8

#endif // header guard


--------------------------------------------------------------------------------
/source/utf8/unchecked.h:
--------------------------------------------------------------------------------
// Copyright 2006 Nemanja Trifunovic

/*
Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:

The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
*/


#ifndef UTF8_FOR_CPP_UNCHECKED_H_2675DCD0_9480_4c0c_B92A_CC14C027B731
#define UTF8_FOR_CPP_UNCHECKED_H_2675DCD0_9480_4c0c_B92A_CC14C027B731

#include "core.h"

namespace utf8
{
    namespace unchecked
    {
        template <typename octet_iterator>
        octet_iterator append(uint32_t cp, octet_iterator result)
        {
            if (cp < 0x80)                        // one octet
                *(result++) = static_cast<uint8_t>(cp);
            else if (cp < 0x800) {                // two octets
                *(result++) = static_cast<uint8_t>((cp >> 6)          | 0xc0);
                *(result++) = static_cast<uint8_t>((cp & 0x3f)        | 0x80);
            }
            else if (cp < 0x10000) {              // three octets
                *(result++) = static_cast<uint8_t>((cp >> 12)         | 0xe0);
                *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f) | 0x80);
                *(result++) = static_cast<uint8_t>((cp & 0x3f)        | 0x80);
            }
            else {                                // four octets
                *(result++) = static_cast<uint8_t>((cp >> 18)         | 0xf0);
                *(result++) = static_cast<uint8_t>(((cp >> 12)& 0x3f) | 0x80);
                *(result++) = static_cast<uint8_t>(((cp >> 6) & 0x3f) | 0x80);
                *(result++) = static_cast<uint8_t>((cp & 0x3f)        | 0x80);
            }
            return result;
        }

        template <typename octet_iterator>
        uint32_t next(octet_iterator& it)
        {
            uint32_t cp = internal::mask8(*it);
            typename std::iterator_traits<octet_iterator>::difference_type length = utf8::internal::sequence_length(it);
            switch (length) {
                case 1:
                    break;
                case 2:
                    it++;
                    cp = ((cp << 6) & 0x7ff) + ((*it) & 0x3f);
                    break;
                case 3:
                    ++it;
                    cp = ((cp << 12) & 0xffff) + ((internal::mask8(*it) << 6) & 0xfff);
                    ++it;
                    cp += (*it) & 0x3f;
                    break;
                case 4:
                    ++it;
                    cp = ((cp << 18) & 0x1fffff) + ((internal::mask8(*it) << 12) & 0x3ffff);
                    ++it;
                    cp += (internal::mask8(*it) << 6) & 0xfff;
                    ++it;
                    cp += (*it) & 0x3f;
                    break;
            }
            ++it;
            return cp;
        }

        template <typename octet_iterator>
        uint32_t peek_next(octet_iterator it)
        {
            return next(it);
        }

        template <typename octet_iterator>
        uint32_t prior(octet_iterator& it)
        {
            while (internal::is_trail(*(--it))) ;
            octet_iterator temp = it;
            return next(temp);
        }

        // Deprecated in versions that include prior, but only for the sake of consistency (see utf8::previous)
        template <typename octet_iterator>
        inline uint32_t previous(octet_iterator& it)
        {
            return prior(it);
        }

        template <typename octet_iterator, typename distance_type>
        void advance (octet_iterator& it, distance_type n)
        {
            for (distance_type i = 0; i < n; ++i)
                next(it);
        }

        template <typename octet_iterator>
        typename std::iterator_traits<octet_iterator>::difference_type
        distance (octet_iterator first, octet_iterator last)
        {
            typename std::iterator_traits<octet_iterator>::difference_type dist;
            for (dist = 0; first < last; ++dist)
                next(first);
            return dist;
        }

        template <typename u16bit_iterator, typename octet_iterator>
        octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result)
        {
            while (start != end) {
                uint32_t cp = internal::mask16(*start++);
                // Take care of surrogate pairs first
                if (internal::is_surrogate(cp)) {
                    uint32_t trail_surrogate = internal::mask16(*start++);
                    cp = (cp << 10) + trail_surrogate + internal::SURROGATE_OFFSET;
                }
                result = append(cp, result);
            }
            return result;
        }

        template <typename u16bit_iterator, typename octet_iterator>
        u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result)
        {
            while (start != end) {
                uint32_t cp = next(start);
                if (cp > 0xffff) { //make a surrogate pair
                    *result++ = static_cast<uint16_t>((cp >> 10)   + internal::LEAD_OFFSET);
                    *result++ = static_cast<uint16_t>((cp & 0x3ff) + internal::TRAIL_SURROGATE_MIN);
                }
                else
                    *result++ = static_cast<uint16_t>(cp);
            }
            return result;
        }

        template <typename octet_iterator, typename u32bit_iterator>
        octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result)
        {
            while (start != end)
                result = append(*(start++), result);

            return result;
        }

        template <typename octet_iterator, typename u32bit_iterator>
        u32bit_iterator utf8to32
        (octet_iterator start, octet_iterator end, u32bit_iterator result)
        {
            while (start < end)
                (*result++) = next(start);

            return result;
        }

        // The iterator class
        template <typename octet_iterator>
        class iterator : public std::iterator <std::bidirectional_iterator_tag, uint32_t> {
            octet_iterator it;
            public:
            iterator () {}
            explicit iterator (const octet_iterator& octet_it): it(octet_it) {}
            // the default "big three" are OK
            octet_iterator base () const { return it; }
            uint32_t operator * () const
            {
                octet_iterator temp = it;
                return next(temp);
            }
            bool operator == (const iterator& rhs) const
            {
                return (it == rhs.it);
            }
            bool operator != (const iterator& rhs) const
            {
                return !(operator == (rhs));
            }
            iterator& operator ++ ()
            {
                std::advance(it, internal::sequence_length(it));
                return *this;
            }
            iterator operator ++ (int)
            {
                iterator temp = *this;
                std::advance(it, internal::sequence_length(it));
                return temp;
            }
            iterator& operator -- ()
            {
                prior(it);
                return *this;
            }
            iterator operator -- (int)
            {
                iterator temp = *this;
                prior(it);
                return temp;
            }
        }; // class iterator

    } // namespace utf8::unchecked
} // namespace utf8


#endif // header guard

--------------------------------------------------------------------------------
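The surrogate constants in core.h (LEAD_OFFSET, SURROGATE_OFFSET) and their use in unchecked::utf8to16 and unchecked::utf16to8 are easier to follow with concrete numbers. The block below is a minimal standalone sketch of that arithmetic, not part of the library: the namespace surrogate_sketch and the helpers to_surrogates and from_surrogates are hypothetical names chosen for illustration, and the constants are re-declared as 32-bit values so the snippet compiles without core.h.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Illustration only: hypothetical stand-ins for the constants in utf8::internal.
namespace surrogate_sketch {

const std::uint32_t LEAD_SURROGATE_MIN  = 0xd800u;
const std::uint32_t LEAD_SURROGATE_MAX  = 0xdbffu;
const std::uint32_t TRAIL_SURROGATE_MIN = 0xdc00u;
const std::uint32_t TRAIL_SURROGATE_MAX = 0xdfffu;

// 0xd800 - 0x40 = 0xd7c0. Adding this to (cp >> 10) is the same as
// subtracting 0x10000 from cp first and then adding 0xd800 to the top bits.
const std::uint32_t LEAD_OFFSET = LEAD_SURROGATE_MIN - (0x10000u >> 10);

// Evaluated modulo 2^32; the wrap-around cancels out in from_surrogates.
const std::uint32_t SURROGATE_OFFSET =
    0x10000u - (LEAD_SURROGATE_MIN << 10) - TRAIL_SURROGATE_MIN;

// Split a supplementary code point (0x10000..0x10ffff) into (lead, trail),
// using the same expressions as the cp > 0xffff branch of utf8to16.
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000u && cp <= 0x10ffffu);
    const std::uint16_t lead =
        static_cast<std::uint16_t>((cp >> 10) + LEAD_OFFSET);
    const std::uint16_t trail =
        static_cast<std::uint16_t>((cp & 0x3ffu) + TRAIL_SURROGATE_MIN);
    return std::make_pair(lead, trail);
}

// Recombine a surrogate pair, using the same expression as utf16to8.
inline std::uint32_t from_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    assert(lead >= LEAD_SURROGATE_MIN && lead <= LEAD_SURROGATE_MAX);
    assert(trail >= TRAIL_SURROGATE_MIN && trail <= TRAIL_SURROGATE_MAX);
    return (static_cast<std::uint32_t>(lead) << 10) + trail + SURROGATE_OFFSET;
}

} // namespace surrogate_sketch
```

For example, U+1D11E splits into lead 0xd834 and trail 0xdd1e, and from_surrogates(0xd834, 0xdd1e) gives back 0x1d11e. Unlike this sketch, the unchecked functions perform no range checks, so they are only safe on input that is already known to be valid (for instance after utf8::is_valid has returned true).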