To the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work.

Abstract

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

When Unicode 1.0 was published in 1991, it defined 65536 code points from U+0000 to U+FFFF and assigned characters to around half of them. Many software implementations chose the obvious memory representation for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding. To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters, it became apparent that 65536 code points would not be sufficient. Unicode was extended to 1114112 code points from U+0000 to U+10FFFF, and the UTF-16 encoding was introduced. This encoding preserves compatibility with existing 16-bit based systems and represents new (supplementary) code points as a pair of “surrogates”.

Meanwhile, 16-bit based systems had little to no incentive to do anything about surrogates: for several years, Unicode did not assign any character to supplementary code points, and then (until emoji) only comparatively rare characters. Additionally, the Unicode Standard does not require conforming implementations to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved. For example:

In ECMAScript (a.k.a. JavaScript), a String value is defined as a sequence of 16-bit integers that usually represents UTF-16 text but may or may not be well-formed.

A Unicode code point is any value in the Unicode codespace; that is, the range of integers from 0 to 1114111. It is noted with a “U+” prefix and four to six hexadecimal digits: the first and last code points are U+0000 and U+10FFFF.

The Basic Multilingual Plane is the range of code points from U+0000 to U+FFFF.

A supplementary code point is a code point not in the Basic Multilingual Plane. That is, a code point in the range from U+10000 to U+10FFFF.

A Unicode scalar value is a code point that is not a surrogate code point. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+10FFFF.

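The ranges in these definitions can be written down as simple predicates. The following Python sketch is only an illustration: the function names are made up here and are not defined by this specification, and the surrogate range U+D800 to U+DFFF is taken as the gap that the scalar value definition leaves between U+D7FF and U+E000.

```python
# Illustrative helpers; none of these names come from the specification.

MAX_CODE_POINT = 0x10FFFF  # 1114111, the last value in the Unicode codespace


def is_code_point(value: int) -> bool:
    return 0 <= value <= MAX_CODE_POINT


def is_bmp(value: int) -> bool:
    # Basic Multilingual Plane: U+0000 to U+FFFF.
    return 0 <= value <= 0xFFFF


def is_supplementary(value: int) -> bool:
    # Supplementary code points: U+10000 to U+10FFFF.
    return 0x10000 <= value <= MAX_CODE_POINT


def is_surrogate(value: int) -> bool:
    # Assumed range: the gap between U+D7FF and U+E000.
    return 0xD800 <= value <= 0xDFFF


def is_scalar_value(value: int) -> bool:
    return is_code_point(value) and not is_surrogate(value)


def notation(value: int) -> str:
    # "U+" prefix followed by four to six hexadecimal digits.
    return f"U+{value:04X}"


assert notation(0) == "U+0000"
assert notation(MAX_CODE_POINT) == "U+10FFFF"
assert is_scalar_value(0xD7FF) and not is_scalar_value(0xD800)
```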

Note: This specification is only concerned with the UTF-16 encoding form (based on 16-bit code units), and not with the encoding scheme (based on bytes, with UTF-16BE and UTF-16LE variants).

The replacement character is the code point U+FFFD REPLACEMENT CHARACTER (�). It is used as a substitute to replace ill-formed sub-sequences during a conversion.

A 16-bit code unit is a 16-bit integer used in UTF-16. It is noted with a “0x” prefix and four hexadecimal digits: the first and last 16-bit code units are 0x0000 and 0xFFFF.

Note: The byte serialization or memory representation of a 16-bit code unit (little-endian or big-endian) is out of scope for this specification.

When an algorithm iterates over a sequence (“For every i in …”), consuming the next item means advancing in the sequence such that that item will be skipped during the following iteration of the loop: the item after the next becomes the next item.

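One way to picture consuming is a loop over an explicit iterator, where calling `next` inside the loop body makes the loop skip that item. The sketch below is illustrative only: the helper name `run` and the rule “consume the item that follows "2"” are assumptions chosen for the demonstration.

```python
def run(sequence):
    visited = []
    items = iter(sequence)
    for i in items:
        if i == "2":
            # Consume the next item: the loop will not visit it.
            next(items, None)
        visited.append(i)
    return visited


print(run("1234"))  # ['1', '2', '4']
```

The item “3” is never visited because it was consumed while the loop was processing “2”.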

Note: A surrogate byte sequence (and therefore any byte sequence described in this section) is ill-formed in UTF-8. Decoders are required to treat it as an error.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.

Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.

If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

This situation should be considered an error, but this specification does not define how to handle it. Possibilities include aborting the conversion, or replacing one of the surrogate code points of the pair with a replacement character.

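The steps of the encoding algorithm are not reproduced in this section, so the following Python sketch assumes the usual UTF-16 surrogate arithmetic. The function name is hypothetical. It shows why a surrogate code point pair in the input cannot survive the conversion: it produces exactly the same code units as the corresponding supplementary code point.

```python
def encode_to_utf16_units(code_points):
    """Encode code points to 16-bit code units (assumed algorithm, no error handling)."""
    units = []
    for cp in code_points:
        if cp <= 0xFFFF:
            # BMP code points, including unpaired surrogates, map to one unit.
            units.append(cp)
        else:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # lead surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # trail surrogate
    return units


# Unicode text: same result as a regular UTF-16 encoder.
assert encode_to_utf16_units([0x41, 0x1F600]) == [0x0041, 0xD83D, 0xDE00]

# A surrogate code point pair is indistinguishable, once encoded,
# from the supplementary code point U+1F600.
assert encode_to_utf16_units([0xD83D, 0xDE00]) == encode_to_utf16_units([0x1F600])
```

A real implementation would have to detect the second case and pick one of the error-handling options listed above; the sketch deliberately does not.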
For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).

Each code point is encoded as a sequence of one to four bytes:

Table 2. Bit distribution

Bytes noted in binary, most significant bit first. x bits represent the least significant bits of the code points.

It is composed of a byte sequence that is well-formed in generalized UTF-8 followed by a byte sequence in the following table.

Table 3. Well-formed byte sequences representing a single code point

Bytes noted in hexadecimal.

Code point              First byte   Second byte   Third byte   Fourth byte
U+0000 to U+007F        00 to 7F
U+0080 to U+07FF        C2 to DF     80 to BF
U+0800 to U+0FFF        E0           A0 to BF      80 to BF
U+1000 to U+FFFF        E1 to EF     80 to BF      80 to BF
U+10000 to U+3FFFF      F0           90 to BF      80 to BF     80 to BF
U+40000 to U+FFFFF      F1 to F3     80 to BF      80 to BF     80 to BF
U+100000 to U+10FFFF    F4           80 to 8F      80 to BF     80 to BF

6. The WTF-8 encoding

WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding, using 8-bit bytes, of code point sequences that do not contain any surrogate code point pair.

Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.

6.4. Converting between WTF-8 and UTF-8

Since WTF-8 is a superset of UTF-8, any sequence of bytes that is well-formed in UTF-8 is also well-formed in WTF-8 and represents the same text. To convert from UTF-8 to WTF-8, return the input unchanged.

Note: This conversion never fails and is lossless.

wtf-8.js implements this specification in JavaScript.

rust-wtf8 implements this specification in Rust.

On Windows (which uses potentially ill-formed UTF-16 in its APIs), the Rust standard library uses WTF-8 internally for OS strings, but does not expose the WTF-8 byte sequences.

Thanks for feedback and contributions from Anne van Kesteren, David Baron, Dylan Petonke, Guillaume Knispel, Henri Sivonen, Jacob Lifshay, James Graham, Lily Ballard, Mathias Bynens, Ms2ger, Sam Tobin-Hochstadt, Tab Atkins.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Shortname: wtf-8
Status: LS
Boilerplate: omit feedback-header
!Issue tracking: On GitHub
!Change history: On GitHub
!Last updated: [DATE]
Editor: Simon Sapin, Mozilla https://www.mozilla.org/, https://exyr.org/
Abstract: WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

Intended audience

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

In particular, the Encoding Standard [[ENCODING]] defines UTF-8 and other encodings for the Web. There is not, and will not be, any encoding label [[ENCODING]] or IANA charset alias [[CHARSETS]] for WTF-8.

Background and motivation

This section is non-normative.

When Unicode 1.0 was published in 1991, it defined 65536 code points from U+0000 to U+FFFF and assigned characters to around half of them. Many software implementations chose the obvious memory representation for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding. To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters, it became apparent that 65536 code points would not be sufficient. Unicode was extended to 1114112 code points from U+0000 to U+10FFFF, and the UTF-16 encoding was introduced. This encoding preserves compatibility with existing 16-bit based systems and represents new (supplementary) code points as a pair of “surrogates”.

UTF-16 is designed to represent any Unicode text, but it cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units. UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Meanwhile, 16-bit based systems had little to no incentive to do anything about surrogates: for several years, Unicode did not assign any character to supplementary code points, and then (until emoji) only comparatively rare characters. Additionally, the Unicode Standard does not require conforming implementations to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved. For example:

In ECMAScript (a.k.a. JavaScript), a String value is defined as a sequence of 16-bit integers that usually represents UTF-16 text but may or may not be well-formed.


These definitions correspond to those of the Glossary of Unicode Terms. [[!UNICODE]]

A Unicode code point is any value in the Unicode codespace; that is, the range of integers from 0 to 1114111. It is noted with a “U+” prefix and four to six hexadecimal digits: the first and last code points are U+0000 and U+10FFFF.

The Basic Multilingual Plane is the range of code points from U+0000 to U+FFFF.

A BMP code point is a code point in the Basic Multilingual Plane.

A supplementary code point is a code point not in the Basic Multilingual Plane. That is, a code point in the range from U+10000 to U+10FFFF.

A Unicode scalar value is a code point that is not a surrogate code point. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+10FFFF.

A BMP scalar value is a Unicode scalar value in the Basic Multilingual Plane. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+FFFF.

Unicode text is a sequence of Unicode scalar values.

UTF-8 is an encoding of Unicode text using 8-bit bytes. Each Unicode scalar value is represented as a sequence of one to four bytes.

UTF-16 is an encoding of Unicode text using 16-bit code units. BMP scalar values are represented as a single 16-bit code unit with the same value. Supplementary code points are represented as a surrogate 16-bit code unit pair.

Note: This specification is only concerned with the UTF-16 encoding form (based on 16-bit code units), and not with the encoding scheme (based on bytes, with UTF-16BE and UTF-16LE variants).

A string is well-formed (not ill-formed) in a given encoding if it follows the specification of that encoding. [[!UNICODE]] defines Well-Formed Code Unit Sequence for UTF-8 and UTF-16.


The replacement character is the code point U+FFFD REPLACEMENT CHARACTER (�). It is used as a substitute to replace ill-formed sub-sequences during a conversion.

A 16-bit code unit is a 16-bit integer used in UTF-16. It is noted with a “0x” prefix and four hexadecimal digits: the first and last 16-bit code units are 0x0000 and 0xFFFF.

Note: The byte serialization or memory representation of a 16-bit code unit (little-endian or big-endian) is out of scope for this specification.

When an algorithm iterates over a sequence (“For every i in …”), consuming the next item means advancing in the sequence such that that item will be skipped during the following iteration of the loop: the item after the next becomes the next item.

The following algorithm prints “1”, “2”, and “4”.

For every digit i in “1234”, run these substeps:


Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.

If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

This situation should be considered an error, but this specification does not define how to handle it. Possibilities include aborting the conversion, or replacing one of the surrogate code points of the pair with a replacement character.

Otherwise, append to result a code point of value U.

Return result.

Note: By construction, the resulting sequence does not contain a surrogate code point pair.

Note: If the input is well-formed in UTF-16, this is identical to decoding UTF-16 and the resulting sequence is Unicode text.

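Only the tail of the decoding steps is shown above, so the following Python sketch fills in the rest under an assumption: a lead surrogate code unit (0xD800 to 0xDBFF) immediately followed by a trail surrogate code unit (0xDC00 to 0xDFFF) is combined into one supplementary code point, and every other code unit U is kept as a code point of value U. The function name is hypothetical.

```python
def decode_from_utf16_units(units):
    """Decode potentially ill-formed UTF-16 code units to code points (sketch)."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        is_lead = 0xD800 <= u <= 0xDBFF
        if is_lead and i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
            trail = units[i]
            i += 1  # consume the trail surrogate
            result.append(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00))
        else:
            # BMP scalar value or unpaired surrogate: keep the value as is.
            result.append(u)
    return result


assert decode_from_utf16_units([0xD83D, 0xDE00]) == [0x1F600]   # paired
assert decode_from_utf16_units([0xDE00, 0xD83D]) == [0xDE00, 0xD83D]  # wrong order
assert decode_from_utf16_units([0xD83D, 0x0041]) == [0xD83D, 0x0041]  # unpaired lead
```

Because every lead surrogate followed by a trail surrogate is merged, the output can never contain a surrogate code point pair, which is the property stated in the first note.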
Generalized UTF-8

For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).

Each code point is encoded as a sequence of one to four bytes:

Table 2. Bit distribution

Bytes noted in binary, most significant bit first. x bits represent the least significant bits of the code points.

Append to result two bytes of values:

    0xC0 | (P >> 6)
    0x80 | (P & 0x3F)

U+0800 to U+FFFF
Append to result three bytes of values:

    0xE0 | (P >> 12)
    0x80 | ((P >> 6) & 0x3F)
    0x80 | (P & 0x3F)

U+10000 to U+10FFFF
Append to result four bytes of values:

    0xF0 | (P >> 18)
    0x80 | ((P >> 12) & 0x3F)
    0x80 | ((P >> 6) & 0x3F)
    0x80 | (P & 0x3F)

Return result.

Note: If the input contains a surrogate code point pair, the resulting byte sequence will not represent the original sequence of code points. Instead, it will represent the same code points as if it had been encoded in potentially ill-formed UTF-16. This is also consistent with encoding each code point to WTF-8 individually, and concatenating the resulting WTF-8 byte sequences.

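The byte formulas above translate almost directly into code. The Python sketch below is a minimal illustration under two assumptions: the single-byte branch for U+0000 to U+007F (not shown above) emits the code point value unchanged, and the input contains no surrogate code point pair, so P is simply each code point in turn. The function names are hypothetical.

```python
def encode_code_point(p: int) -> bytes:
    """Encode one code point P using the bit formulas above (sketch)."""
    if p <= 0x7F:
        # Assumed single-byte branch: the byte has the value of P.
        return bytes([p])
    if p <= 0x7FF:
        return bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
    if p <= 0xFFFF:
        return bytes([
            0xE0 | (p >> 12),
            0x80 | ((p >> 6) & 0x3F),
            0x80 | (p & 0x3F),
        ])
    return bytes([
        0xF0 | (p >> 18),
        0x80 | ((p >> 12) & 0x3F),
        0x80 | ((p >> 6) & 0x3F),
        0x80 | (p & 0x3F),
    ])


def encode(code_points) -> bytes:
    # Precondition (not checked): no lead surrogate directly followed
    # by a trail surrogate in the input.
    return b"".join(encode_code_point(p) for p in code_points)


# An unpaired surrogate becomes a three-byte sequence.
assert encode([0xD800]) == b"\xED\xA0\x80"
# For Unicode text the result matches Python's own UTF-8 encoder.
assert encode([0x41, 0xE9, 0x20AC, 0x1F600]) == "A\u00e9\u20ac\U0001F600".encode("utf-8")
```

As a cross-check, Python's built-in `"surrogatepass"` error handler for the `utf-8` codec produces the same bytes for a lone surrogate.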
Let result be a sequence of code points, initially empty.

For every byte B of the input, depending on B:

0x00 to 0x7F
Append to result a code point of value B.

0xC2 to 0xDF
Let B2 be the next byte and consume it. Append to result a code point of value

    ((B & 0x1F) << 6) + (B2 & 0x3F)

0xE0 to 0xEF
Let B2 and B3 be the next two bytes, and consume them. Append to result a code point of value

    ((B & 0x0F) << 12) + ((B2 & 0x3F) << 6) + (B3 & 0x3F)

0xF0 to 0xF4
Let B2, B3, and B4 be the next three bytes, and consume them. Append to result a code point of value

    ((B & 0x07) << 18) + ((B2 & 0x3F) << 12) + ((B3 & 0x3F) << 6) + (B4 & 0x3F)

Return result.

Note: If the input is also well-formed in UTF-8, this is identical to decoding UTF-8 and the resulting sequence is Unicode text.

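The decoding steps above can be sketched in Python with an explicit iterator, so that reading B2, B3 and B4 also consumes them. This is an illustration rather than a validating decoder: it assumes the input is well-formed, and the function name is hypothetical.

```python
def decode(data: bytes):
    """Decode a well-formed byte sequence to code points (sketch, no validation)."""
    result = []
    it = iter(data)
    for b in it:
        if b <= 0x7F:
            result.append(b)
        elif 0xC2 <= b <= 0xDF:
            b2 = next(it)  # consume one continuation byte
            result.append(((b & 0x1F) << 6) + (b2 & 0x3F))
        elif 0xE0 <= b <= 0xEF:
            b2, b3 = next(it), next(it)
            result.append(((b & 0x0F) << 12) + ((b2 & 0x3F) << 6) + (b3 & 0x3F))
        elif 0xF0 <= b <= 0xF4:
            b2, b3, b4 = next(it), next(it), next(it)
            result.append(
                ((b & 0x07) << 18)
                + ((b2 & 0x3F) << 12)
                + ((b3 & 0x3F) << 6)
                + (b4 & 0x3F)
            )
        else:
            # Not a valid first byte; the steps above assume this never happens.
            raise ValueError(f"unexpected byte 0x{b:02X}")
    return result


text = "A\u00e9\u20ac\U0001F600"
assert decode(text.encode("utf-8")) == [ord(c) for c in text]
# A surrogate byte sequence decodes to a surrogate code point.
assert decode(b"\xED\xA0\x80") == [0xD800]
```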
Converting between WTF-8 and potentially ill-formed UTF-16

Note: This conversion never fails and is lossless.

To convert from WTF-8 to potentially ill-formed UTF-16, run these steps:

Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.

Converting between WTF-8 and UTF-8

Since WTF-8 is a superset of UTF-8, any sequence of bytes that is well-formed in UTF-8 is also well-formed in WTF-8 and represents the same text. To convert from UTF-8 to WTF-8, return the input unchanged.

Note: This conversion never fails and is lossless.

To convert lossily from WTF-8 to UTF-8, replace any surrogate byte sequence with the sequence of three bytes <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character.

Note: Since surrogate byte sequences are also three bytes long, this conversion can be done in place.

Note: This conversion never fails but is lossy.

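The in-place property mentioned in the note can be illustrated with a mutable buffer. The Python sketch below assumes the input is well-formed WTF-8 and that a surrogate byte sequence is a three-byte sequence whose first byte is 0xED and whose second byte is in the range 0xA0 to 0xBF (the encoding of U+D800 to U+DFFF). The function name is hypothetical.

```python
REPLACEMENT = b"\xEF\xBF\xBD"  # UTF-8 encoding of U+FFFD


def to_utf8_lossy_in_place(buf: bytearray) -> None:
    """Overwrite every surrogate byte sequence in buf with U+FFFD (sketch)."""
    i = 0
    while i + 2 < len(buf):
        # In well-formed input, 0xED only occurs as the first byte of a
        # three-byte sequence, so no continuation byte can be mistaken for it.
        if buf[i] == 0xED and 0xA0 <= buf[i + 1] <= 0xBF:
            buf[i:i + 3] = REPLACEMENT  # same length: the buffer does not move
            i += 3
        else:
            i += 1


buf = bytearray(b"a\xED\xA0\x80\xC3\xA9")  # "a", lone U+D800, U+00E9
to_utf8_lossy_in_place(buf)
assert bytes(buf).decode("utf-8") == "a\ufffd\u00e9"
```

After the call the buffer is well-formed UTF-8, but the information about which surrogate was present is gone.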
To convert strictly from WTF-8 to UTF-8, run these steps:

Note: This conversion is lossless when it succeeds, but it can fail.

Concatenating WTF-8 strings

Concatenating WTF-8 strings requires extra care to preserve well-formedness.

To concatenate two WTF-8 strings, run these steps:

Let supplementary be the encoding to WTF-8 of a single code point of value

    0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

Let left be the substring of the left input string that removes the three final bytes.

Let right be the substring of the right input string that removes the three initial bytes.

Return the concatenation of left, supplementary, and right.

Otherwise, return the concatenation of the two input byte sequences.

Note: This is equivalent to converting both strings to potentially ill-formed UTF-16, concatenating the resulting 16-bit code unit sequences, then converting the concatenation back to WTF-8.

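The condition that selects the first branch is not shown above. The Python sketch below assumes it is: the left string ends with the three-byte encoding of a lead surrogate (U+D800 to U+DBFF) and the right string starts with the three-byte encoding of a trail surrogate (U+DC00 to U+DFFF). Function names are hypothetical, and both inputs are assumed to be well-formed WTF-8.

```python
def _surrogate_at(data: bytes, start: int):
    """Return the surrogate code point encoded at data[start:start+3], or None."""
    chunk = data[start:start + 3]
    if len(chunk) == 3 and chunk[0] == 0xED and 0xA0 <= chunk[1] <= 0xBF:
        return ((chunk[0] & 0x0F) << 12) | ((chunk[1] & 0x3F) << 6) | (chunk[2] & 0x3F)
    return None


def concat_wtf8(left_input: bytes, right_input: bytes) -> bytes:
    lead = None
    if len(left_input) >= 3:
        lead = _surrogate_at(left_input, len(left_input) - 3)
    trail = _surrogate_at(right_input, 0)
    if (lead is not None and 0xD800 <= lead <= 0xDBFF
            and trail is not None and 0xDC00 <= trail <= 0xDFFF):
        value = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
        # A supplementary code point is a scalar value, so its WTF-8
        # encoding is the same as its UTF-8 encoding.
        supplementary = chr(value).encode("utf-8")
        return left_input[:-3] + supplementary + right_input[3:]
    return left_input + right_input


lead_bytes = chr(0xD83D).encode("utf-8", "surrogatepass")
trail_bytes = chr(0xDE00).encode("utf-8", "surrogatepass")
# Lead then trail merge into the four-byte encoding of U+1F600.
assert concat_wtf8(b"a" + lead_bytes, trail_bytes + b"b") == "a\U0001F600b".encode("utf-8")
# Trail then lead is not a pair: plain byte concatenation.
assert concat_wtf8(trail_bytes, lead_bytes) == trail_bytes + lead_bytes
```

Plain byte concatenation in the first case would leave two adjacent surrogate byte sequences in the result, which is the loss of well-formedness this procedure is designed to prevent.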
Implementations

This section is non-normative.

wtf-8.js implements this specification in JavaScript.

rust-wtf8 implements this specification in Rust.

On Windows (which uses potentially ill-formed UTF-16 in its APIs), the Rust standard library uses WTF-8 internally for OS strings, but does not expose the WTF-8 byte sequences.


Acknowledgments

Thanks to Coralie Mercier for coining the name WTF-8.

Thanks for feedback and contributions from Anne van Kesteren, David Baron, Dylan Petonke, Guillaume Knispel, Henri Sivonen, Jacob Lifshay, James Graham, Lily Ballard, Mathias Bynens, Ms2ger, Sam Tobin-Hochstadt, Tab Atkins.