├── Makefile
├── README.md
├── copyright.include
├── header.include
├── index.html
├── index.src.html
├── logo-wtf-8.svg
└── zero.png

/Makefile:
--------------------------------------------------------------------------------
# https://github.com/tabatkins/bikeshed
index.html: index.src.html Makefile *.include
	bikeshed spec $< $@

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
The WTF-8 encoding specification
--------------------------------

[Latest version](https://simonsapin.github.io/wtf-8/)

--------------------------------------------------------------------------------
/copyright.include:
--------------------------------------------------------------------------------

To the extent possible under law, the editors have waived all copyright
and related or neighboring rights to this work.

--------------------------------------------------------------------------------
/header.include:
--------------------------------------------------------------------------------
[TITLE]

--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------

The WTF-8 encoding

Editor: Simon Sapin (Mozilla)
Issue tracking: On GitHub
Change history: On GitHub
Last updated: 23 February 2022

To the extent possible under law, the editors have waived all copyright and related or neighboring rights to this work.

Abstract

782 |

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

1 Intended audience
2 Background and motivation
  2.1 Differences with CESU-8
3 Terminology
  3.1 Surrogate code points
  3.2 Surrogate 16-bit code units
  3.3 Surrogate byte sequences
4 Potentially ill-formed UTF-16
  4.1 Encoding
  4.2 Decoding
5 Generalized UTF-8
6 The WTF-8 encoding
  6.1 Encoding
  6.2 Decoding
  6.3 Converting between WTF-8 and potentially ill-formed UTF-16
  6.4 Converting between WTF-8 and UTF-8
  6.5 Concatenating WTF-8 strings
7 Implementations
8 Acknowledgments
Conformance
Index
  Terms defined by this specification
References
  Normative References
  Informative References

1. Intended audience

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

In particular, the Encoding Standard [ENCODING] defines UTF-8 and other encodings for the Web. There is not, and will not be, any encoding label [ENCODING] or IANA charset alias [CHARSETS] for WTF-8.

2. Background and motivation

This section is non-normative.

When Unicode 1.0 was published in 1991, it defined 65536 code points from U+0000 to U+FFFF and assigned characters to around half of them. Many software implementations chose the obvious memory representation for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding. To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters, it became apparent that 65536 code points would not be sufficient. Unicode was extended to 1114112 code points from U+0000 to U+10FFFF, and the UTF-16 encoding was introduced. This encoding preserves compatibility with existing 16-bit based systems and represents new (supplementary) code points as a pair of “surrogates”.

UTF-16 is designed to represent any Unicode text, but it cannot represent a surrogate code point pair, since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units. UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Meanwhile, 16-bit based systems had little to no incentive to do anything about surrogates: for several years, Unicode did not assign any character to supplementary code points, and then (until emoji) only comparatively rare characters. Additionally, the Unicode Standard does not require conforming implementations to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved. For example:

In ECMAScript (a.k.a. JavaScript), a String value is defined as a sequence of 16-bit integers that usually represents UTF-16 text but may or may not be well-formed.
Windows applications normally use UTF-16, but the file system treats path and file names as an opaque sequence of WCHARs (16-bit code units).

We say that strings in these systems are encoded in potentially ill-formed UTF-16 or WTF-16.

Unpaired surrogate 16-bit code units are the only case where an arbitrary sequence of 16-bit code units is ill-formed in UTF-16. UTF-8, however, is more complex and maintaining its well-formedness is arguably more valuable.

This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code units (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8.

2.1. Differences with CESU-8

Unicode defines a Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). WTF-8 is different from CESU-8.

CESU-8 encodes supplementary code points as surrogate pair byte sequences of six bytes, whereas WTF-8, like UTF-8, encodes them as sequences of four bytes. Therefore, CESU-8 is not a superset of UTF-8.

CESU-8 is also a mapping on UTF-16 code units. Therefore unpaired surrogate byte sequences are ill-formed in CESU-8, whereas supporting them is the entire point of WTF-8.

3. Terminology

These definitions correspond to those of the Glossary of Unicode Terms. [UNICODE]

A Unicode code point is any value in the Unicode codespace; that is, the range of integers from 0 to 1114111. It is noted with a “U+” prefix and four to six hexadecimal digits: the first and last code points are U+0000 and U+10FFFF.

The Basic Multilingual Plane is the range of code points from U+0000 to U+FFFF.

A BMP code point is a code point in the Basic Multilingual Plane.

A supplementary code point is a code point not in the Basic Multilingual Plane. That is, a code point in the range from U+10000 to U+10FFFF.

A Unicode scalar value is a code point that is not a surrogate code point. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+10FFFF.

A BMP scalar value is a Unicode scalar value in the Basic Multilingual Plane. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+FFFF.

Unicode text is a sequence of Unicode scalar values.

UTF-8 is an encoding of Unicode text using 8-bit bytes. Each Unicode scalar value is represented as a sequence of one to four bytes.

UTF-16 is an encoding of Unicode text using 16-bit code units. BMP scalar values are represented as a single 16-bit code unit with the same value. Supplementary code points are represented as a surrogate 16-bit code unit pair.

Note: this specification is only concerned with the UTF-16 encoding form (based on 16-bit code units), and not with the encoding scheme (based on bytes, with UTF-16BE and UTF-16LE variants).

A string is well-formed (not ill-formed) in a given encoding if it follows the specification of that encoding. [UNICODE] defines Well-Formed Code Unit Sequence for UTF-8 and UTF-16.

In particular:

Unpaired surrogate 16-bit code units are ill-formed in UTF-16.
Surrogate byte sequences are ill-formed in UTF-8.
Surrogate pair byte sequences are ill-formed in WTF-8.

The replacement character is the code point U+FFFD REPLACEMENT CHARACTER (�). It is used as a substitute to replace ill-formed sub-sequences during a conversion.

A 16-bit code unit is a 16-bit integer used in UTF-16. It is noted with a “0x” prefix and four hexadecimal digits: the first and last 16-bit code units are 0x0000 and 0xFFFF.

Note: The byte serialization or memory representation of a 16-bit code unit (little-endian or big-endian) is out of scope for this specification.

When an algorithm iterates over a sequence (“For every i in …”), consuming the next item means advancing in the sequence such that that item will be skipped during the following iteration of the loop: the item after the next becomes the next item.

The following algorithm prints “1”, “2”, and “4”.

For every digit i in “1234”, run these substeps:

1. Print i
2. If i is 2, consume the next digit.
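The “consume” convention in the example above can be sketched in Python with an explicit iterator. This is only an illustration; the function name `run_example` is hypothetical and not part of the specification.

```python
def run_example(digits: str) -> list[str]:
    """Collect what the example algorithm would print for the given digits."""
    printed = []
    it = iter(digits)
    for i in it:
        printed.append(i)       # "Print i"
        if i == "2":
            next(it, None)      # consume the next digit, if there is one
    return printed


print(run_example("1234"))
```

Calling `next` on the same iterator that drives the loop is what makes the following iteration skip the consumed item.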

3.1. Surrogate code points

A lead surrogate code point or high surrogate code point is a code point in the range from U+D800 to U+DBFF.

A trail surrogate code point or low surrogate code point is a code point in the range from U+DC00 to U+DFFF.

A surrogate code point is either a lead surrogate code point or a trail surrogate code point. That is, a code point in the range from U+D800 to U+DFFF.

A surrogate code point pair is a sequence of a lead surrogate code point followed by a trail surrogate code point.

An unpaired surrogate code point is a surrogate code point that is not part of a surrogate code point pair.

3.2. Surrogate 16-bit code units

A lead surrogate 16-bit code unit or high surrogate 16-bit code unit is a 16-bit code unit in the range from 0xD800 to 0xDBFF.

A trail surrogate 16-bit code unit or low surrogate 16-bit code unit is a 16-bit code unit in the range from 0xDC00 to 0xDFFF.

A surrogate 16-bit code unit is either a lead surrogate 16-bit code unit or a trail surrogate 16-bit code unit. That is, a 16-bit code unit in the range from 0xD800 to 0xDFFF.

A surrogate 16-bit code unit pair is a sequence of a lead surrogate 16-bit code unit followed by a trail surrogate 16-bit code unit. In UTF-16, it represents a supplementary code point.

An unpaired surrogate 16-bit code unit is a surrogate 16-bit code unit that is not part of a surrogate 16-bit code unit pair.

3.3. Surrogate byte sequences

Note: A surrogate byte sequence (and therefore any byte sequence described in this section) is ill-formed in UTF-8. Decoders are required to treat it as an error.

A lead surrogate byte sequence or high surrogate byte sequence is a sequence of three bytes that represents a lead surrogate code point in generalized UTF-8.

A trail surrogate byte sequence or low surrogate byte sequence is a sequence of three bytes that represents a trail surrogate code point in generalized UTF-8.

A surrogate byte sequence is either a lead surrogate byte sequence or a trail surrogate byte sequence. That is, a sequence of three bytes that represents a surrogate code point in generalized UTF-8.

Table 1. Surrogate byte sequences
Bytes noted in hexadecimal.

| | First byte | Second byte | Third byte |
|---|---|---|---|
| Lead surrogate byte sequence | ED | A0 to AF | 80 to BF |
| Trail surrogate byte sequence | ED | B0 to BF | 80 to BF |
| Surrogate byte sequence | ED | A0 to BF | 80 to BF |

A surrogate pair byte sequence is a sequence of six bytes composed of a lead surrogate byte sequence followed by a trail surrogate byte sequence.

An unpaired surrogate byte sequence is a surrogate byte sequence that is not part of a surrogate pair byte sequence.

4. Potentially ill-formed UTF-16

A sequence of 16-bit code units is potentially ill-formed UTF-16 if it is intended to be interpreted as UTF-16, but is not necessarily well-formed in UTF-16. It effectively encodes a sequence of code points that does not contain any surrogate code point pair.

Note: Like UTF-16, potentially ill-formed UTF-16 cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Unlike well-formed UTF-16, it might contain isolated surrogate code points.

Any sequence of 16-bit code units has an interpretation as potentially ill-formed UTF-16.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.

4.1. Encoding

To encode from code points to potentially ill-formed UTF-16, run these steps:

1. Let result be a sequence of 16-bit code units, initially empty.
2. For every code point P of the input, run these substeps:
  1. If P is a supplementary code point, append to result two 16-bit code units of values:
    1. ((P - 0x10000) >> 10) + 0xD800
    2. ((P - 0x10000) & 0x3FF) + 0xDC00
  2. Otherwise (P is a BMP code point), append to result a 16-bit code unit of value P.
3. Return result.
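The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `encode_wtf16` is hypothetical, and code points and code units are modelled as plain integers in a list.

```python
def encode_wtf16(code_points: list[int]) -> list[int]:
    """Encode code points to potentially ill-formed UTF-16 code units."""
    result = []
    for p in code_points:
        if p >= 0x10000:
            # Supplementary code point: emit a lead and a trail surrogate.
            result.append(((p - 0x10000) >> 10) + 0xD800)
            result.append(((p - 0x10000) & 0x3FF) + 0xDC00)
        else:
            # BMP code point (surrogates included): emit it unchanged.
            result.append(p)
    return result


print([hex(u) for u in encode_wtf16([0x41, 0x1F4A9, 0xD800])])
```

Like the algorithm it mirrors, this sketch does not detect a surrogate code point pair in the input.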

Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.

If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

This situation should be considered an error, but this specification does not define how to handle it. Possibilities include aborting the conversion, or replacing one of the surrogate code points of the pair with a replacement character.

4.2. Decoding

To decode from potentially ill-formed UTF-16 to code points, run these steps:

1. Let result be a sequence of code points, initially empty.
2. For every 16-bit code unit U of the input, run these substeps:
  1. If U is a lead surrogate 16-bit code unit, U is not the last 16-bit code unit of the input, and the next 16-bit code unit of the input next is a trail surrogate 16-bit code unit, then consume next and append to result a code point of value 0x10000 + ((U - 0xD800) << 10) + (next - 0xDC00).
  2. Otherwise, append to result a code point of value U.
3. Return result.
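A possible Python rendering of these steps is shown below. It is a sketch under the same assumptions as before (integers in lists); the name `decode_wtf16` is hypothetical.

```python
def decode_wtf16(units: list[int]) -> list[int]:
    """Decode potentially ill-formed UTF-16 code units to code points."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Paired surrogates: combine them and consume the trail unit.
            result.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP scalar value or unpaired surrogate: pass it through.
            result.append(u)
            i += 1
    return result


print([hex(p) for p in decode_wtf16([0xD83D, 0xDCA9, 0xD83D])])
```

An unpaired surrogate, such as the final 0xD83D in the usage line, is passed through as a surrogate code point.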

Note: By construction, the resulting sequence does not contain a surrogate code point pair.

Note: If the input is well-formed in UTF-16, this is identical to decoding UTF-16 and the resulting sequence is Unicode text.

5. Generalized UTF-8

For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).

Each code point is encoded as a sequence of one to four bytes:

Table 2. Bit distribution
Bytes noted in binary, most significant bit first. `x` bits represent the least significant bits of the code points.

| Code point | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| U+0000 to U+007F | 0xxxxxxx | | | |
| U+0080 to U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 to U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

A byte sequence is well-formed in generalized UTF-8 if and only if:

It is the empty string, or
It is composed of a byte sequence that is well-formed in generalized UTF-8 followed by a byte sequence in the following table.

Table 3. Well-formed byte sequences representing a single code point
Bytes noted in hexadecimal.

| Code point | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| U+0000 to U+007F | 00 to 7F | | | |
| U+0080 to U+07FF | C2 to DF | 80 to BF | | |
| U+0800 to U+0FFF | E0 | A0 to BF | 80 to BF | |
| U+1000 to U+FFFF | E1 to EF | 80 to BF | 80 to BF | |
| U+10000 to U+3FFFF | F0 | 90 to BF | 80 to BF | 80 to BF |
| U+40000 to U+FFFFF | F1 to F3 | 80 to BF | 80 to BF | 80 to BF |
| U+100000 to U+10FFFF | F4 | 80 to 8F | 80 to BF | 80 to BF |
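The well-formedness rule and Table 3 can be turned into a small checker. The Python sketch below is one way to do it, not something the specification prescribes; the names `ROWS` and `is_generalized_utf8` are hypothetical.

```python
# One entry per row of Table 3:
# (first-byte range, second-byte range or None, count of further 80..BF bytes)
ROWS = [
    ((0x00, 0x7F), None, 0),
    ((0xC2, 0xDF), (0x80, 0xBF), 0),
    ((0xE0, 0xE0), (0xA0, 0xBF), 1),
    ((0xE1, 0xEF), (0x80, 0xBF), 1),
    ((0xF0, 0xF0), (0x90, 0xBF), 2),
    ((0xF1, 0xF3), (0x80, 0xBF), 2),
    ((0xF4, 0xF4), (0x80, 0x8F), 2),
]


def is_generalized_utf8(data: bytes) -> bool:
    """Return True if data is a concatenation of Table 3 byte sequences."""
    i = 0
    while i < len(data):
        for first, second, extra in ROWS:
            if first[0] <= data[i] <= first[1]:
                break
        else:
            return False  # not a valid first byte in any row
        length = 1 if second is None else 2 + extra
        seq = data[i:i + length]
        if len(seq) < length:
            return False  # truncated sequence
        if second is not None:
            if not second[0] <= seq[1] <= second[1]:
                return False
            if any(not 0x80 <= b <= 0xBF for b in seq[2:]):
                return False
        i += length
    return True


print(is_generalized_utf8(b"\xed\xa0\x80"), is_generalized_utf8(b"\xc0\x80"))
```

The first usage input is a surrogate byte sequence, which generalized UTF-8 accepts; the second is an overlong encoding, which no row of the table admits.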

6. The WTF-8 encoding

WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding of code point sequences that do not contain any surrogate code point pair using 8-bit bytes.

Note: Just as UTF-8 is artificially restricted to Unicode text in order to match UTF-16, WTF-8 is artificially restricted to exclude surrogate code point pairs in order to match potentially ill-formed UTF-16.

It is identical to generalized UTF-8, with the additional well-formedness constraint that a surrogate pair byte sequence is ill-formed. It is a strict subset of generalized UTF-8 and a strict superset of UTF-8.

Note: Similarly, UTF-8 is a strict superset of ASCII.

WTF-8 must not be used for interchange. See Intended audience.

6.1. Encoding

To encode from code points to well-formed WTF-8, run these steps:

1. Let result be a sequence of bytes, initially empty.
2. For every code point P of the input, run these substeps:
  1. If P is a lead surrogate code point, P is not the last code point of the input, and the next code point (call it next) is a trail surrogate code point, consume next and set P’s value to: 0x10000 + ((P - 0xD800) << 10) + (next - 0xDC00).
  2. Depending on P:
    U+0000 to U+007F
      Append to result one byte of value P.
    U+0080 to U+07FF
      Append to result two bytes of values:
      1. 0xC0 | (P >> 6)
      2. 0x80 | (P & 0x3F)
    U+0800 to U+FFFF
      Append to result three bytes of values:
      1. 0xE0 | (P >> 12)
      2. 0x80 | ((P >> 6) & 0x3F)
      3. 0x80 | (P & 0x3F)
    U+10000 to U+10FFFF
      Append to result four bytes of values:
      1. 0xF0 | (P >> 18)
      2. 0x80 | ((P >> 12) & 0x3F)
      3. 0x80 | ((P >> 6) & 0x3F)
      4. 0x80 | (P & 0x3F)
3. Return result.
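As an illustration, the encoding steps might look like this in Python. The function name `encode_wtf8` is hypothetical, and the input is assumed to be a list of integer code points in the range 0 to 0x10FFFF.

```python
def encode_wtf8(code_points: list[int]) -> bytes:
    """Encode code points to well-formed WTF-8."""
    result = bytearray()
    i = 0
    while i < len(code_points):
        p = code_points[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF and i < len(code_points)
                and 0xDC00 <= code_points[i] <= 0xDFFF):
            # A surrogate code point pair collapses into one supplementary
            # code point; the trail surrogate is consumed.
            p = 0x10000 + ((p - 0xD800) << 10) + (code_points[i] - 0xDC00)
            i += 1
        if p <= 0x7F:
            result.append(p)
        elif p <= 0x7FF:
            result += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p <= 0xFFFF:
            result += bytes([0xE0 | (p >> 12),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
        else:
            result += bytes([0xF0 | (p >> 18),
                             0x80 | ((p >> 12) & 0x3F),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
    return bytes(result)


print(encode_wtf8([0xD800]).hex(), encode_wtf8([0xD83D, 0xDCA9]).hex())
```

The usage line encodes an unpaired lead surrogate (three bytes) and a surrogate code point pair (four bytes, the same as encoding U+1F4A9 directly).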

Note: If the input contains a surrogate code point pair, the resulting byte sequence will not represent the original sequence of code points. Instead, it will represent the same code points as if it had been encoded in potentially ill-formed UTF-16. This is also consistent with encoding each code point to WTF-8 individually, and concatenating the resulting WTF-8 byte sequences.

6.2. Decoding

To decode from well-formed WTF-8 to code points, run these steps:

Note: Since WTF-8 must not be used for interchange (see Intended audience), this algorithm is deliberately not defined for arbitrary byte sequences. It is only defined for byte sequences known to be well-formed in WTF-8, such as sequences encoded from code points, converted from UTF-16, or concatenated from sequences themselves well-formed in WTF-8.

1. Let result be a sequence of code points, initially empty.
2. For every byte B of the input, depending on B:
  0x00 to 0x7F
    Append to result a code point of value B.
  0xC2 to 0xDF
    Let B2 be the next byte and consume it.
    Append to result a code point of value ((B & 0x1F) << 6) + (B2 & 0x3F)
  0xE0 to 0xEF
    Let B2 and B3 be the next two bytes, and consume them.
    Append to result a code point of value ((B & 0x0F) << 12) + ((B2 & 0x3F) << 6) + (B3 & 0x3F)
  0xF0 to 0xF4
    Let B2, B3, and B4 be the next three bytes, and consume them.
    Append to result a code point of value ((B & 0x07) << 18) + ((B2 & 0x3F) << 12) + ((B3 & 0x3F) << 6) + (B4 & 0x3F)
3. Return result.
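The decoding steps can be sketched in Python as follows. As in the specification, the sketch assumes its input is already known to be well-formed WTF-8; the name `decode_wtf8` and the choice to raise `ValueError` on an unexpected first byte are assumptions of this example.

```python
def decode_wtf8(data: bytes) -> list[int]:
    """Decode well-formed WTF-8 to a list of code points."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            result.append(b)
            i += 1
        elif 0xC2 <= b <= 0xDF:
            result.append(((b & 0x1F) << 6) + (data[i + 1] & 0x3F))
            i += 2
        elif 0xE0 <= b <= 0xEF:
            result.append(((b & 0x0F) << 12)
                          + ((data[i + 1] & 0x3F) << 6)
                          + (data[i + 2] & 0x3F))
            i += 3
        elif 0xF0 <= b <= 0xF4:
            result.append(((b & 0x07) << 18)
                          + ((data[i + 1] & 0x3F) << 12)
                          + ((data[i + 2] & 0x3F) << 6)
                          + (data[i + 3] & 0x3F))
            i += 4
        else:
            # Cannot happen for well-formed input; this sketch fails loudly.
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
    return result


print([hex(p) for p in decode_wtf8(b"a\xed\xa0\x80\xf0\x9f\x92\xa9")])
```

Continuation bytes are not validated here, which matches the algorithm's precondition but makes the function unsuitable for untrusted input.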

Note: If the input is also well-formed in UTF-8, this is identical to decoding UTF-8 and the resulting sequence is Unicode text.

6.3. Converting between WTF-8 and potentially ill-formed UTF-16

To convert from potentially ill-formed UTF-16 to WTF-8, run these steps:

1. Decode from potentially ill-formed UTF-16 to code points
2. Encode from code points to well-formed WTF-8

Note: This conversion never fails and is lossless.

To convert from WTF-8 to potentially ill-formed UTF-16, run these steps:

1. Decode from well-formed WTF-8 to code points
2. Encode from code points to potentially ill-formed UTF-16

Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.
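For experimentation, both conversions can be approximated with Python's built-in codecs and the `surrogatepass` error handler, with a Python `str` playing the role of the intermediate code point sequence. This is not how the specification defines the conversions, only a shortcut that should agree with them on the inputs described here. The function names are hypothetical. Note that Python's UTF-8 decoder with `surrogatepass` also accepts surrogate pair byte sequences, so `wtf8_to_utf16_units` assumes its input is already well-formed WTF-8.

```python
def utf16_units_to_wtf8(units: list[int]) -> bytes:
    """Convert potentially ill-formed UTF-16 code units to WTF-8."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # Paired surrogates decode to one supplementary code point;
    # unpaired ones survive as lone surrogates in the str.
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf16_units(data: bytes) -> list[int]:
    """Convert well-formed WTF-8 to potentially ill-formed UTF-16 code units."""
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


units = [0x0041, 0xD83D, 0xDCA9, 0xDCA9, 0xD83D]
wtf8 = utf16_units_to_wtf8(units)
print(wtf8.hex(), wtf8_to_utf16_units(wtf8) == units)
```

The usage lines round-trip a sequence containing one surrogate pair followed by a trail and a lead surrogate in the wrong order, which stay unpaired.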

1302 |

6.4. Converting between WTF-8 and UTF-8

Since WTF-8 is a superset of UTF-8, any sequence of bytes that is well-formed in UTF-8 is also well-formed in WTF-8 and represents the same text. To convert from UTF-8 to WTF-8, return the input unchanged.

Note: This conversion never fails and is lossless.

To convert lossily from WTF-8 to UTF-8, replace any surrogate byte sequence with the sequence of three bytes <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character.

Note: Since surrogate byte sequences are also three bytes long, this conversion can be done in place.

Note: This conversion never fails but is lossy.
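A minimal sketch of the lossy conversion in Python, using a byte-level regular expression built from the ranges in Table 1. The names `SURROGATE_SEQUENCE` and `wtf8_to_utf8_lossy` are hypothetical. The pattern relies on the input being well-formed WTF-8, where the byte 0xED can only appear as the first byte of a three-byte sequence.

```python
import re

# ED, then A0..BF, then 80..BF: any surrogate byte sequence (Table 1).
SURROGATE_SEQUENCE = re.compile(rb"\xED[\xA0-\xBF][\x80-\xBF]")


def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Replace every surrogate byte sequence with U+FFFD encoded in UTF-8."""
    return SURROGATE_SEQUENCE.sub(b"\xEF\xBF\xBD", data)


print(wtf8_to_utf8_lossy(b"a\xed\xa0\x80b").decode("utf-8"))
```

Because `bytes` objects are immutable, this version returns a new object; an in-place variant would need a `bytearray`.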

To convert strictly from WTF-8 to UTF-8, run these steps:

1. If the input contains a surrogate byte sequence, return failure.
2. Otherwise, return the input unchanged.
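The strict conversion could be sketched in Python like this. Representing failure as `None` is a choice made for this example only (raising an exception would work equally well), and the function name `wtf8_to_utf8_strict` is hypothetical. The input is assumed to be well-formed WTF-8.

```python
import re
from typing import Optional

# ED, then A0..BF, then 80..BF: any surrogate byte sequence (Table 1).
SURROGATE_SEQUENCE = re.compile(rb"\xED[\xA0-\xBF][\x80-\xBF]")


def wtf8_to_utf8_strict(data: bytes) -> Optional[bytes]:
    """Return data unchanged if it is also UTF-8, or None on failure."""
    if SURROGATE_SEQUENCE.search(data):
        return None
    return data


print(wtf8_to_utf8_strict(b"abc"), wtf8_to_utf8_strict(b"\xed\xb2\xa9"))
```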

Note: This conversion is lossless when it succeeds, but it can fail.

6.5. Concatenating WTF-8 strings

Concatenating WTF-8 strings requires extra care to preserve well-formedness.

To concatenate two WTF-8 strings, run these steps:

1. If the left input string ends with a lead surrogate byte sequence and the right input string starts with a trail surrogate byte sequence, run these substeps:
  1. Let lead and trail be two code points, the respective results of decoding from WTF-8 these two surrogate byte sequences.
  2. Let supplementary be the encoding to WTF-8 of a single code point of value 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
  3. Let left be the substring of the left input string that removes the three final bytes.
  4. Let right be the substring of the right input string that removes the three initial bytes.
  5. Return the concatenation of left, supplementary, and right.
2. Otherwise, return the concatenation of the two input byte sequences.

Note: This is equivalent to converting both strings to potentially ill-formed UTF-16, concatenating the resulting 16-bit code unit sequences, then converting the concatenation back to WTF-8.
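The concatenation steps can be sketched in Python as follows. The helper names are hypothetical and both inputs are assumed to be well-formed WTF-8. Since the merged code point is always a supplementary code point, its WTF-8 encoding is the same as its UTF-8 encoding, so the sketch uses Python's UTF-8 encoder for that step.

```python
def _is_lead_sequence(seq: bytes) -> bool:
    """True for a lead surrogate byte sequence (Table 1)."""
    return (len(seq) == 3 and seq[0] == 0xED
            and 0xA0 <= seq[1] <= 0xAF and 0x80 <= seq[2] <= 0xBF)


def _is_trail_sequence(seq: bytes) -> bool:
    """True for a trail surrogate byte sequence (Table 1)."""
    return (len(seq) == 3 and seq[0] == 0xED
            and 0xB0 <= seq[1] <= 0xBF and 0x80 <= seq[2] <= 0xBF)


def _decode3(seq: bytes) -> int:
    """Decode one three-byte sequence to its code point."""
    return ((seq[0] & 0x0F) << 12) + ((seq[1] & 0x3F) << 6) + (seq[2] & 0x3F)


def concat_wtf8(left: bytes, right: bytes) -> bytes:
    """Concatenate two WTF-8 strings, merging a surrogate pair at the seam."""
    tail, head = left[-3:], right[:3]
    if _is_lead_sequence(tail) and _is_trail_sequence(head):
        lead, trail = _decode3(tail), _decode3(head)
        code_point = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
        supplementary = chr(code_point).encode("utf-8")
        return left[:-3] + supplementary + right[3:]
    return left + right


print(concat_wtf8(b"\xed\xa0\xbd", b"\xed\xb2\xa9").hex())
```

Checking only the last three bytes of the left string is enough for well-formed input, because 0xED is never a continuation byte.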

1343 |

7. Implementations

This section is non-normative.

Scheme 48 uses an encoding they call UTF-8of16 to encode filenames and other operating system strings on Windows. This encoding is identical to WTF-8.
Racket uses an encoding they call platform-UTF-8 to encode filenames on Windows. This encoding is identical to WTF-8.
wtf-8.js implements this specification in JavaScript.
rust-wtf8 implements this specification in Rust.
On Windows (which uses potentially ill-formed UTF-16 in its APIs), the Rust standard library uses WTF-8 internally for OS strings, but does not expose the WTF-8 byte sequences.

8. Acknowledgments

1358 |

Thanks to Coralie Mercier for coining the name WTF-8.


Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Index

1519 |

Terms defined by this specification

16-bit code unit, in §3
Basic Multilingual Plane, in §3
BMP code point, in §3
BMP scalar value, in §3
code point, in §3
concatenate two WTF-8 strings, in §6.5
consume, in §3
convert from potentially ill-formed UTF-16 to WTF-8, in §6.3
convert from UTF-8 to WTF-8, in §6.4
convert from WTF-8 to potentially ill-formed UTF-16, in §6.3
convert lossily from WTF-8 to UTF-8, in §6.4
convert strictly from WTF-8 to UTF-8, in §6.4
decode from potentially ill-formed UTF-16 to code points, in §4.2
decode from well-formed WTF-8 to code points, in §6.2
encode from code points to potentially ill-formed UTF-16, in §4.1
encode from code points to well-formed WTF-8, in §6.1
generalized UTF-8, in §5
high surrogate 16-bit code unit, in §3.2
high surrogate byte sequence, in §3.3
high surrogate code point, in §3.1
ill-formed, in §3
ill-formedness, in §3
lead surrogate 16-bit code unit, in §3.2
lead surrogate byte sequence, in §3.3
lead surrogate code point, in §3.1
low surrogate 16-bit code unit, in §3.2
low surrogate byte sequence, in §3.3
low surrogate code point, in §3.1
potentially ill-formed UTF-16, in §4
replacement character, in §3
supplementary code point, in §3
surrogate 16-bit code unit, in §3.2
surrogate 16-bit code unit pair, in §3.2
surrogate byte sequence, in §3.3
surrogate code point, in §3.1
surrogate code point pair, in §3.1
surrogate pair byte sequence, in §3.3
trail surrogate 16-bit code unit, in §3.2
trail surrogate byte sequence, in §3.3
trail surrogate code point, in §3.1
UCS-2, in §2
Unicode scalar value, in §3
Unicode text, in §3
unpaired surrogate 16-bit code unit, in §3.2
unpaired surrogate byte sequence, in §3.3
unpaired surrogate code point, in §3.1
UTF-16, in §3
UTF-8, in §3
well-formed, in §3
well-formedness, in §3
WTF-16, in §4
WTF-8, in §6

References

1575 |

Normative References

[RFC2119]: S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[UNICODE]: The Unicode Standard. URL: http://www.unicode.org/versions/latest/

Informative References

[CHARSETS]: Character sets. URL: https://www.iana.org/assignments/character-sets
[ENCODING]: Anne van Kesteren. Encoding Standard. Living Standard. URL: https://encoding.spec.whatwg.org/

--------------------------------------------------------------------------------
/index.src.html:
--------------------------------------------------------------------------------

The WTF-8 encoding

Shortname: wtf-8
Status: LS
Boilerplate: omit feedback-header
!Issue tracking: On GitHub
!Change history: On GitHub
!Last updated: [DATE]
Editor: Simon Sapin, Mozilla https://www.mozilla.org/, https://exyr.org/
Abstract: WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

Intended audience

Background and motivation

This section is non-normative.

When Unicode 1.0
was published in 1991,
it defined 65536 code points from U+0000 to U+FFFF
and assigned characters to around half of them.
Many software implementations chose the obvious memory representation
for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding.
To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters,
it became apparent that 65536 code points would not be sufficient.
Unicode was extended to 1114112 code points from U+0000 to U+10FFFF,
and the UTF-16 encoding was introduced.
This encoding preserves compatibility with existing 16-bit based systems
and represents new (supplementary) code points
as a pair of “surrogates”.

UTF-16 is designed to represent any Unicode text,
but it cannot represent a surrogate code point pair
since the corresponding surrogate 16-bit code unit pair
would instead represent a supplementary code point.
Therefore, the concept of Unicode scalar value was introduced
and Unicode text was restricted to not contain any surrogate code point.
(This was presumably deemed simpler than only restricting pairs.)

UTF-16 was redefined to be ill-formed
if it contains unpaired surrogate 16-bit code units.
UTF-8 was similarly redefined to be ill-formed
if it contains surrogate byte sequences.

Meanwhile, 16-bit based systems had little to no incentive
to do anything about surrogates:
for several years,
Unicode did not assign any character to supplementary code points,
and then (until emoji) only comparatively rare characters.
Additionally, the Unicode Standard does not require conforming implementations
to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved.
For example:

In ECMAScript (a.k.a. JavaScript),
a String value
is defined as a sequence of 16-bit integers
that usually represents UTF-16 text
but may or may not be well-formed.

Windows applications normally use UTF-16, but
the file system treats path and file names as an opaque sequence of WCHARs
(16-bit
code units).

Differences with CESU-8

Unicode defines a Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). WTF-8 is different from CESU-8.

CESU-8 encodes supplementary code points as surrogate pair byte sequences of six bytes, whereas WTF-8, like UTF-8, encodes them as sequences of four bytes. Therefore, CESU-8 is not a superset of UTF-8.

CESU-8 is also a mapping on UTF-16 code units. Therefore unpaired surrogate byte sequences are ill-formed in CESU-8, whereas supporting them is the entire point of WTF-8.

Terminology

In particular:

Unpaired surrogate 16-bit code units are ill-formed in UTF-16.
Surrogate byte sequences are ill-formed in UTF-8.
Surrogate pair byte sequences are ill-formed in WTF-8.

The following algorithm prints “1”, “2”, and “4”.

For every digit i in “1234”, run these substeps:

1. Print i.
2. If i is 2, consume the next digit.
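As a rough illustration of what “consume” means in the example above, here is a minimal Python sketch. The function name `print_with_consume` is hypothetical, and the sketch collects the digits in a list instead of printing them so the result is easy to check.

```python
def print_with_consume(digits):
    """Visit each digit in order, skipping the digit that follows a "2"."""
    printed = []
    it = iter(digits)
    for i in it:
        printed.append(i)  # stands in for "Print i"
        if i == "2":
            # "Consume" the next digit: advance the shared iterator
            # so the loop never visits that digit.
            next(it, None)
    return printed


print(print_with_consume("1234"))
```

Consuming simply moves the iteration position forward, so the consumed item is never processed by a later iteration.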

Surrogate code points

A lead surrogate code point or high surrogate code point is a code point in the range from U+D800 to U+DBFF.

A trail surrogate code point or low surrogate code point is a code point in the range from U+DC00 to U+DFFF.

A surrogate code point is either a lead surrogate code point or a trail surrogate code point. That is, a code point in the range from U+D800 to U+DFFF.

A surrogate code point pair is a sequence of a lead surrogate code point followed by a trail surrogate code point.

An unpaired surrogate code point is a surrogate code point that is not part of a surrogate code point pair.

Surrogate 16-bit code units

Surrogate byte sequences

Note: A surrogate byte sequence (and therefore any byte sequence described in this section) is ill-formed in UTF-8. Decoders are required to treat it as an error.

A lead surrogate byte sequence or high surrogate byte sequence is a sequence of three bytes that represents a lead surrogate code point in generalized UTF-8.

A trail surrogate byte sequence or low surrogate byte sequence is a sequence of three bytes that represents a trail surrogate code point in generalized UTF-8.

A surrogate byte sequence is either a lead surrogate byte sequence or a trail surrogate byte sequence. That is, a sequence of three bytes that represents a surrogate code point in generalized UTF-8.

Table 1. Surrogate byte sequences

Bytes noted in hexadecimal.

| | First byte | Second byte | Third byte |
|---|---|---|---|
| Lead surrogate byte sequence | ED | A0 to AF | 80 to BF |
| Trail surrogate byte sequence | ED | B0 to BF | 80 to BF |
| Surrogate byte sequence | ED | A0 to BF | 80 to BF |

A surrogate pair byte sequence is a sequence of six bytes composed of a lead surrogate byte sequence followed by a trail surrogate byte sequence.

An unpaired surrogate byte sequence is a surrogate byte sequence that is not part of a surrogate pair byte sequence.
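The byte ranges of Table 1 can be checked directly. The following is a minimal sketch, not part of the specification; the function name `classify_surrogate_bytes` and its string return values are assumptions made for illustration.

```python
def classify_surrogate_bytes(seq):
    """Classify a three-byte sequence using the ranges of Table 1.

    Returns "lead", "trail", or None if seq is not a surrogate byte sequence.
    """
    if len(seq) != 3 or seq[0] != 0xED or not 0x80 <= seq[2] <= 0xBF:
        return None
    if 0xA0 <= seq[1] <= 0xAF:
        return "lead"   # encodes U+D800 to U+DBFF
    if 0xB0 <= seq[1] <= 0xBF:
        return "trail"  # encodes U+DC00 to U+DFFF
    return None         # ED 80..9F xx encodes U+D000 to U+D7FF


print(classify_surrogate_bytes(b"\xed\xa0\x80"))
```

Because every surrogate byte sequence starts with ED, the second byte alone distinguishes lead from trail.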

Potentially ill-formed UTF-16

A sequence of 16-bit code units is potentially ill-formed UTF-16 if it is intended to be interpreted as UTF-16, but is not necessarily well-formed in UTF-16. It effectively encodes a sequence of code points that does not contain any surrogate code point pair.

Note: Like UTF-16, potentially ill-formed UTF-16 cannot represent a surrogate code point pair, since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Unlike well-formed UTF-16, it might contain isolated surrogate code points.

Any sequence of 16-bit code units has an interpretation as potentially ill-formed UTF-16.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.

Encoding

To encode from code points to potentially ill-formed UTF-16, run these steps:

1. Let result be a sequence of 16-bit code units, initially empty.
2. For every code point P of the input, run these substeps:
   1. If P is a supplementary code point, append to result two 16-bit code units of values:
      1. ((P - 0x10000) >> 10) + 0xD800
      2. ((P - 0x10000) & 0x3FF) + 0xDC00
   2. Otherwise (P is a BMP code point), append to result a 16-bit code unit of value P.
3. Return result.

Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.
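The encoding steps above can be sketched in Python as follows. This is an illustrative sketch only: the function name `encode_utf16_units` is hypothetical, and code points and 16-bit code units are both modelled as plain integers.

```python
def encode_utf16_units(code_points):
    """Encode code points (ints) to potentially ill-formed UTF-16 code units (ints)."""
    result = []
    for p in code_points:
        if p >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            offset = p - 0x10000
            result.append((offset >> 10) + 0xD800)
            result.append((offset & 0x3FF) + 0xDC00)
        else:
            # BMP code point, including any surrogate code point: copy as is.
            result.append(p)
    return result


print([hex(u) for u in encode_utf16_units([0x1F4A9])])
```

A lone surrogate code point such as U+D800 passes through unchanged, which is what makes the result only potentially well-formed.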

Decoding

To decode from potentially ill-formed UTF-16 to code points, run these steps:

1. Let result be a sequence of code points, initially empty.
2. For every 16-bit code unit U of the input, run these substeps:
   1. If U is a lead surrogate 16-bit code unit, U is not the last 16-bit code unit of the input, and the next 16-bit code unit of the input next is a trail surrogate 16-bit code unit, then consume next and append to result a code point of value 0x10000 + ((U - 0xD800) << 10) + (next - 0xDC00).
   2. Otherwise, append to result a code point of value U.
3. Return result.

Note: By construction, the resulting sequence does not contain a surrogate code point pair.

Note: If the input is well-formed in UTF-16, this is identical to decoding UTF-16 and the resulting sequence is Unicode text.
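The decoding steps above might be sketched like this in Python. The function name `decode_utf16_units` is hypothetical, and code units are modelled as a list of integers.

```python
def decode_utf16_units(units):
    """Decode potentially ill-formed UTF-16 code units (ints) to code points (ints)."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        if (
            0xD800 <= u <= 0xDBFF                 # lead surrogate
            and i + 1 < len(units)                # not the last code unit
            and 0xDC00 <= units[i + 1] <= 0xDFFF  # followed by a trail surrogate
        ):
            result.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2  # the trail surrogate has been consumed
        else:
            # BMP code unit or unpaired surrogate: keep its value.
            result.append(u)
            i += 1
    return result


print([hex(p) for p in decode_utf16_units([0xD83D, 0xDCA9, 0xD83D])])
```

Every adjacent lead and trail pair is merged, so the output never contains a surrogate code point pair, as the first note states.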

Generalized UTF-8

For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (like UTF-8 is a strict superset of ASCII).

Each code point is encoded as a sequence of one to four bytes:

Table 2. Bit distribution

Bytes noted in binary, most significant bit first. `x` bits represent the least significant bits of the code points.

| Code point | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| U+0000 to U+007F | 0xxxxxxx | | | |
| U+0080 to U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 to U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 to U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

A byte sequence is well-formed in generalized UTF-8 if and only if:

It is the empty string, or
It is composed of a byte sequence that is well-formed in generalized UTF-8 followed by a byte sequence in the following table.

Table 3. Well-formed byte sequences representing a single code point

Bytes noted in hexadecimal.

| Code point | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| U+0000 to U+007F | 00 to 7F | | | |
| U+0080 to U+07FF | C2 to DF | 80 to BF | | |
| U+0800 to U+0FFF | E0 | A0 to BF | 80 to BF | |
| U+1000 to U+FFFF | E1 to EF | 80 to BF | 80 to BF | |
| U+10000 to U+3FFFF | F0 | 90 to BF | 80 to BF | 80 to BF |
| U+40000 to U+FFFFF | F1 to F3 | 80 to BF | 80 to BF | 80 to BF |
| U+100000 to U+10FFFF | F4 | 80 to 8F | 80 to BF | 80 to BF |
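The recursive definition of well-formedness amounts to matching the input against rows of Table 3 from left to right. A minimal sketch under that reading follows; the names `TABLE_3` and `is_generalized_utf8` are hypothetical.

```python
# Rows of Table 3: (first byte range, ranges for each following byte).
TABLE_3 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_generalized_utf8(data):
    """Return True if data is well-formed in generalized UTF-8."""
    i = 0
    while i < len(data):
        for (first_lo, first_hi), rest in TABLE_3:
            if first_lo <= data[i] <= first_hi:
                break
        else:
            return False  # no row accepts this first byte
        tail = data[i + 1 : i + 1 + len(rest)]
        if len(tail) != len(rest):
            return False  # input ends in the middle of a sequence
        if any(not lo <= b <= hi for b, (lo, hi) in zip(tail, rest)):
            return False  # a following byte is out of range
        i += 1 + len(rest)
    return True


print(is_generalized_utf8(b"\xed\xa0\x80"))
```

Note that this check accepts surrogate byte sequences, including surrogate pair byte sequences. It tests generalized UTF-8 only, not UTF-8 or WTF-8.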

The WTF-8 encoding

Encoding

To encode from code points to well-formed WTF-8, run these steps:

1. Let result be a sequence of bytes, initially empty.
2. For every code point P of the input, run these substeps:
   1. If P is a lead surrogate code point, P is not the last code point of the input, and the next code point next is a trail surrogate code point, consume next and set P’s value to:

      0x10000 + ((P - 0xD800) << 10) + (next - 0xDC00)

   2. Depending on P:

      U+0000 to U+007F:
      Append to result one byte of value P.

      U+0080 to U+07FF:
      Append to result two bytes of values:
      1. 0xC0 | (P >> 6)
      2. 0x80 | (P & 0x3F)

      U+0800 to U+FFFF:
      Append to result three bytes of values:
      1. 0xE0 | (P >> 12)
      2. 0x80 | ((P >> 6) & 0x3F)
      3. 0x80 | (P & 0x3F)

      U+10000 to U+10FFFF:
      Append to result four bytes of values:
      1. 0xF0 | (P >> 18)
      2. 0x80 | ((P >> 12) & 0x3F)
      3. 0x80 | ((P >> 6) & 0x3F)
      4. 0x80 | (P & 0x3F)
3. Return result.
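A possible Python rendering of the encoding steps above is shown below. It is a sketch rather than a reference implementation: the function name `encode_wtf8` is hypothetical, and the input is assumed to be a list of integers in the range 0 to 0x10FFFF.

```python
def encode_wtf8(code_points):
    """Encode a list of code points (ints) to well-formed WTF-8 bytes."""
    result = bytearray()
    i = 0
    while i < len(code_points):
        p = code_points[i]
        i += 1
        if (
            0xD800 <= p <= 0xDBFF                   # lead surrogate
            and i < len(code_points)                # not the last code point
            and 0xDC00 <= code_points[i] <= 0xDFFF  # followed by a trail surrogate
        ):
            # Merge the pair into one supplementary code point and consume the trail.
            p = 0x10000 + ((p - 0xD800) << 10) + (code_points[i] - 0xDC00)
            i += 1
        if p <= 0x7F:
            result.append(p)
        elif p <= 0x7FF:
            result += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p <= 0xFFFF:
            result += bytes([0xE0 | (p >> 12), 0x80 | ((p >> 6) & 0x3F), 0x80 | (p & 0x3F)])
        else:
            result += bytes([
                0xF0 | (p >> 18),
                0x80 | ((p >> 12) & 0x3F),
                0x80 | ((p >> 6) & 0x3F),
                0x80 | (p & 0x3F),
            ])
    return bytes(result)


print(encode_wtf8([0xD83D, 0xDCA9]).hex())
```

Merging pairs in the first substep is what keeps surrogate pair byte sequences out of the output. An unpaired surrogate falls through to the three-byte case.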

Decoding

1. Let result be a sequence of code points, initially empty.
2. For every byte B of the input, depending on B:

   0x00 to 0x7F:
   Append to result a code point of value B.

   0xC2 to 0xDF:
   Let B2 be the next byte and consume it. Append to result a code point of value

   ((B & 0x1F) << 6) + (B2 & 0x3F)

   0xE0 to 0xEF:
   Let B2 and B3 be the next two bytes, and consume them. Append to result a code point of value

   ((B & 0x0F) << 12) + ((B2 & 0x3F) << 6) + (B3 & 0x3F)

   0xF0 to 0xF4:
   Let B2, B3, and B4 be the next three bytes, and consume them. Append to result a code point of value

   ((B & 0x07) << 18) + ((B2 & 0x3F) << 12) + ((B3 & 0x3F) << 6) + (B4 & 0x3F)
3. Return result.

Note: If the input is also well-formed in UTF-8, this is identical to decoding UTF-8 and the resulting sequence is Unicode text.
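The decoding steps above could be sketched as follows. The function name `decode_wtf8` is hypothetical. The sketch assumes its input is already well-formed WTF-8, so it does not validate continuation bytes, and raising `ValueError` for an unexpected first byte is a choice made here for illustration.

```python
def decode_wtf8(data):
    """Decode well-formed WTF-8 bytes to a list of code points (ints)."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            result.append(b)
            i += 1
        elif 0xC2 <= b <= 0xDF:
            result.append(((b & 0x1F) << 6) + (data[i + 1] & 0x3F))
            i += 2
        elif 0xE0 <= b <= 0xEF:
            result.append(
                ((b & 0x0F) << 12) + ((data[i + 1] & 0x3F) << 6) + (data[i + 2] & 0x3F)
            )
            i += 3
        elif 0xF0 <= b <= 0xF4:
            result.append(
                ((b & 0x07) << 18)
                + ((data[i + 1] & 0x3F) << 12)
                + ((data[i + 2] & 0x3F) << 6)
                + (data[i + 3] & 0x3F)
            )
            i += 4
        else:
            # Cannot happen for well-formed input.
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
    return result


print([hex(p) for p in decode_wtf8(b"A\xed\xa0\x80")])
```

The first byte alone determines how many bytes are consumed, which is why the steps only branch on B.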

Converting between WTF-8 and potentially ill-formed UTF-16

To convert from potentially ill-formed UTF-16 to WTF-8, run these steps:

1. Decode from potentially ill-formed UTF-16 to code points.
2. Encode from code points to well-formed WTF-8.

Note: This conversion never fails and is lossless.

To convert from WTF-8 to potentially ill-formed UTF-16, run these steps:

1. Decode from well-formed WTF-8 to code points.
2. Encode from code points to potentially ill-formed UTF-16.

Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.
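For experimentation, both conversions can be approximated with Python's built-in codecs and the `surrogatepass` error handler, which lets surrogate code points through the UTF-8 and UTF-16 codecs. This is a sketch, not something the specification prescribes: the function names are hypothetical, UTF-16 input is assumed to be little-endian bytes of even length, and the WTF-8 input is assumed to be well-formed.

```python
def utf16le_to_wtf8(data: bytes) -> bytes:
    """Convert potentially ill-formed UTF-16 (little-endian bytes) to WTF-8."""
    # Decoding merges valid surrogate pairs and keeps unpaired surrogates
    # as single code points in the intermediate str.
    text = data.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf16le(data: bytes) -> bytes:
    """Convert well-formed WTF-8 to potentially ill-formed UTF-16 (little-endian)."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


lone_lead = b"\x3d\xd8"  # the single code unit 0xD83D
print(utf16le_to_wtf8(lone_lead).hex())
```

The intermediate `str` plays the role of the code point sequence in the two-step conversions above.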

Converting between WTF-8 and UTF-8

1. If the input contains a surrogate byte sequence, return failure.
2. Otherwise, return the input unchanged.

Note: This conversion is lossless when it succeeds, but it can fail.
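The two steps above, which appear to describe the strict conversion from WTF-8 to UTF-8, can be sketched as below. The function name `wtf8_to_utf8` is hypothetical, returning `None` to signal failure is a choice made here, and the input is assumed to be well-formed WTF-8.

```python
def wtf8_to_utf8(data):
    """Return data unchanged if it contains no surrogate byte sequence, else None."""
    # In well-formed WTF-8 the byte ED only ever starts a three-byte sequence,
    # and a second byte in A0..BF means that sequence encodes a surrogate.
    for i in range(len(data) - 1):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return None  # failure
    return data


print(wtf8_to_utf8(b"caf\xc3\xa9"))
```

No bytes are rewritten on success because well-formed WTF-8 without surrogate byte sequences is already well-formed UTF-8.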

Concatenating WTF-8 strings

Concatenating WTF-8 strings requires extra care to preserve well-formedness.

To concatenate two WTF-8 strings, run these steps:

1. If the left input string ends with a lead surrogate byte sequence and the right input string starts with a trail surrogate byte sequence, run these substeps:
   1. Let lead and trail be two code points, the respective results of decoding from WTF-8 these two surrogate byte sequences.
   2. Let supplementary be the encoding to WTF-8 of a single code point of value 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00).
   3. Let left be the substring of the left input string that removes the three final bytes.
   4. Let right be the substring of the right input string that removes the three initial bytes.
   5. Return the concatenation of left, supplementary, and right.
2. Otherwise, return the concatenation of the two input byte sequences.

Note: This is equivalent to converting both strings to potentially ill-formed UTF-16, concatenating the resulting 16-bit code unit sequences, then converting the concatenation back to WTF-8.
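The concatenation steps above might look like this in Python. This is a sketch under the assumption that both inputs are well-formed WTF-8; the names `concat_wtf8`, `_is_lead`, `_is_trail`, and `_decode3` are hypothetical helpers.

```python
def _is_lead(seq):
    return len(seq) == 3 and seq[0] == 0xED and 0xA0 <= seq[1] <= 0xAF


def _is_trail(seq):
    return len(seq) == 3 and seq[0] == 0xED and 0xB0 <= seq[1] <= 0xBF


def _decode3(seq):
    """Decode one three-byte sequence to its code point."""
    return ((seq[0] & 0x0F) << 12) + ((seq[1] & 0x3F) << 6) + (seq[2] & 0x3F)


def concat_wtf8(left, right):
    """Concatenate two well-formed WTF-8 byte strings, preserving well-formedness."""
    tail, head = left[-3:], right[:3]
    if _is_lead(tail) and _is_trail(head):
        lead, trail = _decode3(tail), _decode3(head)
        p = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
        supplementary = bytes([
            0xF0 | (p >> 18),
            0x80 | ((p >> 12) & 0x3F),
            0x80 | ((p >> 6) & 0x3F),
            0x80 | (p & 0x3F),
        ])
        # Drop the two three-byte surrogate sequences and splice in four bytes.
        return left[:-3] + supplementary + right[3:]
    return left + right


print(concat_wtf8(b"a\xed\xa0\xbd", b"\xed\xb2\xa9b").hex())
```

Looking only at the last three bytes of the left string is safe for well-formed input, because the byte ED can only appear as the first byte of a three-byte sequence.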

Implementations

This section is non-normative.

Scheme 48 uses an encoding they call UTF-8of16 to encode filenames and other operating system strings on Windows. This encoding is identical to WTF-8.
Racket uses an encoding they call platform-UTF-8 to encode filenames on Windows. This encoding is identical to WTF-8.
wtf-8.js implements this specification in JavaScript.
rust-wtf8 implements this specification in Rust.
On Windows (which uses potentially ill-formed UTF-16 in its APIs), the Rust standard library uses WTF-8 internally for OS strings, but does not expose the WTF-8 byte sequences.

Acknowledgments

Thanks to Coralie Mercier for coining the name WTF-8.

Thanks for feedback and contributions from Anne van Kesteren, David Baron, Dylan Petonke, Guillaume Knispel, Henri Sivonen, Jacob Lifshay, James Graham, Lily Ballard, Mathias Bynens, Ms2ger, Sam Tobin-Hochstadt, and Tab Atkins.

--------------------------------------------------------------------------------
/logo-wtf-8.svg:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/zero.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SimonSapin/wtf-8/cd80a83d87385ce5efb08554e86dce114fde5026/zero.png
--------------------------------------------------------------------------------


