├── Makefile
├── README.md
├── copyright.include
├── header.include
├── index.html
├── index.src.html
├── logo-wtf-8.svg
└── zero.png

/Makefile:
--------------------------------------------------------------------------------
# https://github.com/tabatkins/bikeshed
index.html: index.src.html Makefile *.include
	bikeshed spec $< $@

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
The WTF-8 encoding specification
--------------------------------

[Latest version](https://simonsapin.github.io/wtf-8/)

--------------------------------------------------------------------------------
/copyright.include:
--------------------------------------------------------------------------------
CC0
To the extent possible under law, the editors have waived all copyright
and related or neighboring rights to this work.

--------------------------------------------------------------------------------
/header.include:
--------------------------------------------------------------------------------
[TITLE]

--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
The WTF-8 encoding

Editor: Simon Sapin (Mozilla)
Issue tracking: On GitHub
Change history: On GitHub
Last updated: 23 February 2022

Abstract


WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.


Table of Contents

  1. 1 Intended audience
  2. 2 Background and motivation
    1. 2.1 Differences with CESU-8
  3. 3 Terminology
    1. 3.1 Surrogate code points
    2. 3.2 Surrogate 16-bit code units
    3. 3.3 Surrogate byte sequences
  4. 4 Potentially ill-formed UTF-16
    1. 4.1 Encoding
    2. 4.2 Decoding
  5. 5 Generalized UTF-8
  6. 6 The WTF-8 encoding
    1. 6.1 Encoding
    2. 6.2 Decoding
    3. 6.3 Converting between WTF-8 and potentially ill-formed UTF-16
    4. 6.4 Converting between WTF-8 and UTF-8
    5. 6.5 Concatenating WTF-8 strings
  7. 7 Implementations
  8. 8 Acknowledgments
  9. Conformance
  10. Index
    1. Terms defined by this specification
  11. References
    1. Normative References
    2. Informative References

1. Intended audience

WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

In particular, the Encoding Standard [ENCODING] defines UTF-8 and other encodings for the Web. There is not, and never will be, any encoding label [ENCODING] or IANA charset alias [CHARSETS] for WTF-8.

2. Background and motivation

This section is non-normative.

When Unicode 1.0 was published in 1991, it defined 65536 code points from U+0000 to U+FFFF and assigned characters to around half of them. Many software implementations chose the obvious memory representation for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding. To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters, it became apparent that 65536 code points would not be sufficient. Unicode was extended to 1114112 code points from U+0000 to U+10FFFF, and the UTF-16 encoding was introduced. This encoding preserves compatibility with existing 16-bit based systems and represents new (supplementary) code points as a pair of “surrogates”.

UTF-16 is designed to represent any Unicode text, but it cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units. UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Meanwhile, 16-bit based systems had little to no incentive to do anything about surrogates: for several years, Unicode did not assign any character to supplementary code points, and then (until emoji) only comparatively rare characters. Additionally, the Unicode Standard does not require conforming implementations to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved. For example, JavaScript and Windows use UTF-16 internally but do not enforce the invariant that surrogates must be paired.

We say that strings in these systems are encoded in potentially ill-formed UTF-16 or WTF-16.

Unpaired surrogate 16-bit code units are the only case where an arbitrary sequence of 16-bit code units is ill-formed in UTF-16. UTF-8, however, is more complex and maintaining its well-formedness is arguably more valuable.

This specification defines WTF-8, a superset of UTF-8 that can losslessly represent arbitrary sequences of 16-bit code units (even if ill-formed in UTF-16) but preserves the other well-formedness constraints of UTF-8.

2.1. Differences with CESU-8

Unicode defines a Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). WTF-8 is different from CESU-8.

CESU-8 encodes supplementary code points as surrogate pair byte sequences of six bytes, whereas WTF-8, like UTF-8, encodes them as sequences of four bytes. Therefore, CESU-8 is not a superset of UTF-8.

CESU-8 is also a mapping on UTF-16 code units. Therefore unpaired surrogate byte sequences are ill-formed in CESU-8, whereas supporting them is the entire point of WTF-8.
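The size difference can be illustrated with a short Python sketch. The helper name `cesu8_encode` is hypothetical and not part of this specification or of any library; it assumes well-formed input text and relies on Python's `surrogatepass` error handler to write each UTF-16 code unit as a three-byte sequence.

```python
import struct


def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder for well-formed text (hypothetical helper)."""
    raw = text.encode("utf-16-le")  # raises if text holds a lone surrogate
    units = struct.unpack(f"<{len(raw) // 2}H", raw)
    # Encode every 16-bit code unit on its own, so a surrogate pair
    # becomes two three-byte sequences (six bytes in total).
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


pile = "\U0001F4A9"                    # a supplementary code point
print(pile.encode("utf-8").hex(" "))   # four bytes in UTF-8 and WTF-8
print(cesu8_encode(pile).hex(" "))     # six bytes in CESU-8
```

For code points in the Basic Multilingual Plane the two encodings produce the same bytes; they only diverge for supplementary code points.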

3. Terminology

These definitions correspond to those of the Glossary of Unicode Terms. [UNICODE]

A Unicode code point is any value in the Unicode codespace; that is, the range of integers from 0 to 1114111. It is noted with a “U+” prefix and four to six hexadecimal digits: the first and last code points are U+0000 and U+10FFFF.

The Basic Multilingual Plane is the range of code points from U+0000 to U+FFFF.

A BMP code point is a code point in the Basic Multilingual Plane.

A supplementary code point is a code point not in the Basic Multilingual Plane. That is, a code point in the range from U+10000 to U+10FFFF.

A Unicode scalar value is a code point that is not a surrogate code point. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+10FFFF.

A BMP scalar value is a Unicode scalar value in the Basic Multilingual Plane. That is, a code point in the range from U+0000 to U+D7FF, or in the range U+E000 to U+FFFF.
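These ranges translate directly into predicates. The following Python sketch is only an illustration; the function names are made up for this example and do not come from the specification.

```python
def is_surrogate(cp: int) -> bool:
    """Surrogate code point: U+D800 to U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_bmp(cp: int) -> bool:
    """BMP code point: U+0000 to U+FFFF."""
    return 0x0000 <= cp <= 0xFFFF


def is_supplementary(cp: int) -> bool:
    """Supplementary code point: U+10000 to U+10FFFF."""
    return 0x10000 <= cp <= 0x10FFFF


def is_scalar_value(cp: int) -> bool:
    """Unicode scalar value: any code point that is not a surrogate."""
    return 0x0000 <= cp <= 0x10FFFF and not is_surrogate(cp)


def format_code_point(cp: int) -> str:
    """"U+" prefix followed by four to six hexadecimal digits."""
    return f"U+{cp:04X}"


print(format_code_point(0x10FFFF), is_scalar_value(0xD800))
```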

Unicode text is a sequence of Unicode scalar values.

UTF-8 is an encoding of Unicode text using 8-bit bytes. Each Unicode scalar value is represented as a sequence of one to four bytes.

UTF-16 is an encoding of Unicode text using 16-bit code units. BMP scalar values are represented as a single 16-bit code unit with the same value. Supplementary code points are represented as a surrogate 16-bit code unit pair.

Note: this specification is only concerned with the UTF-16 encoding form (based on 16-bit code units), and not with the encoding scheme (based on bytes, with UTF-16BE and UTF-16LE variants).

A string is well-formed (not ill-formed) in a given encoding if it follows the specification of that encoding. [UNICODE] defines Well-Formed Code Unit Sequence for UTF-8 and UTF-16.

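In particular, UTF-16 is ill-formed if it contains unpaired surrogate 16-bit code units, and UTF-8 is ill-formed if it contains surrogate byte sequences.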

The replacement character is the code point U+FFFD REPLACEMENT CHARACTER (�). It is used as a substitute to replace ill-formed sub-sequences during a conversion.

A 16-bit code unit is a 16-bit integer used in UTF-16. It is noted with a “0x” prefix and four hexadecimal digits: the first and last 16-bit code units are 0x0000 and 0xFFFF.

Note: The byte serialization or memory representation of a 16-bit code unit (little-endian or big-endian) is out of scope for this specification.

When an algorithm iterates over a sequence (“For every i in …”), consuming the next item means advancing in the sequence such that that item will be skipped during the following iteration of the loop: the item after the next becomes the next item.
943 | 944 |

The following algorithm prints “1”, “2”, and “4”.

945 |

For every digit i in “1234”, run these substeps:

946 |
    947 |
  1. Print i 948 |
  2. If i is 2, consume the next digit. 949 |
950 |
951 |
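In a language with explicit iterators, consuming the next item maps onto advancing the iterator inside the loop body. A minimal Python sketch of the example above (the function name is illustrative only):

```python
def printed_digits(digits: str) -> list:
    """Return the items the example algorithm would print."""
    printed = []
    it = iter(digits)
    for d in it:
        printed.append(d)      # "Print i"
        if d == "2":
            next(it, None)     # consume the next digit, if there is one
    return printed


print(printed_digits("1234"))  # ['1', '2', '4']
```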

3.1. Surrogate code points

A lead surrogate code point or high surrogate code point is a code point in the range from U+D800 to U+DBFF.

A trail surrogate code point or low surrogate code point is a code point in the range from U+DC00 to U+DFFF.

A surrogate code point is either a lead surrogate code point or a trail surrogate code point. That is, a code point in the range from U+D800 to U+DFFF.

A surrogate code point pair is a sequence of a lead surrogate code point followed by a trail surrogate code point.

An unpaired surrogate code point is a surrogate code point that is not part of a surrogate code point pair.

3.2. Surrogate 16-bit code units

A lead surrogate 16-bit code unit or high surrogate 16-bit code unit is a 16-bit code unit in the range from 0xD800 to 0xDBFF.

A trail surrogate 16-bit code unit or low surrogate 16-bit code unit is a 16-bit code unit in the range from 0xDC00 to 0xDFFF.

A surrogate 16-bit code unit is either a lead surrogate 16-bit code unit or a trail surrogate 16-bit code unit. That is, a 16-bit code unit in the range from 0xD800 to 0xDFFF.

A surrogate 16-bit code unit pair is a sequence of a lead surrogate 16-bit code unit followed by a trail surrogate 16-bit code unit. In UTF-16, it represents a supplementary code point.

An unpaired surrogate 16-bit code unit is a surrogate 16-bit code unit that is not part of a surrogate 16-bit code unit pair.

3.3. Surrogate byte sequences

Note: A surrogate byte sequence (and therefore any byte sequence described in this section) is ill-formed in UTF-8. Decoders are required to treat it as an error.

A lead surrogate byte sequence or high surrogate byte sequence is a sequence of three bytes that represents a lead surrogate code point in generalized UTF-8.

A trail surrogate byte sequence or low surrogate byte sequence is a sequence of three bytes that represents a trail surrogate code point in generalized UTF-8.

A surrogate byte sequence is either a lead surrogate byte sequence or a trail surrogate byte sequence. That is, a sequence of three bytes that represents a surrogate code point in generalized UTF-8.

Table 1. Surrogate byte sequences

Bytes noted in hexadecimal.

                              | First byte | Second byte | Third byte
Lead surrogate byte sequence  | ED         | A0 to AF    | 80 to BF
Trail surrogate byte sequence | ED         | B0 to BF    | 80 to BF
Surrogate byte sequence       | ED         | A0 to BF    | 80 to BF

A surrogate pair byte sequence is a sequence of six bytes composed of a lead surrogate byte sequence followed by a trail surrogate byte sequence.

An unpaired surrogate byte sequence is a surrogate byte sequence that is not part of a surrogate pair byte sequence.
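Table 1 can be read as a three-byte pattern match. The following Python sketch classifies a byte sequence against that table; `classify_surrogate_bytes` is a hypothetical name chosen for this illustration.

```python
from typing import Optional


def classify_surrogate_bytes(seq: bytes) -> Optional[str]:
    """Return "lead", "trail", or None, following the rows of Table 1."""
    if len(seq) != 3 or seq[0] != 0xED or not 0x80 <= seq[2] <= 0xBF:
        return None
    if 0xA0 <= seq[1] <= 0xAF:
        return "lead"
    if 0xB0 <= seq[1] <= 0xBF:
        return "trail"
    return None  # ED 80..9F xx encodes U+D000 to U+D7FF, not a surrogate


print(classify_surrogate_bytes(b"\xed\xa0\x80"))  # lead
print(classify_surrogate_bytes(b"\xed\xbf\xbf"))  # trail
```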

4. Potentially ill-formed UTF-16

A sequence of 16-bit code units is potentially ill-formed UTF-16 if it is intended to be interpreted as UTF-16, but is not necessarily well-formed in UTF-16. It effectively encodes a sequence of code points that does not contain any surrogate code point pair.

Note: Like UTF-16, potentially ill-formed UTF-16 cannot represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pair would instead represent a supplementary code point. Unlike well-formed UTF-16, it might contain isolated surrogate code points.

Any sequence of 16-bit code units has an interpretation as potentially ill-formed UTF-16.

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, especially in the context of systems that were originally designed for UCS-2 and later upgraded to UTF-16 but never enforced well-formedness, either by neglect or because of backward-compatibility constraints.

4.1. Encoding

To encode from code points to potentially ill-formed UTF-16, run these steps:

  1. Let result be a sequence of 16-bit code units, initially empty.
  2. For every code point P of the input, run these substeps:
    1. If P is a supplementary code point, append to result two 16-bit code units of values:
      1. ((P - 0x10000) >> 10) + 0xD800
      2. ((P - 0x10000) & 0x3FF) + 0xDC00
    2. Otherwise (P is a BMP code point), append to result a 16-bit code unit of value P.
  3. Return result.

Note: If the input is restricted to Unicode text, this is identical to encoding to UTF-16 and the resulting sequence is well-formed in UTF-16.

If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

This situation should be considered an error, but this specification does not define how to handle it. Possibilities include aborting the conversion, or replacing one of the surrogate code points of the pair with a replacement character.
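The encoding steps above can be sketched in Python as follows. Code points and 16-bit code units are modelled as plain integers; the function name is an assumption made for this illustration, and the sketch does not detect the surrogate code point pair error case described above.

```python
from typing import List


def encode_potentially_ill_formed_utf16(code_points: List[int]) -> List[int]:
    """Encode code points to a list of 16-bit code units."""
    result = []
    for p in code_points:
        if p >= 0x10000:
            # Supplementary code point: emit a surrogate code unit pair.
            result.append(((p - 0x10000) >> 10) + 0xD800)
            result.append(((p - 0x10000) & 0x3FF) + 0xDC00)
        else:
            # BMP code point (possibly a surrogate): emit it unchanged.
            result.append(p)
    return result


units = encode_potentially_ill_formed_utf16([0x41, 0x1F4A9, 0xD800])
print([f"0x{u:04X}" for u in units])
```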

4.2. Decoding

To decode from potentially ill-formed UTF-16 to code points, run these steps:

  1. Let result be a sequence of code points, initially empty.
  2. For every 16-bit code unit U of the input, run these substeps:
    1. If U is a lead surrogate 16-bit code unit, U is not the last 16-bit code unit of the input, and the next 16-bit code unit of the input next is a trail surrogate 16-bit code unit, then consume next and append to result a code point of value 0x10000 + ((U - 0xD800) << 10) + (next - 0xDC00).
    2. Otherwise, append to result a code point of value U.
  3. Return result.

Note: By construction, the resulting sequence does not contain a surrogate code point pair.

Note: If the input is well-formed in UTF-16, this is identical to decoding UTF-16 and the resulting sequence is Unicode text.
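A minimal Python sketch of these decoding steps, again with integers standing in for code units and code points (the function name is illustrative, not defined by this specification):

```python
from typing import List


def decode_potentially_ill_formed_utf16(units: List[int]) -> List[int]:
    """Decode 16-bit code units to code points, keeping unpaired surrogates."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            nxt = units[i + 1]
            result.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2  # the trail surrogate has been consumed
        else:
            result.append(u)
            i += 1
    return result


print([hex(p) for p in decode_potentially_ill_formed_utf16([0xD83D, 0xDCA9, 0xD83D])])
```

An index-based loop is used here instead of an iterator because the algorithm needs to look at the next code unit before deciding whether to consume it.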

5. Generalized UTF-8

For the purpose of this specification, generalized UTF-8 is an encoding of sequences of code points (not restricted to Unicode scalar values) using 8-bit bytes, based on the same underlying algorithm as UTF-8. It is a strict superset of UTF-8 (just as UTF-8 is a strict superset of ASCII).

Each code point is encoded as a sequence of one to four bytes:

Table 2. Bit distribution

Bytes noted in binary, most significant bit first. x bits represent the least significant bits of the code points.

Code point          | First byte | Second byte | Third byte | Fourth byte
U+0000 to U+007F    | 0xxxxxxx   |             |            |
U+0080 to U+07FF    | 110xxxxx   | 10xxxxxx    |            |
U+0800 to U+FFFF    | 1110xxxx   | 10xxxxxx    | 10xxxxxx   |
U+10000 to U+10FFFF | 11110xxx   | 10xxxxxx    | 10xxxxxx   | 10xxxxxx

A byte sequence is well-formed in generalized UTF-8 if and only if it is a concatenation of zero or more byte sequences that each match a row of Table 3:
Table 3. Well-formed byte sequences representing a single code point

Bytes noted in hexadecimal.

Code point           | First byte | Second byte | Third byte | Fourth byte
U+0000 to U+007F     | 00 to 7F   |             |            |
U+0080 to U+07FF     | C2 to DF   | 80 to BF    |            |
U+0800 to U+0FFF     | E0         | A0 to BF    | 80 to BF   |
U+1000 to U+FFFF     | E1 to EF   | 80 to BF    | 80 to BF   |
U+10000 to U+3FFFF   | F0         | 90 to BF    | 80 to BF   | 80 to BF
U+40000 to U+FFFFF   | F1 to F3   | 80 to BF    | 80 to BF   | 80 to BF
U+100000 to U+10FFFF | F4         | 80 to 8F    | 80 to BF   | 80 to BF
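Table 3 lends itself to a table-driven check. The Python sketch below assumes that a byte sequence is well-formed exactly when it splits into consecutive sequences that each match one row of Table 3; that reading, and the names `TABLE_3` and `is_generalized_utf8`, are assumptions of this example.

```python
# (first byte range, second byte range, total length), one entry per row of Table 3.
TABLE_3 = [
    ((0x00, 0x7F), None, 1),
    ((0xC2, 0xDF), (0x80, 0xBF), 2),
    ((0xE0, 0xE0), (0xA0, 0xBF), 3),
    ((0xE1, 0xEF), (0x80, 0xBF), 3),
    ((0xF0, 0xF0), (0x90, 0xBF), 4),
    ((0xF1, 0xF3), (0x80, 0xBF), 4),
    ((0xF4, 0xF4), (0x80, 0x8F), 4),
]


def is_generalized_utf8(data: bytes) -> bool:
    """Check data against Table 3, one code point at a time."""
    i = 0
    while i < len(data):
        for (lo, hi), second, length in TABLE_3:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # no row accepts this first byte
        seq = data[i:i + length]
        if len(seq) < length:
            return False  # truncated sequence
        if second is not None and not second[0] <= seq[1] <= second[1]:
            return False
        if any(not 0x80 <= b <= 0xBF for b in seq[2:]):
            return False
        i += length
    return True


print(is_generalized_utf8(b"\xed\xa0\x80"))  # a lone surrogate is accepted here
print(is_generalized_utf8(b"\xc0\x80"))      # overlong forms are not
```

Note that this check accepts surrogate pair byte sequences, since generalized UTF-8 places no restriction on surrogate code points.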

6. The WTF-8 encoding

WTF-8 (Wobbly Transformation Format − 8-bit) is an encoding, using 8-bit bytes, of code point sequences that do not contain any surrogate code point pair.

Note: Just as UTF-8 is artificially restricted to Unicode text in order to match UTF-16, WTF-8 is artificially restricted to exclude surrogate code point pairs in order to match potentially ill-formed UTF-16.

It is identical to generalized UTF-8, with the additional well-formedness constraint that a surrogate pair byte sequence is ill-formed. It is a strict subset of generalized UTF-8 and a strict superset of UTF-8.

Note: Similarly, UTF-8 is a strict superset of ASCII.

WTF-8 must not be used for interchange. See Intended audience.

6.1. Encoding

To encode from code points to well-formed WTF-8, run these steps:

  1. Let result be a sequence of bytes, initially empty.
  2. For every code point P of the input, run these substeps:
    1. If P is a lead surrogate code point, P is not the last code point of the input, and the next code point next is a trail surrogate code point, consume next and set P’s value to: 0x10000 + ((P - 0xD800) << 10) + (next - 0xDC00).
    2. Depending on P:
      U+0000 to U+007F
        Append to result one byte of value P.
      U+0080 to U+07FF
        Append to result two bytes of values:
          1. 0xC0 | (P >> 6)
          2. 0x80 | (P & 0x3F)
      U+0800 to U+FFFF
        Append to result three bytes of values:
          1. 0xE0 | (P >> 12)
          2. 0x80 | ((P >> 6) & 0x3F)
          3. 0x80 | (P & 0x3F)
      U+10000 to U+10FFFF
        Append to result four bytes of values:
          1. 0xF0 | (P >> 18)
          2. 0x80 | ((P >> 12) & 0x3F)
          3. 0x80 | ((P >> 6) & 0x3F)
          4. 0x80 | (P & 0x3F)
  3. Return result.

Note: If the input contains a surrogate code point pair, the resulting byte sequence will not represent the original sequence of code points. Instead, it will represent the same code points as if it had been encoded in potentially ill-formed UTF-16. This is also consistent with encoding each code point to WTF-8 individually, and concatenating the resulting WTF-8 byte sequences.
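The encoding steps can be sketched in Python as follows, with code points given as integers. The function name `encode_wtf8` is chosen for this illustration only.

```python
from typing import List


def encode_wtf8(code_points: List[int]) -> bytes:
    """Encode code points to WTF-8, merging any surrogate code point pair."""
    result = bytearray()
    i = 0
    while i < len(code_points):
        p = code_points[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF and i < len(code_points)
                and 0xDC00 <= code_points[i] <= 0xDFFF):
            # Lead followed by trail: consume the trail and combine them.
            p = 0x10000 + ((p - 0xD800) << 10) + (code_points[i] - 0xDC00)
            i += 1
        if p <= 0x7F:
            result.append(p)
        elif p <= 0x7FF:
            result += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p <= 0xFFFF:
            result += bytes([0xE0 | (p >> 12),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
        else:
            result += bytes([0xF0 | (p >> 18),
                             0x80 | ((p >> 12) & 0x3F),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
    return bytes(result)


print(encode_wtf8([0xD800]).hex(" "))          # ed a0 80
print(encode_wtf8([0xD83D, 0xDCA9]).hex(" "))  # f0 9f 92 a9
```

For input that is Unicode text, the output matches what a UTF-8 encoder would produce.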

6.2. Decoding

To decode from well-formed WTF-8 to code points, run these steps:

Note: Since WTF-8 must not be used for interchange (see Intended audience), this algorithm is deliberately not defined for arbitrary byte sequences. It is only defined for byte sequences known to be well-formed in WTF-8, such as sequences encoded from code points, converted from UTF-16, or concatenated from sequences themselves well-formed in WTF-8.

  1. Let result be a sequence of code points, initially empty.
  2. For every byte B of the input, depending on B:
    0x00 to 0x7F
      Append to result a code point of value B.
    0xC2 to 0xDF
      Let B2 be the next byte and consume it.
      Append to result a code point of value ((B & 0x1F) << 6) + (B2 & 0x3F)
    0xE0 to 0xEF
      Let B2 and B3 be the next two bytes, and consume them.
      Append to result a code point of value ((B & 0x0F) << 12) + ((B2 & 0x3F) << 6) + (B3 & 0x3F)
    0xF0 to 0xF4
      Let B2, B3, and B4 be the next three bytes, and consume them.
      Append to result a code point of value ((B & 0x07) << 18) + ((B2 & 0x3F) << 12) + ((B3 & 0x3F) << 6) + (B4 & 0x3F)
  3. Return result.

Note: If the input is also well-formed in UTF-8, this is identical to decoding UTF-8 and the resulting sequence is Unicode text.
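A Python sketch of the decoding steps is shown below. Like the algorithm, it assumes its input is already known to be well-formed WTF-8; the `ValueError` for an unexpected first byte is an addition of this sketch, not something the specification defines.

```python
from typing import List


def decode_wtf8(data: bytes) -> List[int]:
    """Decode well-formed WTF-8 to a list of code points."""
    result = []
    it = iter(data)
    for b in it:
        if b <= 0x7F:
            result.append(b)
        elif 0xC2 <= b <= 0xDF:
            b2 = next(it)  # consume one continuation byte
            result.append(((b & 0x1F) << 6) + (b2 & 0x3F))
        elif 0xE0 <= b <= 0xEF:
            b2, b3 = next(it), next(it)
            result.append(((b & 0x0F) << 12) + ((b2 & 0x3F) << 6) + (b3 & 0x3F))
        elif 0xF0 <= b <= 0xF4:
            b2, b3, b4 = next(it), next(it), next(it)
            result.append(((b & 0x07) << 18) + ((b2 & 0x3F) << 12)
                          + ((b3 & 0x3F) << 6) + (b4 & 0x3F))
        else:
            # Not reachable for well-formed input (assumption of this sketch).
            raise ValueError("input is not well-formed WTF-8")
    return result


print([hex(p) for p in decode_wtf8(b"a\xed\xa0\x80\xf0\x9f\x92\xa9")])
```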

6.3. Converting between WTF-8 and potentially ill-formed UTF-16

To convert from potentially ill-formed UTF-16 to WTF-8, run these steps:

  1. Decode the input from potentially ill-formed UTF-16 to code points.
  2. Encode the resulting code points to WTF-8.

Note: This conversion never fails and is lossless.

To convert from WTF-8 to potentially ill-formed UTF-16, run these steps:

  1. Decode the input from WTF-8 to code points.
  2. Encode the resulting code points to potentially ill-formed UTF-16.

Note: This conversion never fails and, if the input is well-formed in WTF-8, is lossless.
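As an informal illustration, both conversions can be approximated with Python's built-in codecs and the `surrogatepass` error handler, which lets unpaired surrogates through instead of raising an error. This is a sketch of one possible approach, not the specification's algorithm; the function names are hypothetical, and 16-bit code units are modelled as a list of integers.

```python
import struct
from typing import List


def wtf16_to_wtf8(units: List[int]) -> bytes:
    """Convert potentially ill-formed UTF-16 code units to WTF-8 bytes."""
    raw = struct.pack(f"<{len(units)}H", *units)
    # The UTF-16 decoder combines surrogate pairs; "surrogatepass" keeps
    # unpaired surrogates as lone code points instead of failing.
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_wtf16(data: bytes) -> List[int]:
    """Convert WTF-8 bytes (assumed well-formed) to 16-bit code units."""
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


units = [0xDCA9, 0x41, 0xD83D]  # two unpaired surrogates around "A"
assert wtf8_to_wtf16(wtf16_to_wtf8(units)) == units
print(wtf16_to_wtf8([0xD83D, 0xDCA9]).hex(" "))
```

Python's decoder does not reject surrogate pair byte sequences, so `wtf8_to_wtf16` relies on the caller to pass well-formed WTF-8, as the decoding algorithm in this specification does.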

6.4. Converting between WTF-8 and UTF-8

Since WTF-8 is a superset of UTF-8, any sequence of bytes that is well-formed in UTF-8 is also well-formed in WTF-8 and represents the same text. To convert from UTF-8 to WTF-8, return the input unchanged.

Note: This conversion never fails and is lossless.

To convert lossily from WTF-8 to UTF-8, replace any surrogate byte sequence with the sequence of three bytes <0xEF, 0xBF, 0xBD>, the UTF-8 encoding of the replacement character.

Note: Since surrogate byte sequences are also three bytes long, this conversion can be done in place.

Note: This conversion never fails but is lossy.

To convert strictly from WTF-8 to UTF-8, run these steps:

  1. If the input contains a surrogate byte sequence, return failure.
  2. Otherwise, return the input unchanged.

Note: This conversion is lossless when it succeeds, but it can fail.
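Both conversions reduce to scanning for the pattern of Table 1. In well-formed WTF-8, the byte 0xED only ever appears as the first byte of a three-byte sequence, so the Python sketch below simply looks for 0xED followed by a byte in the range 0xA0 to 0xBF. The function names are illustrative, and the input is assumed to be well-formed WTF-8.

```python
from typing import Optional


def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Replace every surrogate byte sequence with U+FFFD (EF BF BD)."""
    out = bytearray(data)
    for i in range(len(out) - 2):
        if out[i] == 0xED and 0xA0 <= out[i + 1] <= 0xBF:
            out[i:i + 3] = b"\xEF\xBF\xBD"  # same length, so done in place
    return bytes(out)


def wtf8_to_utf8_strict(data: bytes) -> Optional[bytes]:
    """Return the input unchanged, or None (failure) if it holds a surrogate."""
    for i in range(len(data) - 2):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return None
    return data


sample = b"a\xed\xa0\x80b"
print(wtf8_to_utf8_lossy(sample).decode("utf-8"))
print(wtf8_to_utf8_strict(sample))
```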

6.5. Concatenating WTF-8 strings

Concatenating WTF-8 strings requires extra care to preserve well-formedness.

To concatenate two WTF-8 strings, run these steps:

  1. If the left input string ends with a lead surrogate byte sequence and the right input string starts with a trail surrogate byte sequence, run these substeps:
    1. Let lead and trail be two code points, the respective results of decoding from WTF-8 these two surrogate byte sequences.
    2. Let supplementary be the encoding to WTF-8 of a single code point of value 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00).
    3. Let left be the substring of the left input string that removes the three final bytes.
    4. Let right be the substring of the right input string that removes the three initial bytes.
    5. Return the concatenation of left, supplementary, and right.
  2. Otherwise, return the concatenation of the two input byte sequences.

Note: This is equivalent to converting both strings to potentially ill-formed UTF-16, concatenating the resulting 16-bit code unit sequences, then converting the concatenation back to WTF-8.
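The concatenation steps can be sketched in Python as follows. Both inputs are assumed to be well-formed WTF-8, and `concat_wtf8` is a name made up for this example.

```python
def concat_wtf8(left: bytes, right: bytes) -> bytes:
    """Concatenate two well-formed WTF-8 strings, merging a split pair."""
    ends_with_lead = (len(left) >= 3 and left[-3] == 0xED
                      and 0xA0 <= left[-2] <= 0xAF)
    starts_with_trail = (len(right) >= 3 and right[0] == 0xED
                         and 0xB0 <= right[1] <= 0xBF)
    if not (ends_with_lead and starts_with_trail):
        return left + right
    # Decode the two three-byte surrogate byte sequences (first byte is ED).
    lead = 0xD000 + ((left[-2] & 0x3F) << 6) + (left[-1] & 0x3F)
    trail = 0xD000 + ((right[1] & 0x3F) << 6) + (right[2] & 0x3F)
    p = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
    # Encode the supplementary code point as four bytes.
    supplementary = bytes([0xF0 | (p >> 18),
                           0x80 | ((p >> 12) & 0x3F),
                           0x80 | ((p >> 6) & 0x3F),
                           0x80 | (p & 0x3F)])
    return left[:-3] + supplementary + right[3:]


print(concat_wtf8(b"\xed\xa0\xbd", b"\xed\xb2\xa9").hex(" "))  # f0 9f 92 a9
```

A plain byte concatenation would instead yield a surrogate pair byte sequence, which is ill-formed in WTF-8.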

7. Implementations

This section is non-normative.

8. Acknowledgments

Thanks to Coralie Mercier for coining the name WTF-8.

Thanks for feedback and contributions from Anne van Kesteren, David Baron, Dylan Petonke, Guillaume Knispel, Henri Sivonen, Jacob Lifshay, James Graham, Lily Ballard, Mathias Bynens, Ms2ger, Sam Tobin-Hochstadt, and Tab Atkins.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Index


Terms defined by this specification


References


Normative References

[RFC2119] S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[UNICODE] The Unicode Standard. URL: http://www.unicode.org/versions/latest/

Informative References

[CHARSETS] Character sets. URL: https://www.iana.org/assignments/character-sets
[ENCODING] Anne van Kesteren. Encoding Standard. Living Standard. URL: https://encoding.spec.whatwg.org/

--------------------------------------------------------------------------------
/index.src.html:
--------------------------------------------------------------------------------

The WTF-8 encoding

Shortname: wtf-8
Status: LS
Boilerplate: omit feedback-header
!Issue tracking: On GitHub
!Change history: On GitHub
!Last updated: [DATE]
Editor: Simon Sapin, Mozilla https://www.mozilla.org/, https://exyr.org/
Abstract: WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

Intended audience

WTF-8 is a hack intended to be used internally in self-contained systems
with components that need to support potentially ill-formed UTF-16
for legacy reasons.

Any WTF-8 data must be converted
to a Unicode encoding at the system’s boundary
before being emitted.
UTF-8 is recommended.
WTF-8 must not be used to represent text
in a file format or for transmission over the Internet.

In particular,
the Encoding Standard [[ENCODING]]
defines UTF-8 and other encodings for the Web.
There is not, and never will be, any
encoding label [[ENCODING]] or
IANA charset alias [[CHARSETS]]
for WTF-8.

Background and motivation

This section is non-normative.

When Unicode 1.0
was published in 1991,
it defined 65536 code points from U+0000 to U+FFFF
and assigned characters to around half of them.
Many software implementations chose the obvious memory representation
for Unicode text of 16 bits per code point / character.

At the time, “Unicode” was synonymous with that particular encoding.
To disambiguate, that encoding is now called UCS-2.

As subsequent versions of Unicode assigned more characters,
it became apparent that 65536 code points would not be sufficient.
Unicode was extended to 1114112 code points from U+0000 to U+10FFFF,
and the UTF-16 encoding was introduced.
This encoding preserves compatibility with existing 16-bit based systems
and represents new (supplementary) code points
as a pair of “surrogates”.

UTF-16 is designed to represent any Unicode text,
but it cannot represent a surrogate code point pair
since the corresponding surrogate 16-bit code unit pair
would instead represent a supplementary code point.
Therefore, the concept of Unicode scalar value was introduced
and Unicode text was restricted to not contain any surrogate code point.
(This was presumably deemed simpler than only restricting pairs.)

UTF-16 was redefined to be ill-formed
if it contains unpaired surrogate 16-bit code units.
UTF-8 was similarly redefined to be ill-formed
if it contains surrogate byte sequences.

Meanwhile, 16-bit based systems had little to no incentive
to do anything about surrogates:
for several years,
Unicode did not assign any character to supplementary code points,
and then (until emoji) only comparatively rare characters.
Additionally, the Unicode Standard does not require conforming implementations
to maintain well-formedness of UTF-16 strings.

As a result, surrogates do occur in practice and need to be preserved.
For example, JavaScript and Windows use UTF-16 internally
but do not enforce the invariant that surrogates must be paired.

We say that strings in these systems are encoded in potentially ill-formed UTF-16
or WTF-16.

Unpaired surrogate 16-bit code units are the only case
where an arbitrary sequence of 16-bit code units is ill-formed in UTF-16.
UTF-8, however, is more complex
and maintaining its well-formedness is arguably more valuable.

This specification defines WTF-8,
a superset of UTF-8 that can losslessly represent
arbitrary sequences of 16-bit code units (even if ill-formed in UTF-16)
but preserves the other well-formedness constraints of UTF-8.

121 | Differences with CESU-8

122 | 123 | Unicode defines a 124 | Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8). 125 | WTF-8 is different from CESU-8. 126 | 127 | CESU-8 encodes supplementary code points as surrogate pair byte sequences 128 | of six bytes, 129 | whereas WTF-8, like UTF-8, encodes them as sequences of four bytes. 130 | Therefore, CESU-8 is not a superset of UTF-8. 131 | 132 | CESU-8 is also a mapping on UTF-16 code units. 133 | Therefore unpaired surrogate byte sequences are ill-formed in CESU-8, 134 | whereas supporting them is the entire point of WTF-8. 135 | 136 | 143 | 144 | 145 |
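As a rough illustration of the size difference described above, the following Python sketch encodes U+1F4A9 both ways. The helper name `cesu8_encode_supplementary` is hypothetical and handles only a single supplementary code point; it leans on Python's `surrogatepass` error handler to emit the three-byte form of each surrogate, which is an implementation shortcut rather than anything this specification defines.

```python
def cesu8_encode_supplementary(code_point: int) -> bytes:
    """Hypothetical helper: CESU-8 bytes for one supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    lead = 0xD800 + (offset >> 10)
    trail = 0xDC00 + (offset & 0x3FF)
    # Each surrogate is written as its own three-byte sequence.
    return (chr(lead) + chr(trail)).encode("utf-8", "surrogatepass")


utf8_bytes = chr(0x1F4A9).encode("utf-8")  # also the WTF-8 form: four bytes
cesu8_bytes = cesu8_encode_supplementary(0x1F4A9)  # six bytes
print(utf8_bytes.hex(" "))   # f0 9f 92 a9
print(cesu8_bytes.hex(" "))  # ed a0 bd ed b2 a9
```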

146 | Terminology

147 | 148 | These definitions correspond to those of the 149 | Glossary of Unicode Terms. [[!UNICODE]] 150 | 151 | A Unicode code point is any value in the Unicode codespace; 152 | that is, the range of integers from 0 to 1114111. 153 | It is noted with a “U+” prefix and four to six hexadecimal digits: 154 | the first and last code points are U+0000 and U+10FFFF. 155 | 156 | The Basic Multilingual Plane is 157 | the range of code points from U+0000 to U+FFFF. 158 | 159 | A BMP code point is a code point in the Basic Multilingual Plane. 160 | 161 | A supplementary code point 162 | is a code point not in the Basic Multilingual Plane. 163 | That is, a code point in the range from U+10000 to U+10FFFF. 164 | 165 | A Unicode scalar value 166 | is a code point that is not a surrogate code point. 167 | That is, a code point in the range from U+0000 to U+D7FF, 168 | or in the range U+E000 to U+10FFFF. 169 | 170 | A BMP scalar value is a Unicode scalar value 171 | in the Basic Multilingual Plane. 172 | That is, a code point in the range from U+0000 to U+D7FF, 173 | or in the range U+E000 to U+FFFF. 174 | 175 | Unicode text is a sequence of Unicode scalar values. 176 | 177 | UTF-8 is an encoding of Unicode text using 8-bit bytes. 178 | Each Unicode scalar value is represented as a sequence of one to four bytes. 179 | 180 | UTF-16 is an encoding of Unicode text using 16-bit code units. 181 | BMP scalar values are represented as a single 16-bit code unit with the same value. 182 | Supplementary code points are represented as a surrogate 16-bit code unit pair. 183 | 184 | Note: this specification is only concerned with the UTF-16 185 | encoding form 186 | (based on 16-bit code units), 187 | and not with the 188 | encoding scheme 189 | (based on bytes, with UTF-16BE and UTF-16LE variants). 190 | 191 | A string is well-formed 192 | (not ill-formed) 193 | in a given encoding if it follows the specification of that encoding. 
194 | [[!UNICODE]] defines 195 | 196 | Well-Formed Code Unit Sequence 197 | for UTF-8 and UTF-16. 198 | 199 |
200 | In particular: 201 | 202 | 207 |
208 | 209 | The replacement character is the code point U+FFFD 210 | REPLACEMENT CHARACTER (�). 211 | It is used as a substitute to replace ill-formed sub-sequences 212 | during a conversion. 213 | 214 | A 16-bit code unit is a 16-bit integer used in UTF-16. 215 | It is noted with a “0x” prefix and four hexadecimal digits: 216 | the first and last 16-bit code units are 0x0000 and 0xFFFF. 217 | 218 | Note: The byte serialization or memory representation of a 16-bit code unit 219 | (little-endian or big-endian) 220 | is out of scope for this specification. 221 | 222 | When an algorithm iterates over a sequence (“For every i in …”), 223 | consuming the next item means advancing in the sequence 224 | such that that item will be skipped during the following iteration of the loop: 225 | the item after the next becomes the next item. 226 | 227 |
228 | 229 | The following algorithm prints “1”, “2”, and “4”. 230 | 231 | For every digit i in “1234”, run these substeps: 232 | 233 |
    234 |
  1. Print i 235 |
  2. If i is 2, consume the next digit. 236 |
237 |
238 | 239 | 240 |
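The "consume the next item" convention maps naturally onto an explicit iterator. This is only a sketch in Python; the function name `digits_with_skip` is made up for the example.

```python
def digits_with_skip(text: str) -> list:
    """Collect what the example algorithm would print for the given digits."""
    printed = []
    items = iter(text)
    for digit in items:
        printed.append(digit)
        if digit == "2":
            # "Consume" the next item so the loop never sees it.
            next(items, None)
    return printed


print(digits_with_skip("1234"))  # ['1', '2', '4']
```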

241 | Surrogate code points

242 | 243 | A lead surrogate code point or high surrogate code point 244 | is a code point in the range from U+D800 to U+DBFF. 245 | 246 | A trail surrogate code point or low surrogate code point 247 | is a code point in the range from U+DC00 to U+DFFF. 248 | 249 | A surrogate code point is 250 | either a lead surrogate code point or a trail surrogate code point. 251 | That is, a code point in the range from U+D800 to U+DFFF. 252 | 253 | A surrogate code point pair is a sequence of 254 | a lead surrogate code point followed by a trail surrogate code point. 255 | 256 | An unpaired surrogate code point is a surrogate code point 257 | that is not part of a surrogate code point pair. 258 | 259 | 260 |

261 | Surrogate 16-bit code units

262 | 263 | A lead surrogate 16-bit code unit or high surrogate 16-bit code unit 264 | is a 16-bit code unit in the range from 0xD800 to 0xDBFF. 265 | 266 | A trail surrogate 16-bit code unit or low surrogate 16-bit code unit 267 | is a 16-bit code unit in the range from 0xDC00 to 0xDFFF. 268 | 269 | A surrogate 16-bit code unit is 270 | either a lead surrogate 16-bit code unit or a trail surrogate 16-bit code unit. 271 | That is, a 16-bit code unit in the range from 0xD800 to 0xDFFF. 272 | 273 | A surrogate 16-bit code unit pair is a sequence of 274 | a lead surrogate 16-bit code unit followed by a trail surrogate 16-bit code unit. 275 | In UTF-16, it represents a supplementary code point. 276 | 277 | An unpaired surrogate 16-bit code unit is a surrogate 16-bit code unit 278 | that is not part of a surrogate 16-bit code unit pair. 279 | 280 | 281 |

282 | Surrogate byte sequences

283 | 284 | Note: A surrogate byte sequence 285 | (and therefore any byte sequence described in this section) 286 | is ill-formed in UTF-8. 287 | Decoders are required to treat it as an error. 288 | 289 | A lead surrogate byte sequence or high surrogate byte sequence 290 | is a sequence of three bytes 291 | that represents a lead surrogate code point in generalized UTF-8. 292 | 293 | 294 | A trail surrogate byte sequence or low surrogate byte sequence 295 | is a sequence of three bytes 296 | that represents a trail surrogate code point in generalized UTF-8. 297 | 298 | A surrogate byte sequence is 299 | either a lead surrogate byte sequence or a trail surrogate byte sequence. 300 | That is, a sequence of three bytes 301 | that represents a surrogate code point in generalized UTF-8. 302 | 303 | 304 | 311 | 316 | 317 | 322 | 323 | 328 | 329 | 334 |
305 | 306 | Table 1. Surrogate byte sequences 307 | 308 | Bytes noted in hexadecimal. 309 | 310 |
312 | First byte 313 | Second byte 314 | Third byte 315 |
Lead surrogate byte sequence 318 | ED 319 | A0 to AF 320 | 80 to BF 321 |
Trail surrogate byte sequence 324 | ED 325 | B0 to BF 326 | 80 to BF 327 |
Surrogate byte sequence 330 | ED 331 | A0 to BF 332 | 80 to BF 333 |
335 | 336 | A surrogate pair byte sequence is a sequence of six bytes composed of 337 | a lead surrogate byte sequence followed by a trail surrogate byte sequence. 338 | 339 | An unpaired surrogate byte sequence is a surrogate byte sequence 340 | that is not part of a surrogate pair byte sequence. 341 | 342 | 343 |
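The byte ranges of Table 1 translate directly into range checks. The sketch below assumes the caller passes exactly one candidate three-byte sequence; `classify_surrogate_bytes` is an illustrative name, not part of this specification.

```python
from typing import Optional


def classify_surrogate_bytes(seq: bytes) -> Optional[str]:
    """Return "lead", "trail", or None for a candidate sequence (Table 1)."""
    if len(seq) != 3 or seq[0] != 0xED or not 0x80 <= seq[2] <= 0xBF:
        return None
    if 0xA0 <= seq[1] <= 0xAF:
        return "lead"
    if 0xB0 <= seq[1] <= 0xBF:
        return "trail"
    # Second byte 0x80 to 0x9F: U+D000 to U+D7FF, not a surrogate.
    return None
```

A surrogate pair byte sequence is then any six bytes whose first half classifies as "lead" and whose second half classifies as "trail".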

344 | Potentially ill-formed UTF-16

345 | 346 | A sequence of 16-bit code units is potentially ill-formed UTF-16 347 | if it is intended to be interpreted as UTF-16, 348 | but is not necessarily well-formed in UTF-16. 349 | It effectively encodes a sequence of code points 350 | that does not contain any surrogate code point pair. 351 | 352 | Note: Like UTF-16, 353 | potentially ill-formed UTF-16 cannot represent a surrogate code point pair 354 | since the corresponding surrogate 16-bit code unit pair would instead 355 | represent a supplementary code point. 356 | Unlike well-formed UTF-16, it might contain isolated surrogate code points. 357 | 358 | Any sequence of 16-bit code units 359 | has an interpretation as potentially ill-formed UTF-16. 360 | 361 | WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16, 362 | especially in the context of systems that were originally designed for UCS-2 363 | and later upgraded to UTF-16 but never enforced well-formedness, 364 | either by neglect or because of backward-compatibility constraints. 365 | 366 | 367 |

368 | Encoding

369 | 370 | To encode from code points to 371 | potentially ill-formed UTF-16, 372 | run these steps: 373 | 374 |
    375 |
  1. Let result be a sequence of 16-bit code units, initially empty. 376 |
  2. For every code point P of the input, run these substeps: 377 |
      378 |
    1. 379 | If P is a supplementary code point, 380 | append to result two 16-bit code units of values: 381 |
        382 |
      1. ((P - 0x10000) >> 10) + 0xD800 383 |
      2. ((P - 0x10000) & 0x3FF) + 0xDC00 384 |
      385 |
    2. Otherwise (P is a BMP code point), 386 | append to result a 16-bit code unit of value P. 387 |
    388 |
  3. Return result. 389 |
390 | 391 | Note: If the input is restricted to Unicode text, 392 | this is identical to encoding to UTF-16 393 | and the resulting sequence is well-formed in UTF-16. 394 | 395 |
396 | 397 | If, on the other hand, the input contains a surrogate code point pair, 398 | the conversion will be incorrect and 399 | the resulting sequence will not represent the original code points. 400 | 401 | This situation should be considered an error, 402 | but this specification does not define how to handle it. 403 | Possibilities include aborting the conversion, 404 | or replacing one of the surrogate code points of the pair 405 | with a replacement character. 406 |
407 | 408 | 409 |
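The encoding steps above can be sketched in Python as follows. Code points and 16-bit code units are modelled as plain integers, and `encode_wtf16` is a name chosen for this example. The sketch does not detect the error case of a surrogate code point pair in the input; it simply shows why that case is ambiguous.

```python
def encode_wtf16(code_points):
    """Encode code points to potentially ill-formed UTF-16 (list of ints)."""
    result = []
    for p in code_points:
        if p > 0xFFFF:
            # Supplementary code point: split into two surrogate code units.
            offset = p - 0x10000
            result.append((offset >> 10) + 0xD800)
            result.append((offset & 0x3FF) + 0xDC00)
        else:
            # BMP code point, including unpaired surrogates: copy the value.
            result.append(p)
    return result


# A surrogate code point pair yields the same code units as the
# supplementary code point it would be mistaken for.
print(encode_wtf16([0x1F4A9]) == encode_wtf16([0xD83D, 0xDCA9]))  # True
```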

410 | Decoding

411 | 412 | To decode from potentially ill-formed UTF-16 413 | to code points, 414 | run these steps: 415 | 416 |
    417 |
  1. Let result be a sequence of code points, initially empty. 418 |
  2. For every 16-bit code unit U of the input, run these substeps: 419 |
      420 |
    1. 421 | If U is a lead surrogate 16-bit code unit, 422 | U is not the last 16-bit code unit of the input, 423 | and the next 16-bit code unit of the input (call it next) 424 | is a trail surrogate 16-bit code unit, 425 | then 426 | consume next 427 | and append to result a code point of value 428 | 429 | 0x10000 + 430 | ((U - 0xD800) << 10) + 431 | (next - 0xDC00). 432 |
    2. 433 | Otherwise, 434 | append to result a code point of value U. 435 |
    436 |
  3. Return result. 437 |
438 | 439 | Note: By construction, 440 | the resulting sequence does not contain a surrogate code point pair. 441 | 442 | Note: If the input is well-formed in UTF-16, 443 | this is identical to decoding UTF-16 444 | and the resulting sequence is Unicode text. 445 | 446 | 447 |
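A minimal Python sketch of the decoding steps, with code units and code points as plain integers; `decode_wtf16` is an illustrative name. An index is used instead of an iterator so that peeking at the next code unit is explicit.

```python
def decode_wtf16(units):
    """Decode potentially ill-formed UTF-16 (list of ints) to code points."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF
                and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Lead followed by trail: combine and consume both code units.
            nxt = units[i + 1]
            result.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        else:
            # BMP scalar value or unpaired surrogate: keep the value as is.
            result.append(u)
            i += 1
    return result
```

A trail surrogate followed by a lead surrogate is not a pair, so `decode_wtf16([0xDCA9, 0xD83D])` leaves both values untouched.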

448 | Generalized UTF-8

449 | 450 | For the purpose of this specification, 451 | generalized UTF-8 is an encoding of sequences of code points 452 | (not restricted to Unicode scalar values) 453 | using 8-bit bytes, 454 | based on the same underlying algorithm as UTF-8. 455 | It is a strict superset of UTF-8 456 | (just as UTF-8 is a strict superset of ASCII). 457 | 458 | Each code point is encoded as a sequence of one to four bytes: 459 | 460 | 461 | 469 | 475 | 476 | 482 | 483 | 489 | 490 | 496 | 497 | 503 |
462 | 463 | Table 2. Bit distribution 464 | 465 | Bytes noted in binary, most significant bit first. 466 | x bits represent the least significant bits of the code points. 467 | 468 |
Code point 470 | First byte 471 | Second byte 472 | Third byte 473 | Fourth byte 474 |
U+0000 to U+007F 477 | 0xxxxxxx 478 | 479 | 480 | 481 |
U+0080 to U+07FF 484 | 110xxxxx 485 | 10xxxxxx 486 | 487 | 488 |
U+0800 to U+FFFF 491 | 1110xxxx 492 | 10xxxxxx 493 | 10xxxxxx 494 | 495 |
U+10000 to U+10FFFF 498 | 11110xxx 499 | 10xxxxxx 500 | 10xxxxxx 501 | 10xxxxxx 502 |
504 | 505 | A byte sequence is well-formed in generalized UTF-8 506 | if and only if: 507 | 508 | 514 | 515 | 516 | 523 | 529 | 530 | 536 | 537 | 543 | 544 | 550 | 551 | 557 | 558 | 564 | 565 | 571 | 572 | 578 |
517 | 518 | Table 3. Well-formed byte sequences representing a single code point 519 | 520 | Bytes noted in hexadecimal. 521 | 522 |
Code point 524 | First byte 525 | Second byte 526 | Third byte 527 | Fourth byte 528 |
U+0000 to U+007F 531 | 00 to 7F 532 | 533 | 534 | 535 |
U+0080 to U+07FF 538 | C2 to DF 539 | 80 to BF 540 | 541 | 542 |
U+0800 to U+0FFF 545 | E0 546 | A0 to BF 547 | 80 to BF 548 | 549 |
U+1000 to U+FFFF 552 | E1 to EF 553 | 80 to BF 554 | 80 to BF 555 | 556 |
U+10000 to U+3FFFF 559 | F0 560 | 90 to BF 561 | 80 to BF 562 | 80 to BF 563 |
U+40000 to U+FFFFF 566 | F1 to F3 567 | 80 to BF 568 | 80 to BF 569 | 80 to BF 570 |
U+100000 to U+10FFFF 573 | F4 574 | 80 to 8F 575 | 80 to BF 576 | 80 to BF 577 |
579 | 580 | 581 |

582 | The WTF-8 encoding

583 | 584 | WTF-8 (Wobbly Transformation Format − 8-bit) 585 | is an encoding of code point sequences 586 | that do not contain any surrogate code point pair 587 | using 8-bit bytes. 588 | 589 | Note: Just as UTF-8 is artificially restricted to Unicode text 590 | in order to match UTF-16, 591 | WTF-8 is artificially restricted to exclude surrogate code point pairs 592 | in order to match potentially ill-formed UTF-16. 593 | 594 | It is identical to generalized UTF-8, 595 | with the additional well-formedness constraint that 596 | a surrogate pair byte sequence is ill-formed. 597 | It is a strict subset of generalized UTF-8 598 | and a strict superset of UTF-8. 599 | 600 | Note: Similarly, UTF-8 is a strict superset of ASCII. 601 | 602 | WTF-8 must not be used for interchange. 603 | See Intended audience. 604 | 605 | 606 |
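One way to picture the definition is as a validator. The Python sketch below rests on an assumption: it treats a byte sequence as well-formed in generalized UTF-8 when it is a concatenation of sequences that each match a row of Table 3, and then additionally rejects any surrogate pair byte sequence. The names `_TABLE_3_ROWS` and `is_well_formed_wtf8` are invented for the example.

```python
# Rows of Table 3 as (low, high) ranges, one range per byte.
_TABLE_3_ROWS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def is_well_formed_wtf8(data: bytes) -> bool:
    """Sketch: Table 3 matching plus the 'no surrogate pair' constraint."""
    i = 0
    previous_was_lead = False
    while i < len(data):
        for row in _TABLE_3_ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                    lo <= b <= hi for b, (lo, hi) in zip(chunk, row)):
                break
        else:
            return False  # no row of Table 3 matches at this position
        is_surrogate = len(chunk) == 3 and chunk[0] == 0xED and chunk[1] >= 0xA0
        is_trail = is_surrogate and chunk[1] >= 0xB0
        if previous_was_lead and is_trail:
            return False  # surrogate pair byte sequence
        previous_was_lead = is_surrogate and not is_trail
        i += len(chunk)
    return True
```

Under this reading, a lone `ED A0 80` is accepted while `ED A0 BD ED B2 A9` is rejected, because the latter must be written as the four-byte sequence `F0 9F 92 A9`.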

607 | Encoding

608 | 609 | To encode from code points 610 | to well-formed WTF-8, 611 | run these steps: 612 | 613 |
    614 |
  1. Let result be a sequence of bytes, initially empty. 615 |
  2. For every code point P of the input, run these substeps: 616 |
      617 |
    1. 618 | 619 | If P is a lead surrogate code point, 620 | P is not the last code point of the input, 621 | and the next code point (call it next) is a trail surrogate code point, 622 | consume next and 623 | set P’s value to: 624 | 625 | 0x10000 + 626 | ((P - 0xD800) << 10) + 627 | (next - 0xDC00). 628 | 629 |
    2. Depending on P: 630 |
      631 |
      U+0000 to U+007F 632 |
      Append to result one byte of value P. 633 |
      U+0080 to U+07FF 634 |
      635 | Append to result two bytes of values: 636 |
        637 |
      1. 0xC0 | (P >> 6) 638 |
      2. 0x80 | (P & 0x3F) 639 |
      640 |
      U+0800 to U+FFFF 641 |
      642 | Append to result three bytes of values 643 |
        644 |
      1. 0xE0 | (P >> 12) 645 |
      2. 0x80 | ((P >> 6) & 0x3F) 646 |
      3. 0x80 | (P & 0x3F) 647 |
      648 |
      U+10000 to U+10FFFF 649 |
      650 | Append to result four bytes of values 651 |
        652 |
      1. 0xF0 | (P >> 18) 653 |
      2. 0x80 | ((P >> 12) & 0x3F) 654 |
      3. 0x80 | ((P >> 6) & 0x3F) 655 |
      4. 0x80 | (P & 0x3F) 656 |
      657 |
      658 |
    659 |
  3. Return result. 660 |
661 | 662 | Note: If the input contains a surrogate code point pair, 663 | the resulting byte sequence will not represent 664 | the original sequence of code points. 665 | Instead, it will represent the same code points 666 | as if it had been encoded in potentially ill-formed UTF-16. 667 | This is also consistent with 668 | encoding each code point to WTF-8 individually, 669 | and concatenating 670 | the resulting WTF-8 byte sequences. 671 | 672 | 673 |
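The encoding steps can be sketched in Python like this. Code points are plain integers and `encode_wtf8` is a name chosen for the example; it assumes every input value is in the range 0 to 0x10FFFF and does not check that.

```python
def encode_wtf8(code_points) -> bytes:
    """Encode a list of code points (ints) to well-formed WTF-8."""
    result = bytearray()
    i = 0
    while i < len(code_points):
        p = code_points[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF
                and i < len(code_points)
                and 0xDC00 <= code_points[i] <= 0xDFFF):
            # Surrogate code point pair: merge it and consume the trail.
            p = 0x10000 + ((p - 0xD800) << 10) + (code_points[i] - 0xDC00)
            i += 1
        if p <= 0x7F:
            result.append(p)
        elif p <= 0x7FF:
            result += bytes([0xC0 | (p >> 6), 0x80 | (p & 0x3F)])
        elif p <= 0xFFFF:
            result += bytes([0xE0 | (p >> 12),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
        else:
            result += bytes([0xF0 | (p >> 18),
                             0x80 | ((p >> 12) & 0x3F),
                             0x80 | ((p >> 6) & 0x3F),
                             0x80 | (p & 0x3F)])
    return bytes(result)


print(encode_wtf8([0xD800]).hex(" "))          # ed a0 80
print(encode_wtf8([0xD83D, 0xDCA9]).hex(" "))  # f0 9f 92 a9
```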

674 | Decoding

675 | 676 | To decode from well-formed WTF-8 677 | to code points, 678 | run these steps: 679 | 680 | Note: Since WTF-8 must not be used for interchange 681 | (see Intended audience), 682 | this algorithm is deliberately not defined for arbitrary byte sequences. 683 | It is only defined for byte sequences known to be well-formed in WTF-8, 684 | such as sequences 685 | encoded from code points, 686 | converted from UTF-16, or 687 | concatenated 688 | from sequences themselves well-formed in WTF-8. 689 | 690 |
    691 |
  1. Let result be a sequence of code points, initially empty. 692 |
  2. For every byte B of the input, depending on B: 693 |
    694 |
    0x00 to 0x7F 695 |
    Append to result a code point of value B. 696 | 697 |
    0xC2 to 0xDF 698 |
    699 | Let B2 be the next byte and consume it. 700 | 701 | Append to result a code point of value 702 | 703 | ((B & 0x1F) << 6) + 704 | (B2 & 0x3F) 705 | 706 |
    0xE0 to 0xEF 707 |
    708 | Let B2 and B3 be the next two bytes, 709 | and consume them. 710 | 711 | Append to result a code point of value 712 | 713 | ((B & 0x0F) << 12) + 714 | ((B2 & 0x3F) << 6) + 715 | (B3 & 0x3F) 716 | 717 |
    0xF0 to 0xF4 718 |
    719 | Let B2, B3, and B4 be the next three bytes, 720 | and consume them. 721 | 722 | Append to result a code point of value 723 | 724 | ((B & 0x07) << 18) + 725 | ((B2 & 0x3F) << 12) + 726 | ((B3 & 0x3F) << 6) + 727 | (B4 & 0x3F) 728 |
    729 |
  3. Return result. 730 |
731 | 732 | Note: If the input is also well-formed in UTF-8, 733 | this is identical to decoding UTF-8 734 | and the resulting sequence is Unicode text. 735 | 736 | 737 |
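A Python sketch of the decoding steps follows. Like the algorithm, it assumes the input is already known to be well-formed WTF-8 and performs no validation of continuation bytes; the final `else` branch that raises an error is an addition for the example, not something the algorithm defines. The name `decode_wtf8` is illustrative.

```python
def decode_wtf8(data: bytes):
    """Decode well-formed WTF-8 to a list of code points (ints)."""
    result = []
    it = iter(data)
    for b in it:
        if b <= 0x7F:
            result.append(b)
        elif 0xC2 <= b <= 0xDF:
            b2 = next(it)  # consume one continuation byte
            result.append(((b & 0x1F) << 6) + (b2 & 0x3F))
        elif 0xE0 <= b <= 0xEF:
            b2, b3 = next(it), next(it)  # consume two continuation bytes
            result.append(((b & 0x0F) << 12) + ((b2 & 0x3F) << 6) + (b3 & 0x3F))
        elif 0xF0 <= b <= 0xF4:
            b2, b3, b4 = next(it), next(it), next(it)
            result.append(((b & 0x07) << 18) + ((b2 & 0x3F) << 12)
                          + ((b3 & 0x3F) << 6) + (b4 & 0x3F))
        else:
            # Not reachable for well-formed input (assumption of this sketch).
            raise ValueError("input is not well-formed WTF-8")
    return result
```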

738 | Converting between WTF-8 and potentially ill-formed UTF-16

739 | 740 | To convert from potentially ill-formed UTF-16 to WTF-8, 741 | run these steps: 742 | 743 | 747 | 748 | Note: This conversion never fails and is lossless. 749 | 750 | To convert from WTF-8 to potentially ill-formed UTF-16, 751 | run these steps: 752 | 753 | 757 | 758 | Note: This conversion never fails 759 | and, if the input is well-formed in WTF-8, is lossless. 760 | 761 | 762 |
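A plausible reading of both conversions, consistent with the notes above, is to decode the input to code points and then encode those code points in the target encoding. The Python sketch below follows that reading. It uses Python's `surrogatepass` error handler as a stand-in for generalized UTF-8 encoding and decoding of single code points, which is a shortcut of this example rather than part of the specification; both function names are hypothetical, and `wtf8_to_wtf16` assumes well-formed WTF-8 input.

```python
def wtf16_to_wtf8(units) -> bytes:
    """Convert potentially ill-formed UTF-16 (list of ints) to WTF-8."""
    out = bytearray()
    i = 0
    while i < len(units):
        p = units[i]
        i += 1
        if (0xD800 <= p <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            # Paired surrogates become one supplementary code point.
            p = 0x10000 + ((p - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        out += chr(p).encode("utf-8", "surrogatepass")
    return bytes(out)


def wtf8_to_wtf16(data: bytes):
    """Convert well-formed WTF-8 to potentially ill-formed UTF-16."""
    units = []
    for ch in data.decode("utf-8", "surrogatepass"):
        p = ord(ch)
        if p > 0xFFFF:
            offset = p - 0x10000
            units += [(offset >> 10) + 0xD800, (offset & 0x3FF) + 0xDC00]
        else:
            units.append(p)
    return units


units = [0x41, 0xD83D, 0xDCA9, 0xD800]
print(wtf8_to_wtf16(wtf16_to_wtf8(units)) == units)  # True
```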

763 | Converting between WTF-8 and UTF-8

764 | 765 | Since WTF-8 is a superset of UTF-8, 766 | any sequence of bytes that is well-formed in UTF-8 767 | is also well-formed in WTF-8 and represents the same text. 768 | To convert from UTF-8 to WTF-8, 769 | return the input unchanged. 770 | 771 | Note: This conversion never fails and is lossless. 772 | 773 | To convert lossily from WTF-8 to UTF-8, 774 | replace any surrogate byte sequence 775 | with the sequence of three bytes <0xEF, 0xBF, 0xBD>, 776 | the UTF-8 encoding of the replacement character. 777 | 778 | Note: Since surrogate byte sequences are also three bytes long, 779 | this conversion can be done in place. 780 | 781 | Note: This conversion never fails but is lossy. 782 | 783 | To convert strictly from WTF-8 to UTF-8, run these steps: 784 | 785 |
    786 |
  1. If the input contains a surrogate byte sequence, return failure. 787 |
  2. Otherwise, return the input unchanged. 788 |
789 | 790 | Note: This conversion is lossless when it succeeds, but it can fail. 791 | 792 | 793 |
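Both WTF-8 to UTF-8 conversions reduce to finding surrogate byte sequences. The Python sketch below assumes well-formed WTF-8 input, in which the byte 0xED can only appear as the first byte of a three-byte sequence, so a plain scan is enough. The function names are made up for the example, and returning `None` to signal failure is a choice of this sketch.

```python
from typing import Optional

REPLACEMENT = b"\xEF\xBF\xBD"  # UTF-8 encoding of U+FFFD


def wtf8_to_utf8_lossy(data: bytes) -> bytes:
    """Replace every surrogate byte sequence with U+FFFD."""
    out = bytearray(data)
    for i in range(len(out) - 2):
        if out[i] == 0xED and out[i + 1] >= 0xA0:
            # Same length in and out, so this could also be done in place.
            out[i:i + 3] = REPLACEMENT
    return bytes(out)


def wtf8_to_utf8_strict(data: bytes) -> Optional[bytes]:
    """Return the input unchanged, or None if it contains a surrogate."""
    for i in range(len(data) - 1):
        if data[i] == 0xED and data[i + 1] >= 0xA0:
            return None
    return data


print(wtf8_to_utf8_lossy(b"A\xed\xa0\x80B"))   # b'A\xef\xbf\xbdB'
print(wtf8_to_utf8_strict(b"A\xed\xa0\x80B"))  # None
```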

794 | Concatenating WTF-8 strings

795 | 796 | Concatenating WTF-8 strings requires extra care to preserve well-formedness. 797 | 798 | To concatenate two WTF-8 strings, run these steps: 799 | 800 |
    801 |
  1. 802 | If the left input string ends with a lead surrogate byte sequence 803 | and the right input string starts with a trail surrogate byte sequence, 804 | run these substeps: 805 |
      806 |
    1. 807 | Let lead and trail be two code points, 808 | the respective results of 809 | decoding from WTF-8 810 | these two surrogate byte sequences. 811 |
    2. 812 | Let supplementary be the 813 | encoding to WTF-8 814 | of a single code point of value 815 | 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00) 816 |
    3. 817 | Let left be the substring of the left input string 818 | that excludes its three final bytes. 819 |
    4. 820 | Let right be the substring of the right input string 821 | that excludes its three initial bytes. 822 |
    5. 823 | Return the concatenation of 824 | left, supplementary, and right. 825 |
    826 |
  2. Otherwise, return the concatenation of the two input byte sequences. 827 |
828 | 829 | Note: This is equivalent to 830 | converting both strings 831 | 832 | to potentially ill-formed UTF-16, 833 | concatenating the resulting 16-bit code unit sequences, 834 | then converting the concatenation back 835 | to WTF-8. 836 | 837 | 838 |
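The concatenation steps can be sketched in Python as below, assuming both inputs are well-formed WTF-8. The three-byte decoding and four-byte encoding are inlined from the formulas in the decoding and encoding algorithms; `concat_wtf8` is an illustrative name.

```python
def concat_wtf8(left: bytes, right: bytes) -> bytes:
    """Concatenate two well-formed WTF-8 strings, merging a split pair."""
    ends_with_lead = (len(left) >= 3 and left[-3] == 0xED
                      and 0xA0 <= left[-2] <= 0xAF)
    starts_with_trail = (len(right) >= 3 and right[0] == 0xED
                         and 0xB0 <= right[1] <= 0xBF)
    if ends_with_lead and starts_with_trail:
        # Decode the two surrogate byte sequences to code points.
        lead = ((left[-3] & 0x0F) << 12) + ((left[-2] & 0x3F) << 6) + (left[-1] & 0x3F)
        trail = ((right[0] & 0x0F) << 12) + ((right[1] & 0x3F) << 6) + (right[2] & 0x3F)
        p = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
        # Encode the supplementary code point as four bytes.
        supplementary = bytes([0xF0 | (p >> 18),
                               0x80 | ((p >> 12) & 0x3F),
                               0x80 | ((p >> 6) & 0x3F),
                               0x80 | (p & 0x3F)])
        return left[:-3] + supplementary + right[3:]
    return left + right


print(concat_wtf8(b"a\xed\xa0\xbd", b"\xed\xb2\xa9b"))  # b'a\xf0\x9f\x92\xa9b'
```

Without the merge, the plain byte concatenation in this example would contain a surrogate pair byte sequence and so would be ill-formed in WTF-8.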

839 | Implementations

840 | 841 | This section is non-normative. 842 | 843 | 870 | 871 | 872 |

873 | Acknowledgments

874 | 875 | Thanks to Coralie Mercier for 876 | 877 | coining the name WTF-8. 878 | 879 | Thanks for feedback and contributions from 880 | Anne van Kesteren, 881 | David Baron, 882 | Dylan Petonke, 883 | Guillaume Knispel, 884 | Henri Sivonen, 885 | Jacob Lifshay, 886 | James Graham, 887 | Lily Ballard, 888 | Mathias Bynens, 889 | Ms2ger, 890 | Sam Tobin-Hochstadt, 891 | Tab Atkins. 892 | -------------------------------------------------------------------------------- /logo-wtf-8.svg: -------------------------------------------------------------------------------- 1 | 2 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | -------------------------------------------------------------------------------- /zero.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/SimonSapin/wtf-8/cd80a83d87385ce5efb08554e86dce114fde5026/zero.png --------------------------------------------------------------------------------