├── w3c.json
├── README.md
├── local.css
└── index.html


/w3c.json:
--------------------------------------------------------------------------------
{
  "group": 32113
, "contacts": "rishida"
, "repo-type": "note"
, "policy": "restricted"
}


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# unicode-xml
Unicode in XML and other Markup Languages


See https://w3c.github.io/unicode-xml/ to view the document.


--------------------------------------------------------------------------------
/local.css:
--------------------------------------------------------------------------------
h2 {
  margin-top: 3em;
  margin-bottom: 0em;
}

.head h2, #abstract h2, #sotd h2 {
  margin-top: 0;
}

h3 {
  margin-top: 3em;
}

.deprecation-box h3 {
  margin: 0;
  font-weight: bold;
  font-size: 120%;
  color: black;
}
h4 {
  font-size: 100%;
  font-weight: normal;
  color: #005a9c;
  margin-top: 2em;
}

.leadin {
  font-weight: bold;
}

ins {
  background-color: #99FF99;
  text-decoration: none;
}

del {
  display: inline;
  color: silver;
}

figure {
  margin-bottom: 2em;
  text-align: center;
}

figcaption {
  text-align: center;
  margin: 0.5em 2em;
  font-style: italic;
  font-size: 90%;
}

.figno:after {
  content: ':\00A0 ';
}

a.termref:link {
  color: #C60;
  text-decoration: none;
  border-bottom: 1px dotted #FC0;
}

a.termref:hover {
  color: #C60;
  text-decoration: none;
  border-bottom: 1px dotted #FC0;
}

a.termref:visited {
  color: #C60;
  text-decoration: none;
  border-bottom: 1px dotted #FC0;
}

a.termref:active {
  color: #C60;
  text-decoration: none;
  border-bottom: 1px dotted #FC0;
}

.qterm:before, .qchar:before { content: "‘"; }

.qterm:after, .qchar:after { content: "’"; }

.quote:before { content: '“'; }

.quote:after { content: '”'; }

code {
  color: #A52A2A;
  font-family: Consolas, "Andale Mono", "Lucida Console", "Lucida Sans Typewriter", Monaco, "Courier New", monospace;
  font-size: 100%;
}

samp, kbd {
  font-family: Consolas, "Andale Mono", "Lucida Console", "Lucida Sans Typewriter", Monaco, "Courier New", monospace;
  font-size: 100%;
}

.uname {
  text-transform: uppercase;
  font-size: 85%;
  letter-spacing: 0.03em;
}

.lettername {
  font-style: italic;
}

.tab-format {
  margin-left: 10%;
}

table td {
  border: 1px solid #ddd;
  padding: 10px;
}

.rtlTermCell {
  direction: rtl;
  text-align: right;
}

.exampleList {
  float: left;
  margin: 10px;
}


--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
Unicode in XML and other Markup Languages

This document contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.

This document has been withdrawn

Many of the materials in this document are stale and out of date; the W3C is maintaining this version solely as a historical reference. This document was originally produced as a joint publication between the W3C and the Unicode Consortium. In 2016, Unicode withdrew publication as a Unicode Technical Report.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise a separate issue or email for each comment, and point to the section you are commenting on using a URL for the dated version of the document.
134 |

Introduction

135 |

The Unicode Standard  [Unicode] defines the 136 | universal character set. Its primary goal is to provide an unambiguous 137 | encoding of the content of plain text, ultimately covering all languages in 138 | the world, but also major text-based notational systems for science, 139 | technology, music, and scholarship.

140 |

Currently in its sixth major version, Unicode 141 | contains a large number of characters covering most of the currently used 142 | scripts in the world. It also contains additional characters for 143 | interoperability with older character encodings, and characters with 144 | control-like functions included primarily for reasons of providing 145 | unambiguous interpretation of plain text. Unicode provides specifications for 146 | use of all of these characters.

147 |

For document and data interchange, the Internet and the World Wide Web 148 | make extensive use of marked-up text such as HTML 149 | and XML. In many instances, markup provides the same, or 150 | essentially similar features to those provided by format characters in the 151 | Unicode Standard for use in plain text. Another special character category 152 | provided by Unicode are compatibility characters. While there may be valid 153 | reasons to support these characters and their specifications in plain text, 154 | their use in marked-up text can conflict with the rules of the markup 155 | language. Formatting characters are discussed in Section 3, Characters not Suitable for Use With Markup and 157 | Section 4, Format Characters Suitable for Use With 158 | Markup, compatibility characters in Section 5, Characters with Compatibility Mappings. 160 | Section 6 briefly discusses noncharacters, and Section 7 is devoted to white 161 | space.

162 |

The interaction of character 163 | encoding and methods of escaping characters in markup are discussed in the 164 | Character Model for the World Wide Web [Charmod].

165 |

The issues of using Unicode characters with marked-up text depend to some 166 | degree on the rules of the markup language in question and the set of 167 | elements it contains. In a narrow sense, this document concerns itself only 168 | with XML, and to some extent HTML. However, much of the general information 169 | presented here should be useful in a broader context, including some page 170 | layout languages.

171 |

Many of the recommendations of this 172 | report depend on the availability of particular markup or styling. Where 173 | possible, appropriate DTDs or Schemas should be used or designed to make 174 | such markup or styling available, or the DTDs or Schemas used should be 175 | appropriately extended. The current version of this document makes no 176 | specific recommendations for the design of DTDs or Schemas, or for the use 177 | of particular DTDs or Schemas, but the information presented here may be 178 | useful to designers of DTDs and Schemas, and to people selecting DTDs or 179 | Schemas for their applications.

180 |

The recommendations of this report do not apply in the case 181 | of XML used for blind data transport and similar cases.

182 | 183 |

Notation

This report uses XML [XML] as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'xhtml:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'xhtml:' are assumed to include a namespace declaration of xmlns:xhtml="...".

Characters are denoted using the notation used in the Unicode Standard, that is, an optional U+ followed by their hexadecimal number, using at least 4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML this could be expressed as "&#x1234;" or "&#x10FFFD;".
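
The two notations can be derived from a character mechanically. The following is a minimal Python sketch, not part of this report; the function names `to_u_notation` and `to_ncr` are hypothetical and chosen only for illustration.

```python
def to_u_notation(ch: str) -> str:
    """Return the U+ notation for one character, using at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


def to_ncr(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


for ch in ("\u1234", "\U0010FFFD"):
    print(to_u_notation(ch), to_ncr(ch))
```

For the two characters used as examples above, the loop prints "U+1234 &#x1234;" and "U+10FFFD &#x10FFFD;".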

General Considerations

There are several general points to consider when looking at the interaction between character encoding and markup.

Linearity versus Structure

Encoding text as a sequence of characters without further information leads to a linear sequence, commonly called plain text. Character follows character, without any particular structure. Markup, on the other hand, defines a hierarchical structure for the text or data. In the case of XML and most other, similar markup languages, the markup defines a tree structure. While this tree structure is linearized for transmission in the XML document, once the document has been parsed, the tree is available directly.

Operations that are easy to perform on trees are often difficult to perform on linear sequences and vice versa. By separating functionality between character encoding and markup appropriately, the architecture becomes simpler, more powerful and longer-lasting.

In particular, operations on hierarchical structures can easily make sure that information is kept in context. Attributes assigned to parts of a document are moved together with the associated part of the document. Assigning an attribute to a part of a document limits the scope of the attribute to that part of the document. Performing the same operations on linear sequences of characters using control codes to set attributes and to delimit their scope requires much more work and is error-prone. Locating the start or end of a span of text of the same attribute requires scanning backwards and forwards for the embedded delimiter or control code. Moving or editing text often results in mismatched control codes, so that an attribute might suddenly apply to text it was not intended for.
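
As an illustration of attributes travelling with the part of the document they belong to, the following Python sketch uses the standard `xml.etree.ElementTree` module. The element names, the `dir` attribute on `p`, and the helper name `move_first_child_to_end` are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def move_first_child_to_end(xml_text: str) -> str:
    """Parse xml_text, move the root's first child to the end, and serialize."""
    root = ET.fromstring(xml_text)
    first = root[0]
    root.remove(first)
    root.append(first)  # the element keeps its own attributes and content
    return ET.tostring(root, encoding="unicode")


print(move_first_child_to_end('<doc><p dir="rtl">first</p><p>second</p></doc>'))
```

The moved paragraph still carries dir="rtl", and the attribute still applies to that paragraph only. Doing the same with a paired control code in plain text would require finding and moving both the opening and the terminating code.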

Overlap of Control Code and Markup Semantics

When markup is not available, plain text may require control characters. This is usually the case where plain text must contain some scoping or attribute information in order to be legible, i.e. to be able to transmit the same content between originator and receiver. Many of these control characters have direct equivalents in particular markup languages, since markup handles these concerns efficiently. If both characters and their markup equivalents may be present in the same text, the question of priority is raised. Therefore it is important to identify and resolve these ambiguities at the time markup is first applied.

Markup and Styling

Besides the basic character encoding and text markup there is a third contributor to text functionality, namely styling. Markup is concerned with the logical structure of the text or data, e.g. to indicate sections, subsections, and headers in a document, or to indicate the various fields of an address record. Styling is used to present the information in various ways, e.g. in different fonts, different type styles (italic, bold), different colors, etc. Some character codes do not encode a generic character, but a styled character. Where these characters are used, styling information is frozen, i.e. it is no longer possible to alter the appearance of the text by applying style information. However, there are many examples where a historically free stylistic variation has over time become a semantic distinction that is properly encoded as plain text. Sometimes what is a free variation in some contexts implies a strict semantic differentiation in others. In all such instances, altering the appearance of the text by styling information would irreparably alter the content of the text. This is of particular concern with mathematical notation or systems for phonetic and phonemic transcription, which make extensive semantic use of styles on a character-by-character basis.

Coincidence of Markup and Functions

Dealing with various functionalities on the markup level has the additional advantage that in most cases, text portions that need some particular attribute (or styling) are actually those text portions identified by markup. A paragraph may be in French, a citation may need a bidi embedding, a keyword may be in italics, a list number may be circled, and so on. This makes it very efficient to associate those attributes with markup.

However, where local or point-like functionality is needed, markup is not very efficient and its main benefit, easy manipulation of scope, is not required. On the contrary, the intrusion of markup in the middle of words can make search or sort operations more difficult. For these cases expressing the information as character codes is not only a viable, but often the preferred alternative, which needs to be considered in the design of markup languages.
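
The effect of markup inside a word on a simple search can be sketched in Python as follows. The sample markup and the helper name `text_content` are assumptions for illustration; the sketch uses the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

SOURCE = "<p>Unicode in mark<em>up</em> languages</p>"


def text_content(xml_text: str) -> str:
    """Concatenate all character data of the parsed document, dropping the tags."""
    return "".join(ET.fromstring(xml_text).itertext())


# A search over the serialized form misses the word, because a tag splits it.
print("markup" in SOURCE)

# The same search succeeds once the markup has been stripped.
print("markup" in text_content(SOURCE))
```

The first search fails and the second succeeds, so any search or sort over marked-up text has to strip or skip the markup first. A character code in the same position would not split the word in this way.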

Extensibility of Markup

Character encoding works with a range of integers used as character codes. This is extremely efficient, but has some limitations. Markup, on the other hand, is much more extensible. Using technologies such as XML Namespaces [Namespace] and their application in schema languages like [XML Schema], various vocabularies can be mixed.

Suitability of Characters in Markup

The suitability of a particular character for markup depends on its status in the Unicode Standard, the nature of its behavior in text and the availability of equivalent markup. Many format characters that are needed for advanced plain text are not suitable for use with markup. Section 3 gives a list and detailed descriptions. However, not all format characters are unsuitable for use with markup. Section 4 provides a list of format characters that are suitable for use with markup and gives some discussion about their use. In addition to format characters, the Unicode Standard also has compatibility characters, some of which may be replaceable by suitable markup. These characters are discussed in Section 5.

Characters not Suitable for use With Markup

There are characters which are unsuitable in the context of markup in XML/HTML and whose use is discouraged, because one or more of the following conditions apply:

Section 3.1 provides a list of such characters. Sections 3.2 through 3.10 discuss in more detail the following points for the discouraged characters.

Table of Characters not Suitable for use With Markup

The following table contains the characters currently considered not suitable for use with markup in XML or HTML. (See however the note in the Introduction.) They may also be unsuitable for other markup or page layout languages. For determining possible conflict this report uses the markup available in HTML.

Characters not suitable for use with markup

Codepoints | Names/Description | Short Comment
U+0340..U+0341 | Clones of grave and acute | Deprecated in Unicode
U+17A3, U+17D3 | Obsolete characters for Khmer | Deprecated in Unicode
U+2028..U+2029 | Line and paragraph separator | Use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent
U+202A..U+202E | BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML4.01]
U+206A..U+206B | Activate/Inhibit Symmetric swapping | Deprecated in Unicode
U+206C..U+206D | Activate/Inhibit Arabic form shaping | Deprecated in Unicode
U+206E..U+206F | Activate/Inhibit National digit shapes | Deprecated in Unicode
U+FFF9..U+FFFB | Interlinear annotation characters | Use ruby markup [Ruby]
U+FEFF | as ZWNBSP | Use U+2060 Word Joiner instead
U+FEFF | as Byte Order Mark | Use only at the start of a file, not as part of markup
U+FFFC | Object replacement character | Use markup, e.g. HTML <object> or HTML <img>
U+1D173..U+1D17A | Scoping for Musical Notation | Use an appropriate markup language
U+E0000..U+E007F | Language Tag code points | Use xhtml:lang or xml:lang

Except for the Line and Paragraph Separators and the Byte Order Mark, it is acceptable for browsers and similar user agents to ignore discouraged characters in HTML or XML. It is up to authoring tools to ensure proper conversion between these characters and equivalent markup where it exists.
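
An authoring tool could locate the characters in the table above with a simple range lookup. The following Python sketch is illustrative only: the name `find_discouraged`, the shortened comments, and the decision to skip a U+FEFF at offset 0 (treating it as a byte order mark) are assumptions of this example, not requirements of this report.

```python
# (first, last, short comment) for the code point ranges listed in the table.
DISCOURAGED = [
    (0x0340, 0x0341, "deprecated clones of grave and acute"),
    (0x17A3, 0x17A3, "obsolete Khmer character"),
    (0x17D3, 0x17D3, "obsolete Khmer character"),
    (0x2028, 0x2029, "line or paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated formatting character"),
    (0xFEFF, 0xFEFF, "ZWNBSP outside the start of the text"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical scoping control"),
    (0xE0000, 0xE007F, "language tag code point"),
]


def find_discouraged(text):
    """Yield (offset, U+ notation, comment) for each discouraged character."""
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if cp == 0xFEFF and offset == 0:
            continue  # assumption: a leading U+FEFF is a byte order mark
        for first, last, comment in DISCOURAGED:
            if first <= cp <= last:
                yield offset, f"U+{cp:04X}", comment
                break


for hit in find_discouraged("a\u2028b\U000E0001"):
    print(hit)
```

Offsets are counted in code points, since Python strings are sequences of code points. What to do with each hit (ignore, remove, or convert to markup) depends on the character and is discussed in the following sections.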

Line and Paragraph Separator, U+2028..U+2029

Short description: The line and paragraph separators provide an unambiguous means to denote hard line breaks and paragraph delimiters in plain text.

Reason for inclusion: These characters were introduced into the Unicode Standard to overcome the ambiguous and widely divergent use of control codes for this purpose. See Section 5.8, Newline Guidelines, in [Unicode].

Problems when used in markup: Including these characters in markup text does not work where it would duplicate the existing markup commands for delimiting paragraphs and lines.

Problems with other uses: They can also be problematic when used in plain text, because legacy data is usually converted code point for code point into Unicode and all receivers of Unicode plain text effectively have to be able to interpret the existing use of control codes for this purpose. As a result, fewer Unicode implementations support these characters than would otherwise be the case.

Replacement markup: In HTML, use <xhtml:br /> instead of U+2028 and surround paragraphs by <xhtml:p> and </xhtml:p> instead of separating them with U+2029.

What to do if detected: In a browser context, treat as white space, or ignore. When received in an editing context, replace the character by the corresponding markup.
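
In an editing context, the replacement described above could look like the following Python sketch. The function name `separators_to_markup` is hypothetical, and dropping empty paragraphs is an assumption of this example; `escape` comes from the standard `xml.sax.saxutils` module.

```python
from xml.sax.saxutils import escape

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def separators_to_markup(text: str) -> str:
    """Wrap PS-delimited paragraphs in xhtml:p and turn each LS into xhtml:br."""
    paragraphs = [p for p in text.split(PS) if p]  # assumption: skip empty ones
    return "".join(
        "<xhtml:p>" + escape(p).replace(LS, "<xhtml:br />") + "</xhtml:p>"
        for p in paragraphs
    )


print(separators_to_markup("first line\u2028second line\u2029next & last"))
```

The text is escaped before the tags are inserted, so that characters such as "&" in the plain text do not become markup by accident.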

Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF), U+202A..U+202E

Short description: The bidi embedding controls are required to supplement the Unicode Bidirectional Algorithm in plain text.

Reason for inclusion: The Unicode Bidirectional Algorithm unambiguously resolves the display direction for bidirectional text. It does so by assigning all characters directional categories and then resolving these in context. In a number of circumstances this implicit method does not produce satisfactory results and embedding controls are needed to ensure that sender and receiver agree on the display direction for a given text. See Unicode Standard Annex #9, The Bidirectional Algorithm [UAX 9].

Problems when used in markup: These characters duplicate available markup, which is better suited to handle the stateful nature of their effect.

Problems with other uses: The embedding controls introduce a state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF.

Replacement markup: The following table gives the replacement markup:

Unicode | Equivalent markup | Comment
RLO | <xhtml:bdo dir = "rtl"> |
LRO | <xhtml:bdo dir = "ltr"> |
PDF | </xhtml:bdo> | when used to terminate RLO or LRO only, otherwise ignore
RLE | dir = "rtl" | attribute on block or inline element
LRE | dir = "ltr" | attribute on block or inline element

For details on bidi markup, please see Section 8.2 of HTML [HTML 4.0-8.2]. The text of HTML 4.0 gives this recommendation:

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to insure proper nesting of markup and directional embedding or override, otherwise, rendering results are undefined.

This document goes beyond HTML and recommends that only the markup be used.

The interpretation of how to handle directionality markup for block level elements differs between versions of [CSS].

What to do if detected: In a browser context, ignore. When received in an editing context, replace the characters by the appropriate markup.
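
The replacement in an editing context can be sketched with a stack of pending end tags. This Python example is a rough illustration under stated assumptions: the name `bidi_controls_to_markup` is hypothetical, `xhtml:span` is merely one possible inline element to carry the dir attribute for LRE and RLE, and anything still open at the end of the text is closed there, whereas the Bidirectional Algorithm ends embeddings at paragraph boundaries.

```python
from xml.sax.saxutils import escape

# Control character -> (start tag, end tag). The span element is an assumption.
OPENERS = {
    "\u202A": ('<xhtml:span dir="ltr">', "</xhtml:span>"),  # LRE
    "\u202B": ('<xhtml:span dir="rtl">', "</xhtml:span>"),  # RLE
    "\u202D": ('<xhtml:bdo dir="ltr">', "</xhtml:bdo>"),    # LRO
    "\u202E": ('<xhtml:bdo dir="rtl">', "</xhtml:bdo>"),    # RLO
}
PDF = "\u202C"


def bidi_controls_to_markup(text: str) -> str:
    """Replace bidi embedding controls by nested elements with dir attributes."""
    out = []
    pending = []  # end tags of embeddings and overrides that are still open
    for ch in text:
        if ch in OPENERS:
            start, end = OPENERS[ch]
            out.append(start)
            pending.append(end)
        elif ch == PDF:
            if pending:  # an unmatched PDF is ignored
                out.append(pending.pop())
        else:
            out.append(escape(ch))
    while pending:  # close whatever was left open
        out.append(pending.pop())
    return "".join(out)


print(bidi_controls_to_markup("abc \u202Edef\u202C ghi"))
```

Because the end tags are kept on a stack, the generated elements always nest properly, which is exactly the property that is hard to maintain with the control characters themselves.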

Deprecated Formatting Characters, U+206A..U+206F

Short description: These characters are deprecated. They were originally intended to allow explicit activation of contextual shaping, numeric digit rendering and symmetric swapping.

Reason for inclusion: These characters were retained from draft versions of ISO 10646.

Problems when used in markup: The processing model for these characters is not supported in markup.

Problems with other uses: The Unicode Standard requires that symmetric swapping, contextual shaping, and alternate digit shapes are enabled by default and no longer supports inhibiting any of them by use of these character codes. The most likely effect of their occurrence in generated text would be that of a 'garbage' character.

Conversion for use with markup: Apply the appropriate conversion to bring the data stream in line with the Unicode text model for bidirectional text and cursively connected scripts.

What to do if detected: When received by a browser as part of marked up text, they may be ignored. When received in an editing context, they may be removed, possibly with a warning. Alternatively, an appropriate conversion from the legacy text model may be provided. This will most likely be limited to applications directly interfacing with and knowledgeable of the particular legacy implementation that inspired these characters.
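
Removal with a warning, as suggested for an editing context, could be sketched in Python as follows. The function name `strip_deprecated_format` and the wording of the warnings are assumptions of this example.

```python
import re

# U+206A..U+206F, the deprecated formatting characters discussed above.
DEPRECATED_FORMAT = re.compile("[\u206A-\u206F]")


def strip_deprecated_format(text: str):
    """Return (cleaned text, warnings) with U+206A..U+206F removed."""
    warnings = [
        f"removed U+{ord(m.group()):04X} at offset {m.start()}"
        for m in DEPRECATED_FORMAT.finditer(text)
    ]
    return DEPRECATED_FORMAT.sub("", text), warnings


cleaned, warnings = strip_deprecated_format("ab\u206Acd")
print(cleaned, warnings)
```

The sketch only removes the characters. A conversion from the legacy text model would need knowledge of the particular implementation that produced them.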

Byte Order Mark, ZWNBSP, U+FEFF

Short description: U+FEFF has two functions. It is formally known as zero width no-break space (ZWNBSP), and can act as a word joiner, but its primary use is as byte order mark (BOM), to indicate in a file signature at the start of a file that the file is in a particular Unicode encoding form and of a particular byte order. Using U+FEFF as a word joiner in new data is deprecated as of [Unicode3.2] in favor of U+2060 word joiner (WJ). The use as byte order mark remains unaffected.

Reason for inclusion: Originally included in Unicode for the sole purpose of indicating byte order or use in file signatures, the character acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and Unicode. When used as a byte order mark the character is placed at the beginning of a file. If a recipient views it as FEFF then the byte orders of sender and receiver match. If the recipient views it as FFFE (a noncharacter code point) then the sender used the opposite byte order from the recipient, and the recipient needs to invert the byte order or refuse to read the file. When used as a ZWNBSP the character is intended to prevent breaks between adjacent characters. This function is now provided by U+2060 word joiner (WJ), making it unnecessary to insert U+FEFF in the middle of a file. For more information see Chapter 16 of [Unicode].
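
The byte order check described above amounts to looking at the first two bytes of a UTF-16 file. The following Python sketch assumes the data is known to be UTF-16 (it does not distinguish the UTF-32 signature, which also starts with FF FE in little-endian order); the function name `utf16_byte_order` is hypothetical.

```python
def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None, based on a leading U+FEFF signature."""
    if data[:2] == b"\xFE\xFF":
        return "big"     # the bytes read as FEFF in big-endian order
    if data[:2] == b"\xFF\xFE":
        return "little"  # a big-endian reader would see FFFE: order is reversed
    return None          # no signature present


print(utf16_byte_order("\uFEFFtext".encode("utf-16-be")))
print(utf16_byte_order("\uFEFFtext".encode("utf-16-le")))
```

A recipient that gets None has to determine the byte order by other means, for example from a declared encoding.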

Problems when used in markup: Using U+FEFF as ZWNBSP makes it impossible to distinguish it from the case where a byte order mark was left in the middle of a file inadvertently due to incorrect splicing. U+FEFF can and in some cases (XML encoded in UTF-16) must be used at the start of a file containing markup, but as a signature, this is not part of actual markup or marked-up content. Some older versions of browsers and parsers may not correctly recognize U+FEFF at the start of a file encoded in UTF-8. For details of how U+FEFF participates in encoding detection of XML files, see Appendix F of [XML 1.0].

Problems with other uses: The use of byte order mark as ZWNBSP is also problematic when used in plain text, and has been deprecated for that purpose in favor of U+2060 word joiner. The use of U+FEFF in file signatures to indicate byte order is the only recommended use of this character.

Replacement markup: None. In locations other than the beginning of a text file, U+FEFF can be removed or replaced by U+2060 in an editing environment.

What to do if detected: When received by a browser as part of marked-up text, treat depending on location. At the start of an external entity, treat as byte order mark (i.e. as part of the character encoding, not as part of the parsed character stream, see e.g. Section 4.3.3 of [XML 1.0]). Otherwise, assume it is older data using it as ZWNBSP. When receiving plain text in an editing environment, editors may take one or more of several actions: replace ZWNBSP in the middle of a file with WJ or issue a warning to the user.
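
The location-dependent treatment can be sketched in a few lines of Python. The function name `normalize_feff` is hypothetical, and the sketch assumes that the text has already been decoded, so that a signature, if present, appears as a leading U+FEFF.

```python
BOM = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE / byte order mark
WJ = "\u2060"   # WORD JOINER


def normalize_feff(text: str) -> str:
    """Drop a leading U+FEFF (signature); replace any other U+FEFF by U+2060."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WJ)


print(ascii(normalize_feff("\uFEFFno\uFEFFbreak")))
```

An editor following this report might additionally issue a warning for each U+FEFF found in the middle of the text.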

Interlinear Annotation Characters, U+FFF9..U+FFFB

Short description: The interlinear annotation characters are used to delimit interlinear annotations in certain circumstances. They are intended to provide text anchors and delimiters for interlinear annotation for in-process use and are not intended for interchange.

Reason for inclusion: The interlinear annotation characters were included in Unicode only in order to reserve code points for very frequent application-internal use. The interlinear annotation characters are used to delimit interlinear annotations in contexts where other delimiters are not available, and where non-textual means exist to carry formatting information. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. This is called out-of-band information. The overall implementation makes sure that these two structures are kept in sync. If the text contains interlinear annotations, it is extremely helpful for implementations to have delimiters in the text itself, even though delimiters are not otherwise used for style markup. With this method, and unlike the case of the object replacement character, all textual information can remain in the standard text stream, but any additional formatting information is kept separately. In addition, the Interlinear Annotation Anchor serves as a placeholder for formatting information for the whole annotation object, the same way a paragraph mark can be a placeholder to attach paragraph formatting information.

Problems when used in markup: Including interlinear annotation characters in marked-up text does not work because the additional formatting information (how to position the annotation, ...) is not available.

Problems with other uses: The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose. In particular, on older display systems that simply ignore or replace the Interlinear Annotation Characters, the meaning of the text may be changed.

Replacement markup: The markup to be used in place of the Interlinear Annotation Characters depends on the formatting and nature of the interlinear annotation in question. For ruby, please see [Ruby].

What to do if detected: When received by a browser as part of marked-up text, they may be ignored. When receiving plain text in an editing environment, editors may take one or more of several actions: remove U+FFF9 together with removing all characters between U+FFFA and following U+FFFB; ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively, or into similar characters; issue a warning to the user; or tentatively convert into appropriate ruby markup for further editing and formatting by the user.
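
Three of the editing actions listed above can be sketched with one regular expression. This Python example is only a sketch: the function names are hypothetical, it handles only simple, non-nested annotations with a single U+FFFA, and the ruby output assumes the simple rb/rt structure of [Ruby] with the 'xhtml:' prefix used in this report.

```python
import re
from xml.sax.saxutils import escape

# anchor (U+FFF9), base text, separator (U+FFFA), annotation, terminator (U+FFFB)
ANNOTATION = re.compile(
    "\uFFF9([^\uFFF9-\uFFFB]*)\uFFFA([^\uFFF9-\uFFFB]*)\uFFFB"
)


def drop_annotations(text: str) -> str:
    """Keep the base text; remove the annotation and all three delimiters."""
    return ANNOTATION.sub(r"\1", text)


def bracket_annotations(text: str) -> str:
    """Keep the base text and show the annotation in square brackets."""
    return ANNOTATION.sub(r"\1[\2]", text)


def annotations_to_ruby(text: str) -> str:
    """Tentatively convert each annotation into simple ruby markup."""
    ruby = r"<xhtml:ruby><xhtml:rb>\1</xhtml:rb><xhtml:rt>\2</xhtml:rt></xhtml:ruby>"
    return ANNOTATION.sub(ruby, escape(text))  # escape first, then add tags


sample = "see \uFFF9WWW\uFFFAWorld Wide Web\uFFFB for details"
print(drop_annotations(sample))
print(bracket_annotations(sample))
print(annotations_to_ruby(sample))
```

Annotation characters that do not form a complete anchor, separator, terminator sequence are left untouched by all three functions, so a caller may want to warn about any that remain.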

Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a code point for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include, ...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text, because there is no way in plain text to provide the actual object information or a reference to it.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <xhtml:img src='...' />, <xhtml:object ...>, or <xhtml:applet ...>. These constructs allow providing all additional information needed to identify and use the object in question.

What to do if detected: Browsers may ignore this character. When received in an editing context, if the actual object is accessible, editors may either replace the character by the appropriate markup for that object, or otherwise remove it, ideally providing a warning.

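The editing-context handling described for U+FFFC can be sketched in Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the names `find_object_placeholders` and `strip_object_placeholders` are hypothetical, and the sketch covers only the case where the object itself is not accessible, so the character is removed and a count is returned for a warning.

```python
from typing import List, Tuple

OBJECT_REPLACEMENT = "\uFFFC"  # U+FFFC OBJECT REPLACEMENT CHARACTER


def find_object_placeholders(text: str) -> List[int]:
    """Return the index of every U+FFFC found in text."""
    return [i for i, ch in enumerate(text) if ch == OBJECT_REPLACEMENT]


def strip_object_placeholders(text: str) -> Tuple[str, int]:
    """Remove every U+FFFC and report how many were removed,
    so that the caller can warn the user."""
    removed = text.count(OBJECT_REPLACEMENT)
    return text.replace(OBJECT_REPLACEMENT, ""), removed
```

An editor that does have access to the object could use the indices from `find_object_placeholders` to substitute markup at each position instead of removing the character.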

Musical Controls, U+1D173..U+1D17A

Short description: A series of characters for controlling scope in musical notation.

Reason for inclusion: These characters designate the start and end of common musical constructs. Full musical layout depends on additional information, for example pitch, that cannot be encoded using Unicode. However, many musical symbols may be depicted in isolation (and without assigning pitch) as part of a textual discussion of music. Plain text use of Unicode characters is primarily intended for this latter purpose. The scoping operators can be used to support limited renderings of beams, slurs, phrases, etc. in this context. However, in the context of markup languages, musical scoring calls for a dedicated markup language (analogous to MathML) which would be expected to contain markup for these constructs.

Problems when used in markup: These characters duplicate information that can in principle be expressed in markup.

Problems with other uses: Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters.

Replacement markup: Replace with equivalent markup if available.

What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove or replace them by equivalent markup.

Language Tag Characters, U+E0000..U+E007F

Short description: A series of characters for expressing language tags, based on existing standards for language tags using the rules in Chapter 16 of [Unicode].

Reason for inclusion: These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were included solely for the benefit of those Internet protocols, such as ACAP, which require a standard mechanism for marking language in UTF-8 strings, and to avoid the use of other tagging schemes that rely on specific details of the encoding form used.

Problems when used in markup: These characters duplicate information that can be expressed in markup.

Problems with other uses: Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters.

Replacement markup: Replace with equivalent language markup. XML and XHTML have the xml:lang attribute. HTML has the lang attribute. These attributes follow different scoping rules than the tag characters, therefore this replacement will generally not be a simple 1:1 substitution.

What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove or replace them by equivalent markup.

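As a rough illustration of the detection step, the following Python sketch locates language tag sequences and decodes them to ASCII so that an editor could offer equivalent xml:lang or lang markup. It assumes the usual construction in which U+E0001 LANGUAGE TAG is followed by tag characters U+E0020..U+E007E that mirror ASCII at an offset of 0xE0000. The function names are hypothetical, and the sketch does not attempt to reconcile the scoping rules of tag characters with those of markup attributes.

```python
import re
from typing import List, Tuple

# U+E0001 LANGUAGE TAG followed by tag characters U+E0020..U+E007E,
# which mirror the ASCII characters U+0020..U+007E.
LANGUAGE_TAG_RE = re.compile("\U000E0001([\U000E0020-\U000E007E]*)")


def decode_tag_characters(tag_chars: str) -> str:
    """Map tag characters back to the ASCII characters they mirror."""
    return "".join(chr(ord(ch) - 0xE0000) for ch in tag_chars)


def find_language_tags(text: str) -> List[Tuple[int, str]]:
    """Return (offset, ASCII language tag) for each tag sequence in text."""
    return [(m.start(), decode_tag_characters(m.group(1)))
            for m in LANGUAGE_TAG_RE.finditer(text)]


def strip_tag_characters(text: str) -> str:
    """Filter out the whole block U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)
```

For example, the string U+E0001 U+E0066 U+E0072 followed by "bonjour" yields the tag "fr" at offset 0, and stripping the tag characters leaves "bonjour".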

Other Characters Deprecated in Unicode

Short description: The Unicode Character Database [UnicodeData] lists all characters that have been deprecated in [Unicode]. This list may grow (slowly) over time. Deprecated characters remain valid characters forever, but their use is strongly discouraged. Deprecation of characters is applied only in exceptional circumstances. It is never the result of historical changes of a writing system: characters no longer in current, modern use are retained in Unicode, as they are needed for the representation of historical documents.

Reason for inclusion: Usually, characters that are deprecated were never needed, but were inadvertently added to the Unicode Standard, perhaps based on incomplete information available at the time of encoding.

Problems when used in markup: Except where noted elsewhere in this document, their presence in markup presents the same problems as in plain text, usually that of an unnecessary duplicate encoding.

Problems with other uses: Depends on the character and the reason for its deprecation. For more information see [Unicode].

Conversion for use with markup: For deprecated characters not discussed elsewhere in this document, see the relevant descriptions of those characters in [Unicode] for information on the recommended alternatives.

What to do if detected: Unless a specific recommendation is given elsewhere, deprecated characters are not ignored; where possible, in an editing environment, a preferred alternate encoding may be substituted.

Format Characters Suitable for Use with Markup

The following table contains format characters that do not exhibit the problems discussed at the start of Section 3. Despite their apparent relation to or similarity with characters in table 3.1, they are considered suitable for use with markup. It is not acceptable for user agents to ignore the characters in table 4.1. For a description of these characters see [Unicode].

Some characters that affect text format but are suitable for use with markup

| Code points | Names/Description | Short comment |
|---|---|---|
| U+00A0 | No-break Space | Line break control |
| U+00AD | Soft Hyphen | Line break control |
| U+034F | Combining Grapheme Joiner | Used in sorting |
| U+0600 | Arabic Number Sign | Subtending mark |
| U+0601 | Arabic Sign Sanah | Subtending mark |
| U+0602 | Arabic Footnote Marker | Subtending mark |
| U+0603 | Arabic Sign Safha | Subtending mark |
| U+06DD | Arabic End of Ayah | Enclosing mark |
| U+070F | Syriac Abbreviation Mark (SAM) | Supertending mark |
| U+0F0C | Tibetan Mark Delimiter Tsheg Bstar | Non-breaking form of U+0F0B |
| U+115F..U+1160 | Hangul Jamo Fillers | Filler |
| U+180B..U+180E | Mongolian Variation Selectors (FVS1..FVS3), Mongolian Vowel Separator | Required for Mongolian |
| U+200B | Zero-width Space | Line break control |
| U+200C..U+200D | Zero-width Join Controls (ZWJ and ZWNJ) | Required for Persian and many Indic scripts, among others |
| U+200E..U+200F | Implicit Directional Marks (LRM and RLM) | LRM and RLM are allowed |
| U+2011 | Non-breaking Hyphen | Line break control |
| U+202F | Narrow No-break Space | Line break control/Mongolian |
| U+2044 | Fraction Slash | Or use markup (MathML) |
| U+2060 | Word Joiner | Use for that purpose instead of U+FEFF ZWNBSP |
| U+2061..U+2064 | Invisible Mathematical Operators | Mathematical use |
| U+2FF0..U+2FFB | Ideographic Character Description | Graphic characters (not controls) |
| U+303E | Ideographic Variation Indicator | Graphic character (not a control) |
| U+FFA0 | Halfwidth Hangul Filler | Filler, not generally required |
| U+FE00..U+FE0F | Variation Selectors | Modify graphic characters |
| U+E0100..U+E01EF | Variation Selectors | Modify graphic characters |

The following subsections briefly discuss some of the characters from the above list, particularly those that affect more than their immediately adjacent neighbors. Please see the Unicode Standard [Unicode] for full details.

Subtending Marks

Subtending marks are needed to represent a common feature in the Arabic and Syriac scripts where a mark can be placed below a range of characters, for example below a sequence of digits, to indicate a year. The Syriac abbreviation mark is placed above a series of characters, making it technically a supertending mark, and the ARABIC END OF AYAH is an enclosing mark. In the character stream, a subtending mark precedes the affected characters. The end of the affected range of characters is defined implicitly, usually by the first non-alphanumeric character.

Unlike subtending marks, the scope of combining enclosing marks, such as combining enclosing circle, is limited to the preceding default grapheme cluster. For details on grapheme clusters see Unicode Standard Annex #29: "Text Boundaries", [UAX 29].

There is currently no markup that can represent the scoping and layout functions defined by these characters, so they cannot be substituted. It is unresolved to what degree intervening markup affects the scope of these marks.

Fraction Slash

The fraction slash is used between sequences of decimal digits to form fractions. Whether the resulting fraction has a horizontal or diagonal fraction line is unspecified. The fallback is to leave the digits unchanged and display a regular slash. In order to separate a digit from a following fraction, as in 1¾, the use of U+2009 THIN SPACE is recommended.

For better control of fractions the use of [MathML] is suggested where appropriate.

Variation Selectors

A variation selector is intended to cause a specific variant form (or range of variant forms) when applied to a base character. For a variation selector to have an effect it must immediately follow its base character. Only pre-determined combinations of selected base characters and specific variation selectors have a defined effect. All other combinations are ill-formed and are to be ignored. The list of standardized combinations is documented in the Unicode Character Database, see [Variants]. In addition to the 256 generic variation selectors, there are 3 Mongolian free variation selectors. They function in all other ways like variation selectors, except they only apply to base characters from the Mongolian script. Since Mongolian, like Arabic, has positional character shapes, the variations are limited to particular shaping contexts.

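The rule that a variation selector must immediately follow its base character can be illustrated with a short Python sketch. The function names are hypothetical. The sketch assumes the generic selectors occupy U+FE00..U+FE0F and U+E0100..U+E01EF (256 in total) and the Mongolian free variation selectors occupy U+180B..U+180D. It only pairs each selector with the character before it; deciding whether a pair is a standardized combination would additionally require the list referenced as [Variants], which is not consulted here.

```python
from typing import List, Tuple


def is_generic_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def is_mongolian_fvs(ch: str) -> bool:
    return 0x180B <= ord(ch) <= 0x180D


def is_variation_selector(ch: str) -> bool:
    return is_generic_variation_selector(ch) or is_mongolian_fvs(ch)


def variation_sequences(text: str) -> List[Tuple[str, str]]:
    """Return (base, selector) pairs in which the selector immediately
    follows a character that is not itself a variation selector."""
    pairs = []
    for prev, ch in zip(text, text[1:]):
        if is_variation_selector(ch) and not is_variation_selector(prev):
            pairs.append((prev, ch))
    return pairs
```

A selector at the start of the text, or one that follows another selector, has no base character and is therefore not reported.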

Ideographic Description Characters

Ideographic Description Characters are included in the Unicode Standard as a means to indicate the composition of ideographs from a combination of pieces (terms), where each piece or term is either a Unicode character or is itself composed. Ordinarily the result would be a human-readable description of a character, perhaps one for which a font is not available. However, at least some vendors are interested in automatic conversion of these sequences into single ideographs.

Invisible Mathematical Operators

These characters are needed to convey the intended meaning of a mathematical expression to an automated parser whenever two elements are simply written next to each other. See Unicode Technical Report #25: "Unicode Support for Mathematics" [UTR25] for more details.

Line Break Controls

Most of these characters prevent line breaks adjacent to them, but ZWSP and SHY provide invisible line break opportunities. The detailed function of these characters is described in Unicode Standard Annex #14: "Line Breaking Properties" [UAX14]. While high-end applications may be able to deduce line breaking opportunities automatically, solely with the help of very generic markup or styling properties, the use of these characters currently provides the most reliable and straightforward way to control line breaking and hyphenation. Note that [HTML4.01] also uses U+00A0 NO-BREAK SPACE as a "hard space" (i.e. a space with a fixed width), something that is not part of its character semantics in [Unicode].

U+2011 NON-BREAKING HYPHEN (NBHY) is used to encode a hyphen that does not provide a line break opportunity. In several languages, the sequence <SHY, NBHY> may be used to handle special line breaking behavior for explicit hyphens, see [UAX14].

Hangul Fillers

These should not be needed except for texts that need to have a fixed number of jamos per Korean syllable block. See the description of Korean Syllable Blocks in [Unicode].

Characters with Compatibility Mappings

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on"; in other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to a combination of their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately. This section provides guidance on when and how to apply compatibility mappings in the case of importing text from non-XML (non-marked-up) sources. The section is organized by the "compatibility tag" associated with each compatibility mapping.

Overview

The following table gives an overview of the various compatibility characters, organized by "compatibility tag". The first column, Tag value, contains the value of the "compatibility tag" from the Unicode Character Database [UnicodeData]. Although these tags use "<" and ">", they do not appear as such in markup and should not be confused with XML tags. Code range indicates a further breakdown by code points. Action summarizes the recommended action to be taken whenever markup is first applied to non-XML text. Each entry indicates whether the characters can be substituted using the compatibility equivalent according to Normalization Form KC of [UAX 15], can be replaced by equivalent markup where available, or should be retained. For some cases, instead of or in addition to markup, style information [CSS] is needed. Description and usage provides additional information. Sections 5.3 through 5.6 provide additional information for some of these sets of compatibility characters, including detailed recommended actions.

Characters with compatibility mappings

| Tag value | Code range | Action | Description and usage |
|---|---|---|---|
| <circled> | all | retain | Circled letters and digits used for list item markers, and in running text |
| <compat> | 2002..200A | retain | Fixed width spaces |
| <compat> | 2100..2101 | retain | Variant letter forms that are used as symbols |
| <compat> | 2105..2106 | retain | Variant letter forms that are used as symbols |
| <compat> | 2121, 213B | retain | For use as single code point in vertical layout |
| <compat> | 2160..217F | retain, or use list item marker style, or normalize | For use as single code point in vertical layout, or as list item marker |
| <compat> | 2474..249B | retain, or use list item marker style, or normalize | Parenthesized or dotted number used as list item marker |
| <compat> | 249C..24B5 | retain, or use list item marker style, or normalize | Parenthesized letters used as list item markers |
| <compat> | 3131..318E | retain | Compatibility Hangul Jamo. These do not conjoin |
| <compat> | 3200..3229 | retain, or use list item marker style, or normalize | Parenthesized characters used as list item markers |
| <compat> | 322A..3243 | retain | Parenthesized characters used as symbols in vertical layout |
| <compat> | 32C0..32CB | retain | String used as single code point in vertical layout |
| <compat> | all other | retain | Maintain, semantic distinctions apply |
| <final> | all | normalize | Arabic Presentation forms |
| <font> | all | retain | Variant letter forms that are used as symbols |
| <fraction> | all | normalize | As long as fraction slash is supported! |
| <initial> | all | normalize | Arabic Presentation forms |
| <isolated> | all | normalize | Arabic Presentation forms |
| <medial> | all | normalize | Arabic Presentation forms |
| <narrow> | all | retain | Half-width characters |
| <noBreak> | all | retain | The compatibility mapping merely indicates the equivalent breaking character. The noBreak distinction must be preserved |
| <small> | all | retain | Precise usage unknown. Maintain, but do not generate |
| <square> | 3300..3357 | retain | Single display cell cluster containing multiple lines of kana for vertical layout |
| <square> | 3358..337D | retain | For use as single code point in vertical layout |
| <square> | 33E0..33FE | retain | For use as single code point in vertical layout |
| <square> | all other | retain | Variant letter form used as symbol in vertical layout |
| <sub> | 2080..208E | retain, or use markup | Subscript digits 0-9, as well as minus, plus, equal and parens |
| <sub> | all other | retain | Subscript characters, usually used as modifier letters in phonetic notation |
| <super> | 00B2..00B3, 00B9, 2070, 2074..207E | retain, or use markup | Superscript digits 0-9, as well as minus, plus, equal and parens |
| <super> | all other | retain | Superscript characters, usually used as modifier letters in phonetic notation |
| <vertical> | all | normalize | East Asian Presentation forms |
| <wide> | all | retain | Full-width characters |

Some symbols used in vertical layout exist as single code points in legacy systems, but can also be composed on the fly by more advanced display engines. There are currently no style properties that could be used to express squared Kana clusters (kumimoji) or horizontal in vertical writing mode (tate-chu-yoko).


Generating New Text

Presentation forms and characters for which adequate representation exists as marked up text should never be entered into new data. Many of the characters with a <font> tag are however suitable for new data, as long as they are used in the manner they are intended, that is as symbols, with definite semantic differentiation between the different forms. The largest set of these characters exists to carry essential semantic distinctions in mathematical notation, where any loss of markup during text export would compromise the meaning of the text. Most of the characters with a <super> or <sub> tag have been encoded for use in phonetic or phonemic transcriptions, where they act as ordinary letters and the use of style markup is therefore deemed inappropriate. However, it is inappropriate to use any of these classes of characters to create the appearance of styled text runs.

For example, to write hello (in italic style), one should use the equivalent of <span style='font-style: italic;'>hello</span>, and not the sequence of Unicode characters ℎℯℓℓℴ (U+210E PLANCK CONSTANT, U+212F SCRIPT SMALL E, U+2113 SCRIPT SMALL L, U+2113 SCRIPT SMALL L, U+2134 SCRIPT SMALL O). Conversely, to indicate Planck's constant one should use ℎ (U+210E PLANCK CONSTANT) and not something such as <span style='font-style: italic;'>h</span>.

When style is applied across entire words, sentences or paragraphs, the use of markup is preferred. When style is applied to individual letters, especially to letters inside a word, giving them a particular interpretation, the use of character codes is preferred. See also Section 5.6.

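The distinction between a single symbol and a styled text run suggests a simple heuristic, sketched below in Python: flag runs of two or more consecutive characters that carry a <font> compatibility tag, since a lone ℎ is plausibly a symbol while ℎℯℓℓℴ is probably an attempt at styling. The function names and the threshold `min_len` are assumptions made for illustration; a real tool would leave the final decision to the user.

```python
import unicodedata
from itertools import groupby
from typing import List


def has_font_mapping(ch: str) -> bool:
    """True if ch has a compatibility mapping tagged <font>."""
    return unicodedata.decomposition(ch).startswith("<font>")


def font_variant_runs(text: str, min_len: int = 2) -> List[str]:
    """Return runs of at least min_len consecutive <font>-tagged characters."""
    runs = []
    for is_font, group in groupby(text, key=has_font_mapping):
        run = "".join(group)
        if is_font and len(run) >= min_len:
            runs.append(run)
    return runs
```

Applying Normalization Form KC to a flagged run, for example with `unicodedata.normalize("NFKC", run)`, yields the plain letters ("hello" for the run above), which can then be wrapped in style markup.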

List Item Marker Characters

Short description: Characters with a <circled> tag or characters with a <compat> tag and a compatibility mapping to a parenthesized string.

Reason for inclusion: They are most frequently used for marking enumerated list items, but the characters with a <circled> tag often occur as dingbats or footnote markers in tables. The same characters are used in regular text when citing an item from a corresponding ordered list.

Problems when used in markup: These characters do not cause undue interaction with markup.

Problems with other uses: None.

Replacement markup: (in text use) These characters are often used in running text; sometimes, but not exclusively, in situations where the text is to be associated with an item from a nearby numbered list. Replacement markup may not be available, and the support for such markup is much more limited today than was anticipated when this document was first written.

(list item style) When generating marked-up text, these characters occur only internally to the user agent when list item styles are rendered. When marking up plain text data they could be converted to suitable list item styles, if such use can be properly inferred. The default recommendation is to retain the original character.

(characters with compatibility mappings of the form "(n)" or "n." or roman numerals) Unlike circled characters, these could be rendered by sequences of regular characters. Using a list item marker style would in theory allow the support of longer lists (the Unicode characters are limited to the set (1) to (20) and "1." to "20."). Using regular character sequences would also allow the use of fonts that match the text of the list.

What to do if detected: No action needs to be taken by browsers. When received in an editing context, substitution of a list item marker style may be appropriate. However, the same characters are very often used as dingbat-like symbols in tables, or may appear in general text, whether or not referring to an item from a list. Therefore the user must have the choice of whether to replace the character.

Fractions

Short description: Single character fractions such as ½ or ¼.

Reason for inclusion: Subsets of these occur in practically all legacy character sets.

Problems when used in markup: The character repertoire is limited to a few common fractions. When used with more general methods of generating fractions such as MathML [MathML] the usual problem of dual representation arises.

Problems with other uses: Other than normalization issues, these characters present no undue problems in plain text. Where fraction slash is supported, these can be expressed by substituting their compatibility mappings.

Replacement markup: MathML can represent fractions unambiguously. When using fraction slash, care must be taken such that values like 3½ do not turn into 31/2 (=15.5).

What to do if detected: No action needs to be taken by browsers or editors, except when converting plain text to MathML.

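The 3½ pitfall can be reproduced, and avoided, with a few lines of Python. The sketch below is an illustration only: the function name `expand_fractions` is hypothetical, and the choice of U+2009 THIN SPACE as separator follows the recommendation given for the fraction slash in Section 4. Plain Normalization Form KC turns "3½" into "31⁄2"; the function instead inserts a thin space whenever a single-character fraction directly follows a digit.

```python
import unicodedata

THIN_SPACE = "\u2009"


def expand_fractions(text: str) -> str:
    """Replace single-character fractions by digit, fraction slash, digit,
    separating them from a directly preceding digit with a thin space."""
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            if out and out[-1][-1].isdigit():
                out.append(THIN_SPACE)
            # The compatibility mapping uses U+2044 FRACTION SLASH.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

With this approach "3½" becomes "3", thin space, "1⁄2", so the integer part stays visually and textually distinct from the numerator.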

Squared or Horizontal

Short description: Characters that are symbols composed of groups of typically kana or Latin letters, digits plus slash for use in a single display cell in vertical display of text.

Reason for inclusion: Many existing character sets contain these as precomposed characters since for simple implementations this is the only way to support the common use of providing metric units and other abbreviations in a single character cell for vertical text layout.

Problems when used in markup: Proposed markup, including CSS styling, would be able to express an unbounded set of these abbreviations, obviating the need to catalogue these in the character encoding standard and making them more directly accessible to text-based processing, for example searching.

Problems with other uses: The repertoire of these legacy characters is limited; many more combinations are in actual use than are accounted for in character sets. Pre-composed symbols do not make their text content available to search engines. They also require re-encoding for text laid out horizontally.

Replacement markup: None available.

What to do if detected: No action required. (Subject to change pending the outcome of current proposals.)

Superscripts and Subscripts

Short description: Mainly super and subscript digits, but also signs, parentheses and a large number of letters.

Reason for inclusion: Super and subscripted letters and digits are quite common in some forms of phonetic or phonemic transcriptions, where the use of styles is both awkward and prone to data integrity issues when exported to plain text. For super or subscripted letters in phonetic transcription in particular, a change from superscript or subscript to regular style would alter the meaning. Note that such use in transcription is not limited to letters: superscripted small digits are often used to indicate tone. When used for these purposes, these characters should be retained and markup should not be used.

A few super and subscript characters, primarily the digits, also occur in many legacy character sets, including Latin-1. Their use in pure plain text is common for databases, e.g. including metric units for part descriptions (viz. cm²) or for (usually simplified) formulae as occur in titles of scientific publications.

When used in mathematical context (MathML) it is recommended to consistently use style markup for superscripts and subscripts. This is because mathematical layout allows not just individual symbols, but entire expressions to be superscripted or subscripted in a regular, nested manner.

Problems when used in markup: Mixing direct use of these characters with the use of style markup provides multiple representations of the same text, leading to potentially different treatment by search and display engines.

However, when super and subscripts are to reflect semantic distinctions, it is easier to work with these meanings encoded in text rather than markup, for example, in phonetic or phonemic transcription. Otherwise, they would require markup in the middle of words, and they may also be inadvertently changed to normal style text when exporting to plain text. This applies to the majority of super and subscripted characters in Unicode. On the other hand, some user agents may support certain superscripted or subscripted characters only when used as marked up text, for example because of lack of font support for them.

Problems with other uses: None.

Replacement markup: Unless used as letters, <xhtml:sup> and <xhtml:sub> or <mathml:msup> and <mathml:msub> may be used.

What to do if detected: Both representations (with or without style markup) should be equivalent for search purposes. Input methods for mathematical texts might enforce the use of styles. If superscript characters are encountered during display of mathematical formulae, it is recommended that they be displayed in a manner indistinguishable from that achieved by using regular characters with corresponding style markup.

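For the digit and sign characters that are not used as letters, the replacement by markup can be sketched as follows in Python. The code ranges are taken from the overview table in Section 5.1 (00B2..00B3, 00B9, 2070 and 2074..207E for superscripts, 2080..208E for subscripts). The function names are hypothetical, unprefixed `sup` and `sub` element names are used for brevity, and the input is assumed to be already escaped for use in markup. The sketch must not be applied to phonetic or phonemic transcriptions, where the characters are to be retained.

```python
import unicodedata
from itertools import groupby
from typing import Optional

SUPERSCRIPTS = {0x00B2, 0x00B3, 0x00B9, 0x2070} | set(range(0x2074, 0x207F))
SUBSCRIPTS = set(range(0x2080, 0x208F))


def script_kind(ch: str) -> Optional[str]:
    """Return 'sup' or 'sub' for the digit/sign characters, else None."""
    cp = ord(ch)
    if cp in SUPERSCRIPTS:
        return "sup"
    if cp in SUBSCRIPTS:
        return "sub"
    return None


def to_markup(text: str) -> str:
    """Wrap each run of superscript or subscript digits and signs in markup,
    replacing the characters by their compatibility equivalents."""
    out = []
    for kind, group in groupby(text, key=script_kind):
        run = "".join(group)
        if kind is None:
            out.append(run)
        else:
            plain = unicodedata.normalize("NFKC", run)
            out.append("<%s>%s</%s>" % (kind, plain, kind))
    return "".join(out)
```

Grouping consecutive characters keeps a multi-digit exponent such as ¹⁰ inside a single element instead of producing one element per digit.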

Other Characters Marked <compat>

Short description: The <compat> label was given to a set of compatibility characters whose further classification was not settled at the time the standard was created. The largest components are list item marker characters.

Reason for inclusion: These characters occur in many legacy character sets.

Problems when used in markup: None. There usually is no equivalent markup.

Problems with other uses: None.

Replacement markup: None.

What to do if detected: No action required.

Noncharacters

The Unicode Standard defines 66 non-character code points, or noncharacters. These are the last two positions on each of the 17 planes, that is, all code points ending in FFFE or FFFF, as well as the 32 code points from U+FDD0 to U+FDEF. Applications are free to use any of these code points internally but should never attempt to interchange them. In effect, noncharacters can be thought of as application-internal private-use code points.

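The definition translates directly into a test on the code point value. The following Python sketch uses a hypothetical function name and checks the two conditions stated above: membership in U+FDD0..U+FDEF, or low 16 bits equal to FFFE or FFFF.

```python
def is_noncharacter(code_point: int) -> bool:
    """True for the 32 code points U+FDD0..U+FDEF and for the last two
    code points of each of the 17 planes (U+nFFFE and U+nFFFF)."""
    return (0xFDD0 <= code_point <= 0xFDEF
            or (code_point & 0xFFFE) == 0xFFFE)
```

Counting the matches over the whole code space, `sum(is_noncharacter(cp) for cp in range(0x110000))`, gives 32 + 17 × 2 = 66, in agreement with the figure above. A process that is about to interchange text could use such a test to reject or strip these code points.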

White Space

This section presents common issues with white space characters in markup languages, mostly based on their difference in function as part of the structure of the markup source (syntactic white space) on the one hand and as part of the document content on the other hand.

The set of characters in the Unicode standard that have the property "White_Space" (see 'White Space' in the [UCD]) is quite large. It includes white space characters with different line breaking properties, different ligating properties, and different widths. It is appropriate to use these characters as part of markup content for their very specific purpose. It is preferable to place them in the markup source so that they are surrounded by ordinary characters rather than, for example, line breaks. The set of white space characters defined by typical markup language specifications is a subset of the characters that are considered white space by [Unicode].

Each markup language defines the set of characters that it accepts as part of the markup syntax; this is usually a very small set. The XML [XML1.0] and [XML1.1] specifications define white space as a combination of one or more of the following characters: U+0020 SPACE, carriage return (U+000D), line feed (U+000A), or tab (U+0009). [HTML4.01] adds to these the form feed character (U+000C), but that character cannot be used in any XHTML version.

In addition, markup languages may use conventions for converting or removing some kinds of white space. XML processors replace some combinations of end-of-line characters by a single line feed character. [XML1.0] normalizes any two-character sequence U+000D U+000A, and any U+000D not followed by U+000A, to a single U+000A. [XML1.1] also normalizes NEL (U+0085) and U+2028 LINE SEPARATOR, but U+2029 PARAGRAPH SEPARATOR is not treated that way. Additional processing of white space also occurs for attribute values before they are handed to an application: line breaks are replaced by spaces and, for attribute types other than CDATA, leading and trailing spaces are removed and sequences of spaces are replaced by a single space.
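
The following Python sketch imitates these normalization steps on plain strings. It is not how any particular XML processor is implemented, and the function names are made up for this example. The attribute function models only the non-CDATA case and, as XML does, also maps tabs to spaces.

```python
import re


def normalize_eol_xml10(text: str) -> str:
    """XML 1.0: CR LF, and any CR not followed by LF, become a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# XML 1.1 also covers CR NEL, a lone NEL and LINE SEPARATOR.
# U+2029 PARAGRAPH SEPARATOR is deliberately absent.
_XML11_EOL = re.compile("\r\n|\r\u0085|\r|\u0085|\u2028")


def normalize_eol_xml11(text: str) -> str:
    """XML 1.1 end-of-line handling."""
    return _XML11_EOL.sub("\n", text)


def normalize_tokenized_attribute(value: str) -> str:
    """Approximate attribute-value normalization for non-CDATA attributes."""
    value = normalize_eol_xml10(value)
    # Line breaks (and tabs) are replaced by spaces.
    value = value.replace("\n", " ").replace("\t", " ")
    # Leading and trailing spaces are dropped, runs of spaces collapse to one.
    return " ".join(part for part in value.split(" ") if part)
```

For instance, `normalize_tokenized_attribute("  one\r\n two   three ")` yields `"one two three"`, while `normalize_eol_xml11` leaves a U+2029 in the text untouched.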

In XML, white space is purely syntactic inside tags (for example, to separate the element name from attributes) and between elements in element content models (which are typical for data-oriented applications). White space in element content models is used to lay out the markup source, using line breaks and indentation, to improve readability. The same use of white space is possible in many cases in mixed content (typical for text-oriented applications).

Because XML is used for a very wide range of applications, an XML processor passes all white space to the application after the processing steps mentioned above. Some XML applications, such as [XHTML], may have their own rules for processing white space characters. Also, applications and software that transform XML (e.g. [XSLT]) have specific conventions for handling white space, and specific ways to control this behavior. To use white space characters appropriately, readers are advised to examine all the standards and software involved.
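
A short illustration in Python, using the standard library's `xml.etree.ElementTree`: the indentation between elements reaches the application as text, and it is the application that decides what to do with it. The helper `strip_layout_white_space` is a hypothetical application-level policy that assumes element-only content; it is not part of XML or of ElementTree.

```python
import xml.etree.ElementTree as ET

SOURCE = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"
root = ET.fromstring(SOURCE)

# The layout white space is delivered to the application unchanged:
# root.text is "\n  ", root[0].tail is "\n  ", root[1].tail is "\n".


def strip_layout_white_space(elem):
    """Drop white-space-only text between child elements.

    Assumes elem has element-only content; in mixed content the same
    white space may be significant and must not be removed blindly.
    """
    if len(elem):
        if elem.text is not None and not elem.text.strip(" \t\r\n"):
            elem.text = None
        for child in elem:
            if child.tail is not None and not child.tail.strip(" \t\r\n"):
                child.tail = None
            strip_layout_white_space(child)
```

Only the four XML white space characters are stripped here; other Unicode white space, if present, is kept as content.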

If the characters U+2028 and U+2029 appear in text, they may be treated as zero-width characters without semantic meaning (see Section 3.2).

Converting Newline Functions to White Space

White space that is not purely syntactic, including control codes that define a newline function (see Section 5.8, Newline Guidelines, in [Unicode]), can be handled in three main ways.

  1. For data-oriented applications, the textual content of elements is treated according to the needs of the data type in question. In many cases, processing by the application includes aspects similar to those of the processing of attribute values by the XML parser itself. For some types of data, in particular small data items, some applications may also simply prohibit the use of white space.
  2. For running text in text-oriented applications, reflowing is used, i.e. the line breaks in the markup source are removed and the text is reflowed into lines whose length is determined by the output medium and styling properties. In the context of Unicode, this reflowing process requires care; it is described in more detail below.
  3. For preformatted text, such as program source code, line breaks must be preserved. Text-oriented applications usually contain special markup for preformatted text, e.g. <xhtml:pre>. XML itself defines an xml:space attribute that applications may use for a similar purpose.
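
As a rough sketch of how an application might distinguish the preformatted case from the others, the following Python code honours `xml:space="preserve"` and otherwise collapses syntactic white space. Treating every other element as collapsible is an assumption of this example; real applications define their own rules. The names `collapse` and `element_text` are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Only the four XML white space characters count as syntactic white space.
_XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text: str) -> str:
    """Replace runs of XML white space by one space and trim the ends."""
    return _XML_WS_RUN.sub(" ", text).strip(" ")


def element_text(elem, inherited: str = "default") -> str:
    """Return the element's text, preserved or collapsed per xml:space."""
    mode = elem.get(XML_SPACE, inherited)
    text = "".join(elem.itertext())
    return text if mode == "preserve" else collapse(text)


doc = ET.fromstring(
    "<doc><p>\n  some\n  running text\n</p>"
    '<code xml:space="preserve">if x:\n    y()\n</code></doc>'
)
running = element_text(doc[0])
preformatted = element_text(doc[1])
```

Here `running` comes out as a single line with single spaces, while `preformatted` keeps its line breaks and indentation exactly as written in the source.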

When reflowing, line breaks and adjacent white space can be treated as a space, removed, collapsed with adjacent control characters of the same type, or treated as a zero width space. Which choice is appropriate depends on the script of the surrounding text. The assumption is that line breaks and adjacent white space (in particular following white space, used for indentation) were added to make the markup source more readable, in particular to make each line fit on a line of a plain text editor. For scripts that use spaces, line breaks will have been inserted where there originally was a space; treating them as spaces therefore preserves the intended separation between words. For scripts that do not use spaces, such as ideographic scripts or certain Southeast Asian scripts such as Thai, line feeds should be removed or replaced by U+200B ZERO WIDTH SPACE. The choice of treatment can depend on the script value of the characters preceding and following the line feed character, assuming these characters belong to the same run of text.
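
The script-dependent treatment can be sketched in Python as follows. Because the standard library does not expose the Unicode Script property, the example approximates "scripts that do not use spaces" with a few hard-coded code point ranges; both that table and the decision to fall back to a space when the neighbouring characters disagree are assumptions of this sketch, not requirements of any specification.

```python
import re

# Crude stand-in for the Script property (assumption, far from complete).
_NO_SPACE_RANGES = (
    (0x0E00, 0x0E7F),   # Thai
    (0x3040, 0x30FF),   # Hiragana and Katakana
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
)


def _no_space_script(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _NO_SPACE_RANGES)


# A line break together with adjacent syntactic white space (indentation).
_BREAK = re.compile(r"[ \t]*(?:\r\n|\r|\n)[ \t\r\n]*")


def reflow(text: str, joiner: str = "") -> str:
    """Replace source line breaks by a space or, between characters of
    scripts written without spaces, by joiner ("" or "\u200b")."""

    def replace(match):
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before and after and _no_space_script(before) and _no_space_script(after):
            return joiner
        return " "

    return _BREAK.sub(replace, text)
```

With this sketch, a break between two Latin words becomes a space, while a break inside Japanese or Thai text disappears or, when called with `joiner="\u200b"`, becomes a zero width space that still allows a line break at that position.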

The Unicode Standard [Unicode] specifies that the zero width space is considered a valid line-break point, that two characters with a zero width space in between are placed with no space between them if they are on the same line, and that no additional glyph area is created at the line break if they are placed on two lines.

The details of reflowing are the responsibility of the various markup applications (e.g. [XHTML]). However, there is a tendency to move this functionality from markup applications to styling, so that it can be shared across applications.

Authors should be aware that the script-specific treatment of line breaks described above is not yet available in all implementations (e.g. browsers) that reflow text. For scripts that do not use white space to separate words, it may therefore still be advisable not to split long lines.

Editing tools should try to support the user in the appropriate use of white space. Some white space characters cannot easily be entered via a keyboard, but others, e.g. U+3000 IDEOGRAPHIC SPACE, can. Editing tools should try to make sure that only line breaks and white space characters that are accepted as syntactic white space by the relevant markup language are used to improve markup source readability.

While the styling possibilities provided by CSS and its implementations have not reached the level of professional typesetting systems, they offer a wide range of ways to control the layout and spacing of text. A very simple example is text centering, which in pure plain text would have been done by inserting an appropriate number of spaces on each line.

References

[Charmod]
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex Texin, Eds., Character Model for the World Wide Web 1.0: Fundamentals, W3C Recommendation, 15-February-2005, <https://www.w3.org/TR/2005/REC-charmod-20050215/>.
[CharReq]
Martin J. Dürst, Requirements for String Identity and Character Indexing Definitions for the WWW, W3C Working Draft, 10-July-1998, <https://www.w3.org/TR/WD-charreq>.
[CSS]
For information on cascading style sheet specifications, see <https://www.w3.org/Style/CSS/>.
[Feedback]
Reporting Errors and Requesting Information Online to the Unicode Consortium, <https://www.unicode.org/reporting.html>.
[HTML4.01]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.01 Specification, W3C Recommendation, 18-Dec-1997 (revised on 24-Dec-1999), <https://www.w3.org/TR/1999/REC-html401-19991224/>.
[HTML 4.0 - 8.2]
Section 8.2 of [HTML4.0], Specifying the direction of text and tables: the dir attribute, <https://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.2>.
[MathML]
David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Eds., Mathematical Markup Language (MathML) Version 2.0 (Second Edition), W3C Recommendation, 21-Oct-2003, <https://www.w3.org/TR/2003/REC-MathML2-20031021/>.
[Namespace]
Tim Bray, Dave Hollander, Andrew Layman, Eds., Namespaces in XML (Second Edition), W3C Recommendation, 16-Aug-2006, <https://www.w3.org/TR/2006/REC-xml-names-20060816/>.
[Ruby]
Marcin Sawicki, Michel Suignard, Masayasu Ishikawa, Martin Dürst, Tex Texin, Eds., Ruby Annotation, W3C Recommendation, 31-May-2001, <https://www.w3.org/TR/2001/REC-ruby-20010531/>.
[UAX 9]
Mark Davis, Unicode Standard Annex #9, The Bidirectional Algorithm, <https://www.unicode.org/reports/tr9/>.
[UAX14]
Asmus Freytag, Unicode Standard Annex #14, Line Breaking Properties, <https://www.unicode.org/reports/tr14/>.
[UAX 15]
Mark Davis, Martin Dürst, Unicode Standard Annex #15, Unicode Normalization Forms, <https://www.unicode.org/reports/tr15/>.
[UAX 29]
Mark Davis, Unicode Standard Annex #29, Text Boundaries, <https://www.unicode.org/reports/tr29/>.
[UCD]
Unicode Standard Annex #44, Unicode Character Database, <https://www.unicode.org/reports/tr44/>.
[Unicode]
The Unicode Consortium, The Unicode Standard, <https://www.unicode.org/versions/latest/>.
[Unicode3.2]
Unicode Standard Annex #28, Unicode 3.2, The Unicode Consortium, 2002.
[UnicodeData]
Unicode Character Database, <https://www.unicode.org/Public/UNIDATA/UCD.html>.
[UnicodeVersions]
Versions of the Unicode Standard, <https://www.unicode.org/standard/versions/>.
[UTR25]
Asmus Freytag, Barbara Beeton, Murray Sargent, Unicode Technical Report #25, Unicode Support for Mathematics, <https://www.unicode.org/reports/tr25/>.
[Variants]
Standardized Variants, <https://www.unicode.org/Public/UNIDATA/StandardizedVariants.html>.
[XHTML]
Steven Pemberton, et al., Eds., XHTML™ 1.0: The Extensible HyperText Markup Language - A Reformulation of HTML 4.0 in XML 1.0, W3C Recommendation, 01-Aug-2002, <https://www.w3.org/TR/2002/REC-xhtml1-20020801/>.
[XML1.0]
Tim Bray, Jean Paoli, Eve Maler, C. M. Sperberg-McQueen, François Yergeau, Eds., Extensible Markup Language (XML) 1.0 (Fourth Edition), W3C Recommendation, 16-August-2006, <https://www.w3.org/TR/2006/REC-xml-20060816/>.
[XSLT]
Michael Kay, Ed., XSL Transformations (XSLT) Version 2.0, W3C Recommendation, 23-January-2007, <https://www.w3.org/TR/2007/REC-xslt20-20070123/>.
[XML1.1]
Jean Paoli, Eve Maler, Tim Bray, C. M. Sperberg-McQueen, François Yergeau, John Cowan, Eds., Extensible Markup Language (XML) 1.1 (Second Edition), W3C Recommendation, 16-August-2006, <https://www.w3.org/TR/2006/REC-xml11-20060816/>.
[XML Schema]
Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn, Eds., XML Schema Part 1: Structures Second Edition, W3C Recommendation, 28-October-2004, <https://www.w3.org/TR/2004/REC-xmlschema-1-20041028/>.

Acknowledgements

Mark Davis and Hideki Hiura contributed to the early drafts. Yukka Korpela and Felix Sasaki provided input to the current document.

Changes Since the Last Published Version

The following changes have been made since the Working Group Note of 2013-01-24:

See the GitHub commit log for more details and diffs.
