├── w3c.json ├── README.md ├── local.css └── index.html /w3c.json: -------------------------------------------------------------------------------- 1 | { 2 | "group": 32113 3 | , "contacts": "rishida" 4 | , "repo-type": "note" 5 | , "policy": "restricted" 6 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # unicode-xml 2 | Unicode in XML and other Markup Languages 3 | 4 | 5 | See https://w3c.github.io/unicode-xml/ to view the document. 6 | -------------------------------------------------------------------------------- /local.css: -------------------------------------------------------------------------------- 1 | h2 { 2 | margin-top: 3em; 3 | margin-bottom: 0em; 4 | } 5 | 6 | .head h2, #abstract h2, #sotd h2 { 7 | margin-top: 0; 8 | } 9 | 10 | h3 { 11 | margin-top: 3em; 12 | } 13 | 14 | .deprecation-box h3 { 15 | margin: 0; 16 | font-weight: bold; 17 | font-size: 120%; 18 | color: black; 19 | } 20 | h4 { 21 | font-size: 100%; 22 | font-weight: normal; 23 | color: #005a9c; 24 | margin-top: 2em; 25 | } 26 | 27 | .leadin { 28 | font-weight: bold; 29 | } 30 | 31 | ins { 32 | background-color: #99FF99; 33 | text-decoration: none; 34 | } 35 | 36 | del { 37 | display: inline; 38 | color: silver; 39 | } 40 | 41 | figure { 42 | margin-bottom: 2em; 43 | text-align: center; 44 | } 45 | 46 | figcaption { 47 | text-align: center; 48 | margin: 0.5em 2em; 49 | font-style: italic; 50 | font-size: 90%; 51 | } 52 | 53 | .figno:after { 54 | content: ':\00A0 '; 55 | } 56 | 57 | a.termref:link { 58 | color:#C60; 59 | text-decoration:none; 60 | border-bottom: 1px dotted #FC0; 61 | } 62 | 63 | a.termref:hover { 64 | color:#C60; 65 | text-decoration:none; 66 | border-bottom: 1px dotted #FC0; 67 | } 68 | 69 | a.termref:visited { 70 | color:#C60; 71 | text-decoration:none; 72 | border-bottom: 1px dotted #FC0; 73 | } 74 | 75 | a.termref:active { 76 | 
color:#C60; 77 | text-decoration:none; 78 | border-bottom: 1px dotted #FC0; 79 | } 80 | 81 | .qterm:before, .qchar:before { content: "‘"; } 82 | 83 | .qterm:after, .qchar:after { content: "’"; } 84 | 85 | .quote:before { content: '“'; } 86 | 87 | .quote:after { content: '”'; } 88 | 89 | code { 90 | color: #A52A2A; 91 | font-family: Consolas, "Andale Mono", "Lucida Console", "Lucida Sans Typewriter", Monaco, "Courier New", monospace; 92 | font-size: 100%; 93 | } 94 | 95 | samp, kbd { 96 | font-family: Consolas, "Andale Mono", "Lucida Console", "Lucida Sans Typewriter", Monaco, "Courier New", monospace; 97 | font-size: 100%; 98 | } 99 | 100 | .uname { 101 | text-transform: uppercase; 102 | font-size: 85%; 103 | letter-spacing:0.03em; 104 | } 105 | 106 | .lettername { 107 | font-style: italic; 108 | } 109 | 110 | .tab-format { 111 | margin-left: 10%; 112 | } 113 | 114 | table td { 115 | border: 1px solid #ddd; 116 | padding: 10px; 117 | } 118 | 119 | .rtlTermCell { 120 | direction: rtl; 121 | text-align: right; 122 | } 123 | 124 | .exampleList { 125 | float: left; 126 | margin:10px; 127 | } 128 | -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 |
4 |This document contains guidelines on the use of the Unicode Standard in 103 | conjunction with markup languages such as XML.
111 |Many of the materials in this document are stale and out of date; 114 | the W3C is maintaining this version solely as a historical reference. 115 | This document was originally produced as a joint publication between 116 | the W3C and the Unicode 117 | Consortium. In 2016, Unicode withdrew publication as a Unicode 118 | Technical Report.
119 | 120 |Sending comments on this 124 | document
If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.
128 |To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL for the dated version of the document.
129 |The Unicode Standard [Unicode] defines the 136 | universal character set. Its primary goal is to provide an unambiguous 137 | encoding of the content of plain text, ultimately covering all languages in 138 | the world, but also major text-based notational systems for science, 139 | technology, music, and scholarship.
140 |Currently in its sixth major version, Unicode 141 | contains a large number of characters covering most of the currently used 142 | scripts in the world. It also contains additional characters for 143 | interoperability with older character encodings, and characters with 144 | control-like functions included primarily for reasons of providing 145 | unambiguous interpretation of plain text. Unicode provides specifications for 146 | use of all of these characters.
For document and data interchange, the Internet and the World Wide Web make extensive use of marked-up text such as HTML and XML. In many instances, markup provides features that are the same as, or essentially similar to, those provided by format characters in the Unicode Standard for use in plain text. Compatibility characters are another special character category provided by Unicode. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language. Formatting characters are discussed in Section 3, Characters not Suitable for Use With Markup and Section 4, Format Characters Suitable for Use With Markup, compatibility characters in Section 5, Characters with Compatibility Mappings. Section 6 briefly discusses noncharacters, and Section 7 is devoted to white space.
The interaction of character encoding and methods of escaping characters in markup is discussed in the Character Model for the World Wide Web [Charmod].
165 |The issues of using Unicode characters with marked-up text depend to some 166 | degree on the rules of the markup language in question and the set of 167 | elements it contains. In a narrow sense, this document concerns itself only 168 | with XML, and to some extent HTML. However, much of the general information 169 | presented here should be useful in a broader context, including some page 170 | layout languages.
171 |Many of the recommendations of this 172 | report depend on the availability of particular markup or styling. Where 173 | possible, appropriate DTDs or Schemas should be used or designed to make 174 | such markup or styling available, or the DTDs or Schemas used should be 175 | appropriately extended. The current version of this document makes no 176 | specific recommendations for the design of DTDs or Schemas, or for the use 177 | of particular DTDs or Schemas, but the information presented here may be 178 | useful to designers of DTDs and Schemas, and to people selecting DTDs or 179 | Schemas for their applications.
180 |The recommendations of this report do not apply in the case 181 | of XML used for blind data transport and similar cases.
This report uses XML [XML] as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'xhtml:' indicates that an element is taken from [XHTML]. This means that the examples containing the namespace prefix 'xhtml:' are assumed to include a namespace declaration of xmlns:xhtml="...".
Characters are denoted using the notation used in the Unicode Standard, that is, an optional U+ followed by their hexadecimal number, using at least 4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML this could be expressed as "&#x1234;" or "&#x10FFFD;".
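The two notations described above can be produced mechanically from a character. The following is a minimal Python sketch; the helper names `u_notation` and `ncr` are hypothetical and chosen only for illustration.

```python
def u_notation(ch: str) -> str:
    """Return the Unicode Standard notation: U+ and at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


def ncr(ch: str) -> str:
    """Return a hexadecimal numeric character reference for XML or HTML."""
    return f"&#x{ord(ch):X};"


for ch in ("\u1234", "\U0010FFFD"):
    # Prints the code point notation next to the markup escape.
    print(u_notation(ch), ncr(ch))
```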
196 |There are several general points to consider when looking at the 203 | interaction between character encoding and markup.
204 |Encoding text as a sequence of characters without further 215 | information leads to a linear sequence, commonly called plain text. Character 216 | follows character, without any particular structure. Markup, on the other 217 | hand, defines a hierarchical structure for the text or data. In the case of 218 | XML and most other, similar markup languages, the markup defines a tree 219 | structure. While this tree structure is linearized for transmission in the 220 | XML document, once the document has been parsed, the tree is available 221 | directly.
222 |Operations that are easy to perform on trees are often 223 | difficult to perform on linear sequences and vice versa. By separating 224 | functionality between character encoding and markup appropriately, the 225 | architecture becomes simpler, more powerful and longer-lasting.
226 |In particular, operations on hierarchical structures can 227 | easily make sure that information is kept in context. Attributes assigned to 228 | parts of a document are moved together with the associated part of the 229 | document. Assigning an attribute to a part of a document limits the scope of 230 | the attribute to that part of the document. Performing the same operations on 231 | linear sequences of characters using control codes to set attributes and to 232 | delimit their scope requires much more work and is error prone. Locating the 233 | start or end of a span of text of the same attribute requires scanning 234 | backwards and forwards for the embedded delimiter or control code. Moving or 235 | editing text often results in mismatched control codes, so that an attribute 236 | might suddenly apply to text it was not intended for.
237 |When markup is not available, plain text may require control 242 | characters. This is usually the case where plain text must contain some 243 | scoping or attribute information in order to be legible, i.e. to be 244 | able to transmit the same content between originator and receiver. Many of 245 | these control characters have direct equivalents in particular markup 246 | languages, since markup handles these concerns efficiently. If both 247 | characters and their markup equivalents may be present in the same text, the 248 | question of priority is raised. Therefore it is important to identify and 249 | resolve these ambiguities at the time markup is first applied.
250 |Besides the basic character encoding and text markup there is 255 | a third contributor to text functionality, namely styling. Markup is 256 | concerned with the logical structure of the text or data, e.g. to 257 | indicate sections, subsections, and headers in a document, or to indicate the 258 | various fields of an address record. Styling is used to present the 259 | information in various ways, e.g. in different fonts, different type 260 | styles (italic, bold), different colors, etc. Some character codes do 261 | not encode a generic character, but a styled character. Where these 262 | characters are used, styling information is frozen, i.e. it is no 263 | longer possible to alter the appearance of the text by applying style 264 | information. However, there are many examples where a historically free 265 | stylistic variation has over time become a semantic distinction that is 266 | properly encoded as plain text. Sometimes, what is a free variation in some 267 | contexts, implies strict semantic differentiation in others. In all such 268 | instances, altering the appearance of the text by styling information would 269 | irreparably alter the content of the text. This is of particular concern with 270 | mathematical notation or systems for phonetic and phonemic transcription 271 | which make extensive semantic use of styles on a character by character 272 | basis.
273 |Dealing with various functionalities on the markup level has 278 | the additional advantage that in most cases, text portions that need some 279 | particular attribute (or styling) are actually those text portions identified 280 | by markup. A paragraph may be in French, a citation may need a bidi 281 | embedding, a keyword may be in italics, a list number may be circled, and so 282 | on. This makes it very efficient to associate those attributes with 283 | markup.
284 |However, where local or point-like functionality is needed, 285 | markup is not very efficient and its main benefit, easy manipulation 286 | of scope, is not required. On the contrary, the intrusion of markup in the 287 | middle of words can make search or sort operations more difficult. For these 288 | cases expressing the information as character codes is not only a viable, but 289 | often the preferred alternative, which needs to be considered in the design 290 | of markup languages.
291 |Character encoding works with a range of integers used as 296 | character codes. This is extremely efficient, but has some limitations. 297 | Markup, on the other hand, is much more extensible. Using technologies such 298 | as XML Namespaces [Namespace] and their application 299 | in schema languages like [XML Schema], various 300 | vocabularies can be mixed.
301 |The suitability of a particular character for markup depends on its status 306 | in the Unicode Standard, the nature of its behavior in text and the 307 | availability of equivalent markup. Many format characters that are needed for 308 | advanced plain text are not suitable for use with markup. Section 3 gives a list and detailed descriptions. 310 | However, not all format characters are unsuitable for use with markup. Section 4 provides a list of format characters that are 312 | suitable for use with markup and gives some discussion about their use. In 313 | addition to format characters, the Unicode Standard also has compatibility 314 | characters, some of which may be replaceable by suitable markup. These 315 | characters are discussed in Section 5.
316 |There are characters which are unsuitable in the context of markup in 323 | XML/HTML and whose use is discouraged, because one or more of the following 324 | conditions apply:
325 |Section 3.1 provides a list of such characters. 333 | Sections 3.2 through 3.10 discuss in more detail the following points for the discouraged 334 | characters.
335 |The following table contains the characters currently considered not 347 | suitable for use with markup in XML or HTML. (See however the note in the Introduction.) They 348 | may also be unsuitable for other markup or page layout languages. For 349 | determining possible conflict this report uses the markup available in 350 | HTML.
| Codepoints | Names/Description | Short Comment |
|---|---|---|
| U+0340..U+0341 | Clones of grave and acute | Deprecated in Unicode |
| U+17A3, U+17D3 | Obsolete characters for Khmer | Deprecated in Unicode |
| U+2028..U+2029 | Line and paragraph separator | Use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent |
| U+202A..U+202E | BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in [HTML4.01] |
| U+206A..U+206B | Activate/Inhibit Symmetric swapping | Deprecated in Unicode |
| U+206C..U+206D | Activate/Inhibit Arabic form shaping | Deprecated in Unicode |
| U+206E..U+206F | Activate/Inhibit National digit shapes | Deprecated in Unicode |
| U+FFF9..U+FFFB | Interlinear annotation characters | Use ruby markup [Ruby] |
| U+FEFF | as ZWNBSP | Use U+2060 Word Joiner instead |
| U+FEFF | as Byte Order Mark | Use only at the start of a file, not as part of markup |
| U+FFFC | Object replacement character | Use markup, e.g. HTML <object> or HTML <img> |
| U+1D173..U+1D17A | Scoping for Musical Notation | Use an appropriate markup language |
| U+E0000..U+E007F | Language Tag code points | Use xhtml:lang or xml:lang |
Except for Line and Paragraph Separator, or the Byte Order Mark, it is 440 | acceptable for browsers and similar user agents to ignore the presence of 441 | discouraged characters in HTML or XML. It is up to authoring tools to ensure 442 | proper conversion between these characters and equivalent markup where it 443 | exists.
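An authoring tool could locate the discouraged characters with a simple range lookup. The sketch below is illustrative only: the ranges are copied from the table in this section, while the function name `find_discouraged` and the choice to treat a leading U+FEFF as a file signature are assumptions made for this example.

```python
# Ranges taken from the table of discouraged characters; labels are shortened.
DISCOURAGED = [
    (0x0340, 0x0341, "clones of grave and acute"),
    (0x17A3, 0x17A3, "obsolete Khmer character"),
    (0x17D3, 0x17D3, "obsolete Khmer character"),
    (0x2028, 0x2029, "line and paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated formatting character"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFEFF, 0xFEFF, "ZWNBSP / byte order mark"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical notation scoping"),
    (0xE0000, 0xE007F, "language tag code point"),
]


def find_discouraged(text):
    """Return (index, 'U+XXXX', label) for each discouraged character."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # Assumption: a U+FEFF at index 0 is a signature, not content.
        if cp == 0xFEFF and i == 0:
            continue
        for lo, hi, label in DISCOURAGED:
            if lo <= cp <= hi:
                hits.append((i, f"U+{cp:04X}", label))
                break
    return hits
```

A caller might report each hit to the user or pass it to a converter that emits the equivalent markup where one exists.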
444 |Short description: The line and paragraph separator provide 449 | unambiguous means to denote hard line breaks and paragraph delimiters in 450 | plain text.
451 |Reason for inclusion: These characters were introduced into the 452 | Unicode Standard to overcome the ambiguous and widely divergent use of 453 | control codes for this purpose. See Section 454 | 5.8, Newline Guidelines, in [Unicode].
455 |Problems when used in markup: Including these characters in 456 | markup text does not work where it would duplicate the existing markup 457 | commands for delimiting paragraphs and lines.
Problems with other uses: They can also be problematic when used in plain text, because legacy data is usually converted code point for code point into Unicode and all receivers of Unicode plain text have to effectively be able to interpret the existing use of control codes for this purpose. As a result, fewer Unicode implementations support these characters than would be the case otherwise.
464 |Replacement markup: In HTML, use <xhtml:br /> instead of 465 | U+2028 and surround paragraphs by <xhtml:p> and </xhtml:p> 466 | instead of separating them with U+2029.
467 |What to do if detected: In a browser context, treat as white 468 | space, or ignore. When received in an editing context, replace the character 469 | by the corresponding markup.
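For the editing context, the replacement can be sketched as follows. This is a minimal illustration, not a complete converter: the function name `separators_to_markup` is hypothetical, empty paragraphs are dropped by assumption, and the output assumes the 'xhtml:' prefix is declared elsewhere.

```python
from html import escape

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def separators_to_markup(text: str) -> str:
    """Wrap paragraphs in xhtml:p and turn U+2028 into xhtml:br."""
    out = []
    for para in text.split(PARAGRAPH_SEPARATOR):
        if not para:
            continue  # assumption: ignore empty paragraphs
        # Escape markup-significant characters in each line of text.
        lines = [escape(line) for line in para.split(LINE_SEPARATOR)]
        out.append("<xhtml:p>" + "<xhtml:br />".join(lines) + "</xhtml:p>")
    return "".join(out)
```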
Short description: The bidi embedding controls are required to supplement the Unicode Bidirectional Algorithm in plain text.
Reason for inclusion: The Unicode Bidirectional Algorithm unambiguously resolves the display direction for bidirectional text. It does so by assigning all characters directional categories and then resolving these in context. In a number of circumstances this implicit method does not produce satisfactory results, and embedding controls are needed to ensure that sender and receiver agree on the display direction for a given text. See Unicode Standard Annex #9, The Bidirectional Algorithm [UAX 9].
483 |Problems when used in markup: These characters duplicate 484 | available markup, which is better suited to handle the stateful nature of 485 | their effect.
486 |Problems with other uses: The embedding controls introduce a 487 | state into the plain text, which must be maintained when editing or 488 | displaying the text. Processes that are modifying the text without being 489 | aware of this state may inadvertently affect the rendering of large portions 490 | of the text, for example by removing a PDF.
Replacement markup: The following table gives the replacement markup:

| Unicode | Equivalent markup | Comment |
|---|---|---|
| RLO | <xhtml:bdo dir = "rtl"> | |
| LRO | <xhtml:bdo dir = "ltr"> | |
| PDF | </xhtml:bdo> | when used to terminate RLO or LRO only, otherwise ignore |
| RLE | dir = "rtl" | attribute on block or inline element |
| LRE | dir = "ltr" | attribute on block or inline element |
For details on bidi markup, please see Section 8.2 of HTML [HMTL 4.0-8.2]. The text of HTML 4.0 gives this 531 | recommendation:
> Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to insure proper nesting of markup and directional embedding or override, otherwise, rendering results are undefined.
This document goes beyond HTML and recommends that only the markup 552 | should be used.
553 |The interpretation of how to handle directionality markup 554 | for block level elements differs in different versions of [CSS].
556 |What to do if detected: In a browser context, ignore. When 557 | received in an editing context, replace the characters by the appropriate 558 | markup.
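Because the embedding controls are stateful, an editor that replaces them with markup has to track nesting. The sketch below uses a stack for that purpose. It rests on several assumptions: the name `bidi_controls_to_markup` is hypothetical, LRE and RLE are mapped to a wrapping `xhtml:span` carrying the `dir` attribute (the replacement table only asks for the attribute on some block or inline element), an unmatched PDF is dropped, and scopes still open at the end of the text are closed there.

```python
from html import escape

LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"

# Opening control -> (start tag, end tag). The span wrapper is an assumption.
MARKUP = {
    LRE: ('<xhtml:span dir="ltr">', "</xhtml:span>"),
    RLE: ('<xhtml:span dir="rtl">', "</xhtml:span>"),
    LRO: ('<xhtml:bdo dir="ltr">', "</xhtml:bdo>"),
    RLO: ('<xhtml:bdo dir="rtl">', "</xhtml:bdo>"),
}


def bidi_controls_to_markup(text: str) -> str:
    """Replace bidi embedding controls with properly nested markup."""
    out, open_ends = [], []
    for ch in text:
        if ch in MARKUP:
            start, end = MARKUP[ch]
            out.append(start)
            open_ends.append(end)  # remember which end tag PDF must emit
        elif ch == PDF:
            if open_ends:
                out.append(open_ends.pop())
            # An unmatched PDF has nothing to terminate and is ignored.
        else:
            out.append(escape(ch))
    while open_ends:  # close scopes left open at the end of the text
        out.append(open_ends.pop())
    return "".join(out)
```

The stack makes the nesting explicit, which is the property the plain-text controls cannot guarantee once text has been edited.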
559 |Short description: These characters are deprecated. They were 564 | originally intended to allow explicit activation of contextual shaping, 565 | numeric digit rendering and symmetric swapping.
566 |Reason for inclusion: These characters were retained from draft 567 | versions of ISO 10646.
568 |Problems when used in markup: The processing model for these 569 | characters is not supported in markup.
570 |Problems with other uses: The Unicode Standard requires that 571 | symmetric swapping, contextual shaping, and alternate digit shapes are 572 | enabled by default and no longer supports inhibiting any of them by use of 573 | these character codes. The most likely effect of their occurrence in 574 | generated text would be that of a 'garbage' character.
575 |Conversion for use with markup: Apply the appropriate conversion 576 | to bring the data stream in line with the Unicode text model for 577 | bidirectional text and cursively-connected scripts.
578 |What to do if detected: When received by a browser as part of 579 | marked up text, they may be ignored. When received in an editing context, 580 | they may be removed, possibly with a warning. Alternatively, an appropriate 581 | conversion from the legacy text model may be provided. This will most likely 582 | be limited to applications directly interfacing with and knowledgeable of the 583 | particular legacy implementation that inspired these characters.
584 |Short description: U+FEFF has two functions. It is formally known 589 | as zero width no-break space (ZWNBSP), and can act as a word joiner, but its primary use is as byte 590 | order mark (BOM), to indicate in a file signature at the start of a file 591 | that a file is in a particular Unicode encoding form and of a particular byte 592 | order. Using U+FEFF as a word joiner in new data is deprecated as of [Unicode3.2] in favor of U+2060 word joiner (WJ). The use as byte 595 | order mark remains unaffected.
Reason for inclusion: Originally included in Unicode for the sole purpose of indicating byte order or use in file signatures, the character acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and Unicode. When used as a byte order mark the character is placed at the beginning of a file. If a recipient views it as FEFF, then the byte orders of sender and receiver match. If the recipient views it as FFFE (a non-character code point), then the sender used the opposite byte order from the recipient, and the recipient needs to invert the byte order or refuse to read the file. When used as a ZWNBSP the character is intended to prevent breaks between adjacent characters. This function is now provided by U+2060 word joiner (WJ), making it unnecessary to insert U+FEFF in the middle of a file. For more information see Chapter 16 of [Unicode].
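The byte order check described above can be sketched for UTF-16 data as follows. This is an illustration under stated assumptions: the function names are hypothetical, only UTF-16 is considered (the UTF-32 and UTF-8 signatures are not examined), and data without a signature is rejected rather than guessed at.

```python
def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None from the first two bytes."""
    if data[:2] == b"\xfe\xff":
        return "big"     # read as a big-endian code unit this is FEFF
    if data[:2] == b"\xff\xfe":
        return "little"  # a big-endian reader would see FFFE here
    return None


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 data, using the signature to pick the byte order."""
    order = utf16_byte_order(data)
    if order is None:
        raise ValueError("no byte order mark found")
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # The signature is part of the encoding, not of the text.
    return data[2:].decode(codec)
```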
609 |Problems when used in markup: Using U+FEFF as ZWNBSP makes it 610 | impossible to distinguish it from the case where a byte order mark was left 611 | in the middle of a file inadvertently due to incorrect splicing. U+FEFF can 612 | and in some cases (XML encoded in UTF-16) must be used at the start of a file 613 | containing markup, but as a signature, this is not part of actual markup or 614 | marked-up content. Some older versions of browsers and parsers may not 615 | correctly recognize U+FEFF at the start of a file encoded in UTF-8. For 616 | details of how U+FEFF participates in encoding detection of XML files, see 617 | Appendix F of [XML 1.0].
618 |Problems with other uses: The use of byte order mark as ZWNBSP is 619 | also problematic when used in plain text, and has been deprecated for that 620 | purpose in favor of U+2060 word 621 | joiner. The use of U+FEFF in file signatures to indicate byte order is 622 | the only recommended use of this character.
623 |Replacement markup: None. In locations other than the beginning 624 | of a text file, U+FEFF can be removed or replaced by U+2060 in an editing 625 | environment.
626 |What to do if detected: When received by a browser as part of 627 | marked-up text, treat depending on location. At the start of an external 628 | entity, treat as byte order mark (i.e. as part of the character encoding, not 629 | as part of the parsed character stream, see e.g. Section 4.3.3 of [XML 1.0]). Otherwise, assume it is older data using it as 631 | ZWNBSP. When receiving plain text in an editing environment, editors may take 632 | one or more of several actions: replace ZWNBSP in the middle of a file with 633 | WJ or issue a warning to the user.
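In an editing environment, the location-dependent treatment might look like the sketch below. The function name `normalize_feff` and the shape of its return value are assumptions for illustration; the returned indices stand in for the warning an editor could show the user.

```python
FEFF = "\ufeff"         # ZWNBSP / byte order mark
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def normalize_feff(text: str):
    """Drop a leading U+FEFF, replace the rest with U+2060.

    Returns (cleaned_text, indices), where indices are the positions in
    cleaned_text at which a replacement was made.
    """
    if text.startswith(FEFF):
        text = text[1:]  # at the start: a signature, not content
    indices = [i for i, ch in enumerate(text) if ch == FEFF]
    return text.replace(FEFF, WORD_JOINER), indices
```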
634 |Short description: The interlinear annotation characters are used 639 | to delimit interlinear annotations in certain circumstances. They are 640 | intended to provide text anchors and delimiters for interlinear annotation 641 | for in-process use and are not intended for interchange.
642 |Reason for inclusion: The interlinear annotation characters were 643 | included in Unicode only in order to reserve code points for very frequent 644 | application-internal use. The interlinear annotation characters are used to 645 | delimit interlinear annotations in contexts where other delimiters are not 646 | available, and where non-textual means exist to carry formatting information. 647 | Many text-processing applications store the text and the associated markup 648 | (or in some cases styling information) of a document in separate structures. 649 | The actual text is kept in a single linear structure; additional information 650 | is kept separately with pointers to the appropriate text positions. This is 651 | called out-of-band information. The overall implementation makes sure that 652 | these two structures are kept in sync. If the text contains interlinear 653 | annotations, it is extremely helpful for implementations to have delimiters 654 | in the text itself; even though delimiters are not otherwise used for style 655 | markup. With this method, and unlike the case of the object replacement 656 | character, all textual information can remain in the standard text stream, 657 | but any additional formatting information is kept separately. In addition, 658 | the Interlinear Annotation Anchor serves as a placeholder for formatting 659 | information for the whole annotation object, the same way a paragraph mark 660 | can be a placeholder to attach paragraph formatting information.
661 |Problems when used in markup: Including interlinear annotation 662 | characters in marked-up text does not work because the additional formatting 663 | information (how to position the annotation,...) is not available.
664 |Problems with other uses: The interlinear annotation characters 665 | are also problematic when used in plain text, and are not intended for that 666 | purpose. In particular, on older display systems that simply ignore or 667 | replace the Interlinear Annotation Characters, the meaning of the text may be 668 | changed.
669 |Replacement markup: The markup to be used in place of the 670 | Interlinear Annotation Characters depends on the formatting and nature of the 671 | interlinear annotation in question. For ruby, please see [Ruby].
673 |What to do if detected: When received by a browser as part of 674 | marked-up text, they may be ignored. When receiving plain text in an editing 675 | environment, editors may take one or more of several actions: remove U+FFF9 676 | together with removing all characters between U+FFFA and following U+FFFB; 677 | ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively, or 678 | into similar characters; issue a warning to the user; or tentatively convert 679 | into appropriate ruby markup for further editing and formatting by the 680 | user.
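Two of the editing actions listed above can be sketched in Python. Both function names are hypothetical, the ruby conversion assumes well-formed, non-nested annotations and the simple `ruby`/`rb`/`rt` structure with the 'xhtml:' prefix declared elsewhere, and any further formatting is left to the user.

```python
import re
from html import escape

# U+FFF9 anchor, base text, U+FFFA separator, annotation, U+FFFB terminator.
ANNOTATION = re.compile("\ufff9([^\ufff9-\ufffb]*)\ufffa([^\ufff9-\ufffb]*)\ufffb")


def annotations_to_brackets(text: str) -> str:
    """Ignore U+FFF9 and turn U+FFFA and U+FFFB into '[' and ']'."""
    return text.replace("\ufff9", "").replace("\ufffa", "[").replace("\ufffb", "]")


def annotations_to_ruby(text: str) -> str:
    """Tentatively convert annotations into simple ruby markup."""
    out, pos = [], 0
    for m in ANNOTATION.finditer(text):
        out.append(escape(text[pos:m.start()]))  # text before the annotation
        base, note = escape(m.group(1)), escape(m.group(2))
        out.append(
            f"<xhtml:ruby><xhtml:rb>{base}</xhtml:rb>"
            f"<xhtml:rt>{note}</xhtml:rt></xhtml:ruby>"
        )
        pos = m.end()
    out.append(escape(text[pos:]))  # text after the last annotation
    return "".join(out)
```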
681 |Short description: The object replacement character is used to 686 | stand in place of an object (e.g. an image) included in a text.
687 |Reason for inclusion: The object replacement character was 688 | included in Unicode only in order to reserve a codepoint for a very frequent 689 | application-internal use. Many text-processing applications store the text 690 | and the associated markup (or in some cases styling information) of a 691 | document in separate structures. The actual text is kept in a single linear 692 | structure; additional information is kept separately with pointers to the 693 | appropriate text positions. The overall implementation makes sure that these 694 | two structures are kept in sync. If the text contains objects such as images, 695 | it is extremely helpful for implementations to have a sentinel in the text 696 | itself; any additional information is kept separately.
697 |Problems when used in markup: Including an object replacement 698 | character in markup text does not work because the additional information 699 | (what object to include,...) is not available.
700 |Problems with other uses: The object replacement character is 701 | also problematic when used in plain text, because there is no way in plain 702 | text to provide the actual object information or a reference to it.
Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <xhtml:img src='...' />, <xhtml:object ...>, or <xhtml:applet ...>. These constructs allow providing all additional information needed to identify and use the object in question.
709 |What to do if detected: Browsers may ignore this character. When 710 | received in an editing context, if the actual object is accessible, editors 711 | may either replace the character by the appropriate markup for that object, 712 | or otherwise remove it, ideally providing a warning.
713 |Short description: A series of characters for controlling scope 718 | in musical notation.
719 |Reason for inclusion: These characters designate the start and 720 | end of common musical constructs. Full musical layout depends on additional 721 | information, for example pitch, that cannot be encoded using Unicode. 722 | However, many musical symbols may be depicted in isolation (and without 723 | assigning pitch) as part of a textual discussion of music. Plain text use of 724 | Unicode characters is primarily intended for this latter purpose. The scoping 725 | operators can be used to support limited renderings of beams, slurs, phrases, 726 | etc. in this context. However, in the context of markup languages, musical 727 | scoring calls for a dedicated markup language (analogous to MathML) which 728 | would be expected to contain markup for these constructs.
729 |Problems when used in markup: These characters duplicate 730 | information that can in principle be expressed in markup.
731 |Problems with other uses: Their special code range allows them to 732 | be easily filtered, but applications that do not expect them will treat them 733 | as garbage characters.
734 |Replacement markup: Replace with equivalent markup if 735 | available.
736 |What to do if detected: Browsers may ignore these characters. 737 | When received in an editing context, editors may remove or replace them by 738 | equivalent markup.
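The filtering mentioned above can be sketched as follows. This is a minimal illustration, not a normative procedure: it assumes the scoping controls occupy U+1D173..U+1D17A (BEGIN/END BEAM, TIE, SLUR and PHRASE), and the function name `strip_musical_scoping` is hypothetical.

```python
import re

# Assumption: the eight musical scoping format controls are
# U+1D173..U+1D17A (BEGIN/END BEAM, TIE, SLUR, PHRASE).
MUSICAL_SCOPING = re.compile("[\U0001D173-\U0001D17A]")


def strip_musical_scoping(text: str) -> str:
    """Remove the scoping controls and leave the visible symbols untouched."""
    return MUSICAL_SCOPING.sub("", text)


# BEGIN BEAM, two eighth notes, END BEAM
sample = "\U0001D173\U0001D160\U0001D160\U0001D174"
cleaned = strip_musical_scoping(sample)
```

An editor that has equivalent markup available would replace the controls rather than drop them; removal is only the simpler of the two options described above.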
739 |Short description: A series of characters for expressing language 744 | tags, based on existing standards for language tags using the rules in 745 | Chapter 16 of [Unicode].
Reason for inclusion: These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were included solely for the benefit of Internet protocols, such as ACAP, that require a standard mechanism for marking language in UTF-8 strings, while avoiding the use of other tagging schemes that rely on specific details of the encoding form used.
753 |Problems when used in markup: These characters duplicate 754 | information that can be expressed in markup.
755 |Problems with other uses: Their special code range allows them to 756 | be easily filtered, but applications that do not expect them will treat them 757 | as garbage characters.
758 |Replacement markup: Replace with equivalent language markup. XML 759 | and XHTML have the xml:lang attribute. HTML has the lang attribute. These 760 | attributes follow different scoping rules than the tag characters, therefore 761 | this replacement will generally not be a simple 1:1 substitution.
762 |What to do if detected: Browsers may ignore these characters. 763 | When received in an editing context, editors may remove or replace them by 764 | equivalent markup.
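One possible way for an editor to replace tag characters with language markup is sketched below. It assumes that U+E0001 LANGUAGE TAG introduces a tag, that U+E0020..U+E007E mirror the ASCII characters of the tag, and that U+E007F CANCEL TAG ends its scope. The function names and the choice of a flat `span` element are illustrative only. Because the attributes scope differently from the tag characters, real content may need nested elements rather than this flat sequence.

```python
from xml.sax.saxutils import escape, quoteattr

TAG_BASE = 0xE0000       # offset between a tag character and its ASCII twin
LANGUAGE_TAG = 0xE0001   # introduces a language tag
CANCEL_TAG = 0xE007F     # cancels the tag in effect


def split_language_runs(text):
    """Return (language, run) pairs; language is None for untagged text."""
    runs = []
    buf = []
    lang = None      # language currently in effect
    pending = None   # tag identifier being read, or None

    def flush():
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            flush()
            pending = ""
        elif cp == CANCEL_TAG:
            flush()
            lang = None
            pending = None
        elif 0xE0020 <= cp <= 0xE007E:
            if pending is not None:
                pending += chr(cp - TAG_BASE)
                lang = pending
            # Stray tag characters without LANGUAGE TAG are dropped.
        else:
            pending = None
            buf.append(ch)
    flush()
    return runs


def to_markup(text):
    """Wrap each tagged run in a span carrying xml:lang (illustrative element)."""
    parts = []
    for lang, run in split_language_runs(text):
        if lang is None:
            parts.append(escape(run))
        else:
            parts.append(f"<span xml:lang={quoteattr(lang)}>{escape(run)}</span>")
    return "".join(parts)


tagged = (
    "\U000E0001\U000E0066\U000E0072bonjour"   # LANGUAGE TAG, "fr"
    "\U000E0001\U000E0065\U000E006Ehello"     # LANGUAGE TAG, "en"
    "\U000E007Fplain"                         # CANCEL TAG
)
```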
765 |Short description: The Unicode Character Database [UnicodeData] lists all characters that have been 771 | deprecated in [Unicode]. This list may grow (slowly) 772 | over time. Deprecated characters remain valid characters forever, but their 773 | use is strongly discouraged. Deprecation of characters is applied only in 774 | exceptional circumstances. It is never the result of historical changes of a 775 | writing system: characters no longer in current, modern use are retained in 776 | Unicode, as they are needed for the representation of historical 777 | documents.
778 |Reason for inclusion: Usually, characters that are deprecated 779 | were never needed, but were inadvertently added to the Unicode Standard, 780 | perhaps based on incomplete information available at the time of encoding.
781 |Problems when used in markup: Except where noted elsewhere in 782 | this document, their presence in markup presents the same problems as in 783 | plain text, usually that of an unnecessary duplicate encoding.
784 |Problems with other uses: Depends on the character and the reason 785 | for its deprecation. For more information see [Unicode].
787 |Conversion for use with markup: For deprecated characters not 788 | discussed elsewhere in this document, see the relevant descriptions of those 789 | characters in [Unicode] for information on the 790 | recommended alternatives.
791 |What to do if detected: Unless a specific recommendation is 792 | given elsewhere, deprecated characters are not ignored; where possible, in an 793 | editing environment, a preferred alternate encoding may be substituted.
794 |The following table contains format characters that do not exhibit the 801 | problems discussed at the start of Section 3. Despite 802 | their apparent relation to or similarity with characters in table 3.1, they are considered suitable for use with markup. 804 | It is not acceptable for user agents to ignore the characters in table 4.1. 805 | For a description of these characters see [Unicode].
| Code points | Names/Description | Short Comment |
|---|---|---|
| U+00A0 | No-break Space | Line break control |
| U+00AD | Soft Hyphen | Line break control |
| U+034F | Combining Grapheme Joiner | Used in sorting |
| U+0600 | Arabic Number Sign | Subtending mark |
| U+0601 | Arabic Sign Sanah | Subtending mark |
| U+0602 | Arabic Footnote Marker | Subtending mark |
| U+0603 | Arabic Sign Safha | Subtending mark |
| U+06DD | Arabic End of Ayah | Enclosing mark |
| U+070F | Syriac Abbreviation Mark (SAM) | Supertending mark |
| U+0F0C | Tibetan Mark Delimiter Tsheg Bstar | Non-breaking form of 0F0B |
| U+115F..U+1160 | Hangul Jamo Fillers | Filler |
| U+180B..U+180E | Mongolian Free Variation Selectors (FVS1..FVS3), Mongolian Vowel Separator | Required for Mongolian |
| U+200B | Zero-width Space | Line break control |
| U+200C..U+200D | Zero-width Join Controls (ZWNJ and ZWJ) | Required for, among others, Persian and many Indic scripts |
| U+200E..U+200F | Implicit Directional Marks (LRM and RLM) | LRM and RLM are allowed |
| U+2011 | Non-breaking Hyphen | Line break control |
| U+202F | Narrow No-break Space | Line break control/Mongolian |
| U+2044 | Fraction Slash | Or use markup (MathML) |
| U+2060 | Word Joiner | Use for that purpose instead of U+FEFF ZWNBSP |
| U+2061..U+2064 | Invisible Mathematical Operators | Mathematical use |
| U+2FF0..U+2FFB | Ideographic Character Description | Graphic characters (not controls) |
| U+303E | Ideographic Variation Indicator | Graphic character (not a control) |
| U+FFA0 | Halfwidth Hangul Filler | Filler, not generally required |
| U+FE00..U+FE0F | Variation Selectors | Modify graphic characters |
| U+E0100..U+E01EF | Variation Selectors | Modify graphic characters |

The following subsections briefly discuss some of the characters from the 952 | above list, particularly those that affect more than their immediately 953 | adjacent neighbors. Please see the Unicode Standard [Unicode] for full details.
Subtending marks are needed to represent a common feature in the Arabic and Syriac scripts where a mark can be placed below a range of characters, for example below a sequence of digits, to indicate a year. The Syriac abbreviation mark is placed above a series of characters, making it technically a supertending mark, and the ARABIC END OF AYAH is an enclosing mark. In the character stream, a subtending mark precedes the affected characters. The end of the affected range of characters is defined implicitly, usually by the first non-alphanumeric character.
966 |Unlike subtending marks, the scope of combining enclosing 967 | marks, such as combining 969 | enclosing circle, is limited to the preceding default grapheme 970 | cluster. For details on grapheme clusters see Unicode Standard Annex #29: 971 | "Text Boundaries", [UAX 29] .
No existing markup can represent the scoping and layout functions defined by these characters, so they cannot be substituted. It is unresolved to what degree intervening markup affects the scope of these marks.
976 |The fraction slash is used between sequences of decimal 980 | digits to form fractions. Whether the resulting fraction has a horizontal or 981 | diagonal fraction line is unspecified. The fallback is to leave the digits 982 | unchanged and display a regular slash. In order to separate a digit from a 983 | following fraction, as in 1¾, the use of U+2009 THIN SPACE is recommended.
985 |For better control of fractions the use of [MathML] is suggested where appropriate.
987 |A variation selector is intended to cause a specific variant form (or 991 | range of variant forms) when applied to a base character. For a variation 992 | selector to have an effect it must immediately follow its base character. 993 | Only pre-determined combinations of selected base characters and specific 994 | variation selectors have a defined effect. All other combinations are 995 | ill-formed and are to be ignored. The list of standardized combinations is 996 | documented in the Unicode Character Database, see [Variants]. In addition to the 256 generic variation 998 | selectors, there are 3 Mongolian free variation selectors. They 999 | function in all other ways like variation selectors, except they only apply 1000 | to base characters from the Mongolian script. Since Mongolian, like Arabic, 1001 | has positional character shapes, the variations are limited to particular 1002 | shaping contexts.
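The ranges involved can be checked mechanically. The sketch below assumes the generic variation selectors occupy U+FE00..U+FE0F and U+E0100..U+E01EF (16 + 240 = 256) and that the Mongolian free variation selectors are U+180B..U+180D. The `orphaned_selectors` helper is hypothetical and deliberately coarse: it only flags selectors that cannot possibly follow a base character, and it does not consult the list of standardized combinations.

```python
def is_variation_selector(ch: str) -> bool:
    """True for the 256 generic variation selectors."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def is_mongolian_fvs(ch: str) -> bool:
    """True for the three Mongolian free variation selectors."""
    return 0x180B <= ord(ch) <= 0x180D


def _is_selector(ch: str) -> bool:
    return is_variation_selector(ch) or is_mongolian_fvs(ch)


def orphaned_selectors(text: str) -> list:
    """Indexes of selectors at the start of the text, or directly after
    white space or another selector (coarse check, an assumption of this sketch)."""
    orphans = []
    for i, ch in enumerate(text):
        if not _is_selector(ch):
            continue
        if i == 0 or _is_selector(text[i - 1]) or text[i - 1].isspace():
            orphans.append(i)
    return orphans


generic_count = sum(is_variation_selector(chr(cp)) for cp in range(0x110000))
```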
1003 |Ideographic Description Characters are included in the Unicode Standard as 1007 | a means to indicate the composition of ideographs from a combination of 1008 | pieces (terms), where each piece or term is either a Unicode character or 1009 | composed. Ordinarily the result would be a human readable description of a 1010 | character, perhaps one for which a font is not available. However, at least 1011 | some vendors are interested in automatic conversion of these sequences into 1012 | single ideographs.
1013 |These characters are needed to convey the intended meaning of a 1017 | mathematical expression to an automated parser whenever two elements are 1018 | simply written next to each other. See Unicode Technical Report #25: "Unicode 1019 | Support for Mathematics" [UTR25] for more details.
Most of these characters prevent line breaks adjacent to them, but ZWSP and SHY provide invisible line break opportunities. The detailed function of these characters is described in Unicode Standard Annex #14: "Line Breaking Properties" [UAX14]. While high-end applications may be able to deduce line breaking opportunities automatically solely with the help of very generic markup or styling properties, the use of these characters currently provides the most reliable and straightforward way to control line breaking and hyphenation. Note that [HTML4.01] also uses U+00A0 NO-BREAK SPACE as a "hard space" (i.e. a space with a fixed width), something that is not part of its character semantics in [Unicode].
1034 |U+2011 NON-BREAKING HYPHEN (NBHY) is used to encode a hyphen that does not 1035 | provide a line break opportunity. In several languages, the sequence <SHY, 1036 | NBHY> may be used to handle special line breaking behavior for explicit 1037 | hyphens, see [UAX14].
1038 |These should not be needed except for texts that need to have a fixed 1042 | number of jamos per Korean syllable block. See the description of Korean 1043 | Syllable Blocks in [Unicode].
The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on"; in other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to a combination of their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately. This section provides guidance on when and how to apply compatibility mappings in the case of importing text from non-XML (non-marked-up) sources. The section is organized by the "compatibility tag" associated with each compatibility mapping.
1062 |The following table gives an overview of the various compatibility 1065 | characters, organized by "compatibility tag". The first column, Tag 1066 | value, contains the value of the "compatibility tag" from the Unicode 1067 | Character Database [UnicodeData]. Although these 1068 | tags use "<" and ">", they do not appear as such in markup and should 1069 | not be confused with XML tags. Code range indicates a further break 1070 | down by code points. Action summarizes the recommended action to be 1071 | taken whenever markup is first applied to non-XML text. Each entry indicates 1072 | whether the characters can be substituted using the compatibility equivalent 1073 | according to Normalization Form KC of [UAX 15], can be 1074 | replaced by equivalent markup where available, or should be retained. For 1075 | some cases, instead of or in addition to markup, style information [CSS] is needed. Description and usage provides 1077 | additional information. Sections 5.3 through 5.6 provide additional information for some of these 1079 | sets of compatibility characters including detailed recommended actions.
| Tag value | Code range | Action | Description and usage |
|---|---|---|---|
| <circled> | all | retain | Circled letters and digits used for list item markers, and in running text |
| <compat> | 2002..200A | retain | Fixed width spaces |
| | 2100..2101 | retain | Variant letter forms that are used as symbols |
| | 2105..2106 | retain | Variant letter forms that are used as symbols |
| | 2121, 213B | retain | For use as single code point in vertical layout |
| | 2160..217F | retain, or use list item marker style, or normalize | For use as single code point in vertical layout, or as list item marker |
| | 2474..249B | retain, or use list item marker style, or normalize | Parenthesized or dotted number used as list item marker |
| | 249C..24B5 | retain, or use list item marker style, or normalize | Parenthesized letters used as list item markers |
| | 3131..318E | retain | Compatibility Hangul Jamo. These do not conjoin |
| | 3200..3229 | retain, or use list item marker style, or normalize | Parenthesized characters used as list item markers |
| | 322A..3243 | retain | Parenthesized characters used as symbols in vertical layout |
| | 32C0..32CB | retain | String used as single code point in vertical layout |
| | all other | retain | Maintain, semantic distinctions apply |
| <final> | all | normalize | Arabic Presentation forms |
| <font> | all | retain | Variant letter forms that are used as symbols |
| <fraction> | all | normalize | As long as fraction slash is supported! |
| <initial> | all | normalize | Arabic Presentation forms |
| <isolated> | all | normalize | Arabic Presentation forms |
| <medial> | all | normalize | Arabic Presentation forms |
| <narrow> | all | retain | Half-width characters |
| <noBreak> | all | retain | The compatibility mapping merely indicates the equivalent breaking character. The noBreak distinction must be preserved |
| <small> | all | retain | Precise usage unknown. Maintain, but do not generate |
| <square> | 3300..3357 | retain | Single display cell cluster containing multiple lines of kana for vertical layout |
| | 3358..337D | retain | For use as single code point in vertical layout |
| | 33E0..33FE | retain | For use as single code point in vertical layout |
| | all other | retain | Variant letter form used as symbol in vertical layout |
| <sub> | 2080..208E | retain, or use markup | Subscript digits 0-9, as well as minus, plus, equal and parens |
| | all other | retain | Subscript characters, usually used as modifier letters in phonetic notation |
| <super> | 00B2..00B3, 00B9, 2070, 2074..207E | retain, or use markup | Superscript digits 0-9, as well as minus, plus, equal and parens |
| | all other | retain | Superscript characters, usually used as modifier letters in phonetic notation |
| <vertical> | all | normalize | East Asian Presentation forms |
| <wide> | all | retain | Full-width characters |

Some symbols used in vertical layout exist as single code 1307 | points in legacy systems, but can also be composed on the fly by more 1308 | advanced display engines. There are currently no style properties that 1309 | could be used to express squared Kana clusters (kumimoji) or 1310 | horizontal in vertical writing mode (tate-chu-yoko).
Presentation forms and characters for which adequate representation exists as marked up text should never be entered into new data. Many of the characters with <font> tag are however suitable for new data, as long as they are used in the manner they are intended, that is as symbols, with definite semantic differentiation between the different forms. The largest set of these characters exists to carry essential semantic distinctions in mathematical notation, where any loss of markup during text export would compromise the meaning of the text. Most of the characters with <super> and <sub> tag have been encoded for use in phonetic or phonemic transcriptions, where they act as ordinary letters and the use of style markup is therefore deemed inappropriate. However, it is inappropriate to use any of these classes of characters to create the appearance of styled text runs.
1328 |For example to write hello
1329 | (in italic style), one should use the
1330 | equivalent of <span style='font-style: italic;'>hello</span>,
1331 | and not the sequence of Unicode characters ℎℯℓℓℴ
1332 | (U+210E PLANCK CONSTANT,
1333 | U+212F SCRIPT SMALL E,
1334 | U+2113 SCRIPT SMALL L,
1335 | U+2113 SCRIPT SMALL L,
1336 | U+2134 SCRIPT SMALL O).
1337 | Conversely, to indicate Planck's constant one should use
1338 | ℎ (U+210E PLANCK CONSTANT) and not something such as
1339 | <span style='font-style: italic;'>h</span>.
When style is applied across entire words, sentences or paragraphs, the 1341 | use of markup is preferred. When style is applied to individual letters, 1342 | especially to letters inside a word, giving them a particular interpretation, 1343 | the use of character codes is preferred. See also Section 5.6.
1345 |Short description: Characters with a <circled> tag or 1349 | characters with <compat> tag and compatibility mapping to a 1350 | parenthesized string.
1351 |Reason for inclusion: They are most frequently used for marking 1352 | enumerated list items, but the characters with a <circled> tag often 1353 | occur as dingbats or footnote markers in tables. The same characters are used 1354 | in regular text when citing an item from a corresponding ordered list.
Problems when used in markup: These characters do not cause undue interaction with markup.
1357 |Problems with other uses: None
1358 |Replacement markup: (in text use) these characters are often used 1359 | in running text; sometimes, but not exclusively, in situations where the text 1360 | is to be associated with an item from a nearby numbered list. Replacement 1361 | markup may not be available, and the support for such markup is much more 1362 | limited today than was anticipated when this document was first written.
1363 |(list item style) When generating marked up text these characters occur 1364 | only internal to the user agent when list item styles are rendered. When 1365 | marking up plain text data they could be converted to suitable list item 1366 | styles, if such use can be properly inferred. The default recommendation is 1367 | to retain the original character.
1368 |(characters with compatibility mappings of the form "(n)" or 1369 | "n." or roman numerals) Unlike circled characters, these could be 1370 | rendered by sequences of regular characters. Using a list item marker style 1371 | would in theory allow the support of longer lists (the Unicode characters are 1372 | limited to the set (1) to (20) and "1." to "20."). Using regular character 1373 | sequences would also allow the use of fonts that match the text of the 1374 | list.
1375 |What to do if detected: No action needs to be taken by browsers. 1376 | When received in an editing context, substitution of a list item marker style 1377 | may be appropriate. However, the same characters are very often used as 1378 | dingbat-like symbols in tables, or may appear in general text, whether or not 1379 | referring to an item from a list. Therefore the user must have the choice of 1380 | whether to replace the character.
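The difference between circled markers and the parenthesized or dotted ones can be seen by looking at their compatibility mappings. The sketch below is illustrative only: it assumes the code ranges 2474..249B and 249C..24B5 from Table 5.1, leaves Roman numerals aside, and uses a hypothetical helper name. Whether to substitute at all remains the user's decision, as described above.

```python
import unicodedata

# Parenthesized or dotted numbers, and parenthesized letters (Table 5.1).
LIST_MARKER_RANGES = ((0x2474, 0x249B), (0x249C, 0x24B5))


def plain_marker(ch: str):
    """Return a sequence of regular characters for a list marker character,
    or None when the character is outside the ranges above and is retained."""
    cp = ord(ch)
    if any(low <= cp <= high for low, high in LIST_MARKER_RANGES):
        return unicodedata.normalize("NFKC", ch)
    return None
```

Circled characters are deliberately not handled: their compatibility mapping drops the circle, so no sequence of regular characters reproduces them.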
1381 |Short description: Single character fractions such as ½ or ¼.
1385 |Reason for inclusion: Subsets of these occur in practically all 1386 | legacy character sets.
1387 |Problems when used in markup: The character repertoire is limited 1388 | to a few common fractions. When used with more general methods of generating 1389 | fractions such as MathML [MathML] the usual problem of 1390 | dual representation arises.
1391 |Problems with other uses: Other than normalization issues, these 1392 | characters present no undue problems in plain text. Where fraction slash is 1393 | supported, these can be expressed by substituting their compatibility 1394 | mappings.
1395 |Replacement markup: MathML can represent fractions unambiguously. 1396 | When using fraction slash, care must be taken such that values like 3½ do not 1397 | turn into 31/2 (=15.5).
1398 |What to do if detected: No action needs to be taken by browsers 1399 | or editors, except when converting plain text to MathML.
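The hazard with values like 3½ is easy to reproduce. The following sketch assumes that a U+2009 THIN SPACE is an acceptable separator between a whole number and the fraction that follows it. The helper name `expand_fractions` is hypothetical.

```python
import unicodedata

THIN_SPACE = "\u2009"


def expand_fractions(text: str) -> str:
    """Replace single-character fractions by digit, FRACTION SLASH, digit,
    separating them from a preceding digit with a thin space."""
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            if out and out[-1][-1:].isdigit():
                out.append(THIN_SPACE)
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


naive = unicodedata.normalize("NFKC", "3\u00BD")   # digits run together
careful = expand_fractions("3\u00BD")              # whole number stays separate
```

Here `naive` is the digits 3 and 1 followed by U+2044 FRACTION SLASH and 2, which reads as thirty-one halves, while `careful` keeps the 3 apart from the fraction.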
1400 |Short description: Characters that are symbols composed of groups 1404 | of typically kana or Latin letters, digits plus slash for use in a single 1405 | display cell in vertical display of text.
1406 |Reason for inclusion: Many existing character sets contain these 1407 | as precomposed characters since for simple implementations this is the only 1408 | way to support the common use of providing metric units and other 1409 | abbreviations in a single character cell for vertical text layout.
Problems when used in markup: Proposed markup, including CSS styling, would be able to express an unbounded set of these abbreviations, obviating the need to catalogue them in the character encoding standard and making them more directly accessible to text based processing, for example searching.
1415 |Problems with other uses: The repertoire of these legacy 1416 | characters is limited; many more combinations are in actual use than are 1417 | accounted for in character sets. Pre-composed symbols do not make their text 1418 | content available to search engines. They also require re-encoding for text 1419 | laid out horizontally.
1420 |Replacement markup: None available.
1421 |What to do if detected: No action required. (Subject to change 1422 | pending the outcome of current proposals.)
1423 |Short description: Mainly super and subscript digits, but also 1427 | signs, parentheses and a large number of letters.
Reason for inclusion: Super and subscripted letters and digits are quite common in some forms of phonetic or phonemic transcriptions, where the use of styles is both awkward and prone to data integrity issues when exported to plain text. For super or subscripted letters in phonetic transcription in particular, a change from superscript or subscript to regular style would alter the meaning. Note that such use in transcription is not limited to letters: superscripted small digits are often used to indicate tone. When used for these purposes, these characters should be retained and markup should not be used.
A few super and subscript characters, primarily the digits, also occur in many legacy character sets, including Latin-1. Their use in pure plain text is common for databases, e.g. including metric units for part descriptions (viz. cm²) or for (usually simplified) formulae as occur in titles of scientific publications.
1442 |When used in mathematical context (MathML) it is recommended to 1443 | consistently use style markup for superscripts and subscripts. This is 1444 | because mathematical layout allows not just individual symbols, but entire 1445 | expressions to be superscripted or subscripted in a regular, nested 1446 | manner.
1447 |Problems when used in markup: Mixing direct use of these 1448 | characters with the use of style markup provides multiple representations of 1449 | the same text, leading to potentially different treatment by search and 1450 | display engines.
However, when super and subscripts are to reflect semantic distinctions, it is easier to work with these meanings encoded in text rather than markup, for example, in phonetic or phonemic transcription. Otherwise, they would require markup in the middle of words, and they may also be inadvertently changed to normal style text when exporting to plain text. This applies to the majority of super and subscripted characters in Unicode. On the other hand, some user agents may support certain superscripted or subscripted characters only when used as marked up text, for example because of lack of font support for them.
1460 |Problems with other uses: none
1461 |Replacement markup: Unless used as letters, <xhtml:sup> and 1462 | <xhtml:sub> or <mathml:msup> and <mathml:msub> may be 1463 | used.
What to do if detected: Both representations (with or without style markup) should be equivalent for search purposes. Input methods for mathematical texts might enforce the use of styles. If superscript characters are encountered during display of mathematical formulae, it is recommended that they be displayed in a manner indistinguishable from that achieved by using regular characters with corresponding style markup.
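One way a search engine might treat both representations as equivalent is to fold superscript and subscript characters to their base characters when building a search key, while leaving the stored text unchanged. This is a sketch under that assumption. The name `fold_scripts` is hypothetical, and the key must never be written back, since folding modifier letters in phonetic text would alter its meaning.

```python
import unicodedata


def fold_scripts(text: str) -> str:
    """Build a search key in which <super> and <sub> characters are
    replaced by their compatibility equivalents; other characters are kept."""
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith(("<super>", "<sub>")):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

With this key, the plain text "cm²" and the text content of markup such as cm followed by a superscripted 2 both index as "cm2".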
1470 |Short description: The <compat> label was given to a set of 1474 | compatibility characters whose further classification was not settled at the 1475 | time the standard was created. The largest components are list item marker 1476 | characters.
1477 |Reason for inclusion: These characters occur in many legacy 1478 | character sets.
1479 |Problems when used in markup: none. There usually is no 1480 | equivalent markup.
1481 |Problems with other uses: none
1482 |Replacement markup: none.
1483 |What to do if detected: No action required.
1484 |The Unicode Standard defines 66 non-character code points, or noncharacters. These are the last two positions on each of the 17 1489 | planes, in other words, all characters whose code points end in ...FFFE or 1490 | ...FFFF, as well as the 32 code points from U+FDD0 to U+FDEF. Applications 1491 | are free to use any of these code points internally but should never attempt 1492 | to interchange them. In effect, noncharacters can be thought of as 1493 | application-internal private-use code points.
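The definition above translates directly into a test on the code point value. The sketch below is illustrative and the function names are hypothetical. It could be used, for instance, to check text before interchange.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str) -> list:
    """Return (index, code point) pairs for noncharacters found in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_noncharacter(ord(ch))]


# 17 planes x 2 code points + 32 code points in U+FDD0..U+FDEF
total = sum(is_noncharacter(cp) for cp in range(0x110000))
```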
1494 |This section presents common issues with white space characters in markup 1498 | languages, mostly based on their difference in function as part of the 1499 | structure of the markup source (syntactic white space) on the one hand and as 1500 | part of the document content on the other hand.
1501 |The set of characters in the Unicode standard that have the property 1502 | "White_Space" (see 'White Space' in the [UCD]) is 1503 | quite large. It includes white space characters with different line breaking 1504 | properties, different ligating properties, and different widths. It is 1505 | appropriate to use these characters as part of markup content for their very 1506 | specific purpose. It is preferable to place them in the markup source so 1507 | that they are surrounded by ordinary characters rather than line breaks for 1508 | example. The set of white space characters defined by typical markup 1509 | language specifications is a subset of the characters that are considered 1510 | white space by [Unicode] .
Each markup language defines the set of characters that it accepts as part of the markup syntax; this is usually a very small set. The XML [XML1.0] and [XML1.1] specifications define white space as a combination of one or more of the following characters: U+0020 SPACE, carriage return (U+000D), line feed (U+000A), or tab (U+0009). [HTML4.01] adds to these the form feed character (U+000C), but that character cannot be used in any XHTML version.
In addition, markup languages may use conventions for converting or removing some kinds of white space. XML processors replace some combinations of end-of-line characters by a single line feed character. [XML1.0] normalizes any two-character sequence (U+000D U+000A) or any U+000D not followed by U+000A to a single U+000A. [XML1.1] also normalizes NEL (U+0085) and U+2028 LINE SEPARATOR, but U+2029 PARAGRAPH SEPARATOR is not treated that way. Additional processing of white space before it is handed to an application also occurs for attribute values: line breaks are replaced by spaces, leading and trailing spaces are removed, and sequences of spaces are replaced by a single space.
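The two normalization steps described above can be approximated as follows. This is a sketch, not a conforming XML processor: the function names are hypothetical, and the handling of the pair U+000D U+0085 follows the end-of-line rules of the XML 1.1 specification rather than anything stated in this document. The attribute helper models values of attributes not declared as CDATA, and for simplicity it turns tabs into spaces as well as line breaks.

```python
import re

# XML 1.0: CR LF, or a lone CR, becomes LF.
_XML10_EOL = re.compile("\r\n|\r")
# XML 1.1: additionally CR NEL, NEL and LINE SEPARATOR become LF.
_XML11_EOL = re.compile("\r\n|\r\x85|\r|\x85|\u2028")


def normalize_line_ends(text: str, version: str = "1.0") -> str:
    """Translate end-of-line sequences to a single U+000A."""
    pattern = _XML11_EOL if version == "1.1" else _XML10_EOL
    return pattern.sub("\n", text)


def normalize_attribute(value: str) -> str:
    """Replace line breaks and tabs by spaces, trim, and collapse runs of spaces."""
    value = re.sub("[\t\n\r]", " ", value)
    return re.sub(" +", " ", value.strip(" "))
```

Note that U+2029 PARAGRAPH SEPARATOR passes through both versions unchanged, matching the description above.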
1530 |In XML, white space is purely syntactic inside tags, for example, to 1531 | separate the element name from attributes, and between elements in element 1532 | content models (as they are typical for data-oriented applications). White 1533 | space in element content models is used to lay out the markup source, using 1534 | line breaks and indentation, to improve readability. The same use of white 1535 | space is possible in many cases in mixed content (typical for text-oriented 1536 | applications).
1537 |Because XML is used for a very wide range of applications, after the 1538 | processing steps mentioned above it passes all white space to the 1539 | application. Some XML applications such as [XHTML] may 1540 | have their own white space processing rules when processing white space 1541 | characters. Also, applications and software transforming XML (e.g. [XSLT]) have specific conventions of how they handle white 1543 | space, and specific ways of how to control this behavior. To appropriately 1544 | use white space characters, readers are advised to examine all involved 1545 | standards and software.
1546 |If the characters U+2028 and U+2029 appear in text, they may be treated as 1547 | zero-width characters without semantic meaning (see Section 3.2).
1548 |White space that is not purely syntactic, including control codes that 1551 | define a newline function (see Section 5.8, Newline Guidelines, in [Unicode]), can be handled in three main ways.
When reflowing, line breaks and adjacent white space can be treated as space, removed, collapsed with adjacent control characters of the same type, or treated as zero-width space. Which choice is appropriate depends on the script of the surrounding text. The assumption is that line breaks and adjacent white space (in particular following white space, used for indentation) were added to make the markup source more readable, in particular to make each line fit on a line of a plain text editor. For scripts that use spaces, line breaks will have been inserted where there originally was a space; treating them as spaces therefore preserves the intended separation between words. For scripts which do not use spaces, such as Ideographic scripts or certain South East Asian scripts, such as Thai, line feeds should be removed, or replaced by U+200B zero width space. The choice of treatment can depend on the script value of the characters preceding and following the line feed character, assuming these characters belong to the same run of text.
The Unicode Standard [Unicode] specifies that the zero width space is considered a valid line-break point, that if two characters with a zero width space in between are placed on the same line they are placed with no space between them, and that if they are placed on two lines no additional glyph area is created at the line-break.
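The script-dependent reflowing described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the script test is a rough assumption based on Unicode character names, whereas a real implementation would consult the Unicode Script property.

```python
import re
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Assumption: approximate "script does not use spaces" by character name
# prefix. This list is illustrative and far from complete.
_NO_SPACE_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "THAI", "LAO", "KHMER")

def uses_no_spaces(ch: str) -> bool:
    """Return True if ch appears to belong to a script written without spaces."""
    return unicodedata.name(ch, "").startswith(_NO_SPACE_PREFIXES)

# One or more line breaks together with adjacent spaces and tabs
# (for example trailing spaces and the indentation of the next line).
_BREAK = re.compile(r"[ \t]*(?:(?:\r\n|\r|\n)[ \t]*)+")

def reflow(text: str, use_zwsp: bool = False) -> str:
    """Replace each line break according to the script of its neighbors."""
    def replace(match: re.Match) -> str:
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before and after and uses_no_spaces(before) and uses_no_spaces(after):
            # Both neighbors are from scripts without spaces:
            # drop the break, or keep a line-break opportunity with U+200B.
            return ZWSP if use_zwsp else ""
        # Otherwise the break most likely replaced an original space.
        return " "
    return _BREAK.sub(replace, text)
```

For example, `reflow("hello\n    world")` yields text with a single space between the two words, while a break between two Thai or ideographic characters is removed or, with `use_zwsp=True`, replaced by U+200B so that a line-break opportunity is retained.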
The details of reflowing are the responsibility of the various markup applications (e.g. [XHTML]). However, there is a tendency to move this functionality from markup applications to styling, so that it can be shared across applications.
Authors should be aware that the script-specific treatment of line breaks described above is not yet available in all implementations (e.g. browsers). For scripts that do not use white space to separate words, it may therefore still be advisable not to split long lines.
Editing tools should try to support the user in the appropriate use of white space. Some white space characters cannot easily be entered via a keyboard, but others, e.g. U+3000 Ideographic Space, can. Editing tools should try to make sure that only line breaks and white space accepted as syntactic white space by the relevant markup language are used to improve the readability of the markup source.
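As an illustration of the kind of check an editing tool could perform, the following sketch flags indentation characters that Unicode regards as white space but that XML 1.0 does not accept as syntactic white space. The function name and the report format are assumptions made for this example. Only leading white space on each line is inspected, on the assumption that this is where white space is used for source readability.

```python
# The four characters that XML 1.0 accepts as syntactic white space.
XML_WHITE_SPACE = " \t\r\n"

def check_indentation(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each leading white space
    character that is not XML syntactic white space."""
    problems = []
    # Split on U+000A only, so that characters such as U+2028 stay visible.
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not ch.isspace():
                break  # end of the indentation on this line
            if ch not in XML_WHITE_SPACE:
                problems.append((lineno, col, f"U+{ord(ch):04X}"))
    return problems

source = (
    "<doc>\n"
    "\u3000\u3000<p>text\u3000more</p>\n"  # indented with U+3000
    "  <p>ok</p>\n"
    "</doc>"
)
report = check_indentation(source)
```

In this example `report` lists the two U+3000 characters used as indentation on the second line. The U+3000 inside the paragraph content is not reported, since it may be intended as part of the text.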
While the styling possibilities provided by CSS and its implementations have not reached the level of professional typesetting systems, they offer a wide range of ways to control the layout and spacing of text. A very simple example is text centering, which in pure plain text would have been done by inserting an appropriate number of spaces on each line.
Mark Davis and Hideki Hiura contributed to the early drafts. Yukka Korpela and Felix Sasaki provided input to the current document.
The following changes have been made since the Working Group Note of 2013-01-24:
See the GitHub commit log for more details and diffs.