Unicode in XML and other Markup Languages

General Considerations

There are several general points to consider when looking at the interaction between character encoding and markup:

Linearity of text vs. hierarchy of markup structure
Overlap of control codes and markup semantics
Markup vs. Styling
Coincidence of semantic markup and functions
Extensibility of markup

Linearity versus Structure

Encoding text as a sequence of characters without further information leads to a linear sequence, commonly called plain text. Character follows character, without any particular structure. Markup, on the other hand, defines a hierarchical structure for the text or data. In the case of XML and most other, similar markup languages, the markup defines a tree structure. While this tree structure is linearized for transmission in the XML document, once the document has been parsed, the tree is available directly.

Operations that are easy to perform on trees are often difficult to perform on linear sequences and vice versa. By separating functionality between character encoding and markup appropriately, the architecture becomes simpler, more powerful and longer-lasting.

In particular, operations on hierarchical structures can easily make sure that information is kept in context. Attributes assigned to parts of a document are moved together with the associated part of the document. Assigning an attribute to a part of a document limits the scope of the attribute to that part of the document. Performing the same operations on linear sequences of characters, using control codes to set attributes and to delimit their scope, requires much more work and is error-prone. Locating the start or end of a span of text with the same attribute requires scanning backwards and forwards for the embedded delimiter or control code. Moving or editing text often results in mismatched control codes, so that an attribute might suddenly apply to text it was not intended for.

Overlap of Control Code and Markup Semantics

When markup is not available, plain text may require control characters. This is usually the case where plain text must contain some scoping or attribute information in order to be legible, i.e. to be able to transmit the same content between originator and receiver. Many of these control characters have direct equivalents in particular markup languages, since markup handles these concerns efficiently. If both characters and their markup equivalents may be present in the same text, the question of priority is raised. Therefore it is important to identify and resolve these ambiguities at the time markup is first applied.

Markup and Styling

Besides the basic character encoding and text markup there is a third contributor to text functionality, namely styling. Markup is concerned with the logical structure of the text or data, e.g. to indicate sections, subsections, and headers in a document, or to indicate the various fields of an address record. Styling is used to present the information in various ways, e.g. in different fonts, different type styles (italic, bold), different colors, etc. Some character codes do not encode a generic character, but a styled character. Where these characters are used, styling information is frozen, i.e. it is no longer possible to alter the appearance of the text by applying style information. However, there are many examples where a historically free stylistic variation has over time become a semantic distinction that is properly encoded as plain text. Sometimes what is a free variation in some contexts implies a strict semantic differentiation in others. In all such instances, altering the appearance of the text by styling information would irreparably alter the content of the text. This is of particular concern with mathematical notation or systems for phonetic and phonemic transcription, which make extensive semantic use of styles on a character-by-character basis.

Coincidence of Markup and Functions

Dealing with various functionalities on the markup level has the additional advantage that in most cases, text portions that need some particular attribute (or styling) are actually those text portions identified by markup. A paragraph may be in French, a citation may need a bidi embedding, a keyword may be in italics, a list number may be circled, and so on. This makes it very efficient to associate those attributes with markup.

However, where local or point-like functionality is needed, markup is not very efficient and its main benefit, easy manipulation of scope, is not required. On the contrary, the intrusion of markup in the middle of words can make search or sort operations more difficult. For these cases expressing the information as character codes is not only a viable, but often the preferred alternative, which needs to be considered in the design of markup languages.

Extensibility of Markup

Character encoding works with a range of integers used as character codes. This is extremely efficient, but has some limitations. Markup, on the other hand, is much more extensible. Using technologies such as XML Namespaces [Namespace] and their application in schema languages like [XML Schema], various vocabularies can be mixed.

Suitability of Characters in Markup

The suitability of a particular character for markup depends on its status in the Unicode Standard, the nature of its behavior in text and the availability of equivalent markup. Many format characters that are needed for advanced plain text are not suitable for use with markup. Section 3 gives a list and detailed descriptions. However, not all format characters are unsuitable for use with markup. Section 4 provides a list of format characters that are suitable for use with markup and gives some discussion about their use. In addition to format characters, the Unicode Standard also has compatibility characters, some of which may be replaceable by suitable markup. These characters are discussed in Section 5.

Characters not Suitable for Use with Markup

There are characters which are unsuitable in the context of markup in XML/HTML and whose use is discouraged, because one or more of the following conditions apply:

They are deprecated in the Unicode Standard.
They are unsupportable without additional data.
They are difficult to handle because they are stateful.
They are better handled by markup.
They are undesirable because of conflict with equivalent markup.

Section 3.1 provides a list of such characters. Sections 3.2 through 3.10 discuss in more detail the following points for the discouraged characters:

Short description of semantics
Reason for inclusion in Unicode
Specific problems when used with markup
Other areas where problems may occur (e.g. plain text)
What kind of markup to use instead
What to do if detected in a particular context

Table of Characters not Suitable for Use with Markup

The following table contains the characters currently considered not suitable for use with markup in XML or HTML. (See, however, the note in the Introduction.) They may also be unsuitable for other markup or page layout languages. For determining possible conflict this report uses the markup available in HTML.

Characters not suitable for use with markup

Code points	Names/Description	Short Comment
U+0340..U+0341	Clones of grave and acute	Deprecated in Unicode
U+17A3, U+17D3	Obsolete characters for Khmer	Deprecated in Unicode
U+2028..U+2029	Line and paragraph separator	Use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent
U+202A..U+202E	BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)	Strongly discouraged in [HTML4.01]
U+206A..U+206B	Activate/Inhibit Symmetric swapping	Deprecated in Unicode
U+206C..U+206D	Activate/Inhibit Arabic form shaping	Deprecated in Unicode
U+206E..U+206F	Activate/Inhibit National digit shapes	Deprecated in Unicode
U+FFF9..U+FFFB	Interlinear annotation characters	Use ruby markup [Ruby]
U+FEFF	as ZWNBSP	Use U+2060 Word Joiner instead
U+FEFF	as Byte Order Mark	Use only at the start of a file, not as part of markup
U+FFFC	Object replacement character	Use markup, e.g. HTML <object> or HTML <img>
U+1D173..U+1D17A	Scoping for Musical Notation	Use an appropriate markup language
U+E0000..U+E007F	Language Tag code points	Use xhtml:lang or xml:lang

Except for the line and paragraph separators and the byte order mark, it is acceptable for browsers and similar user agents to ignore the presence of discouraged characters in HTML or XML. It is up to authoring tools to ensure proper conversion between these characters and equivalent markup where it exists.
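
As an illustration of how an authoring tool might locate these characters before converting them, the following is a minimal sketch in Python. The function name find_discouraged and the shortened labels are assumptions made for this example; the code point ranges are copied from the table above.

```python
# Ranges copied from the table above; the labels are shortened for the example.
DISCOURAGED = [
    (0x0340, 0x0341, "clones of grave and acute"),
    (0x17A3, 0x17A3, "obsolete Khmer character"),
    (0x17D3, 0x17D3, "obsolete Khmer character"),
    (0x2028, 0x2029, "line and paragraph separator"),
    (0x202A, 0x202E, "bidi embedding controls"),
    (0x206A, 0x206F, "deprecated formatting characters"),
    (0xFEFF, 0xFEFF, "ZWNBSP / byte order mark"),
    (0xFFF9, 0xFFFB, "interlinear annotation characters"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical scoping controls"),
    (0xE0000, 0xE007F, "language tag code points"),
]


def find_discouraged(text):
    """Return (index, code point, label) for each discouraged character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for low, high, label in DISCOURAGED:
            if low <= cp <= high:
                hits.append((index, f"U+{cp:04X}", label))
                break
    return hits
```

This sketch also reports a U+FEFF at index 0, where it is a legitimate byte order mark; a caller would probably want to strip a leading signature before scanning.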

Line and Paragraph Separator, U+2028..U+2029

Short description: The line and paragraph separator provide unambiguous means to denote hard line breaks and paragraph delimiters in plain text.

Reason for inclusion: These characters were introduced into the Unicode Standard to overcome the ambiguous and widely divergent use of control codes for this purpose. See Section 5.8, Newline Guidelines, in [Unicode].

Problems when used in markup: Including these characters in markup text does not work where it would duplicate the existing markup commands for delimiting paragraphs and lines.

Problems with other uses: They can also be problematic when used in plain text, because legacy data is usually converted code point for code point into Unicode and all receivers of Unicode plain text effectively have to be able to interpret the existing use of control codes for this purpose. As a result, fewer Unicode implementations support these characters than would otherwise be the case.

Replacement markup: In HTML, use <xhtml:br /> instead of U+2028 and surround paragraphs by <xhtml:p> and </xhtml:p> instead of separating them with U+2029.

What to do if detected: In a browser context, treat as white space, or ignore. When received in an editing context, replace the character by the corresponding markup.
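
The editing-context replacement can be sketched as follows. This is only an illustration: the function name separators_to_markup is hypothetical, the element names are written without the xhtml: prefix, and the input is assumed to be plain text that contains no markup of its own.

```python
from html import escape

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def separators_to_markup(text):
    """Wrap U+2029-delimited paragraphs in <p> and turn U+2028 into <br />."""
    paragraphs = []
    for para in text.split(PARAGRAPH_SEPARATOR):
        if not para:
            continue  # a trailing separator does not create an empty paragraph
        lines = [escape(line) for line in para.split(LINE_SEPARATOR)]
        paragraphs.append("<p>" + "<br />".join(lines) + "</p>")
    return "\n".join(paragraphs)
```

Because U+2029 separates paragraphs while the p element surrounds them, the conversion has to look at whole paragraphs instead of substituting character by character.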

Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF), U+202A..U+202E

Short description: The bidi embedding controls are required to supplement the Unicode Bidirectional Algorithm in plain text.

Reason for inclusion: The Unicode Bidirectional Algorithm unambiguously resolves the display direction for bidirectional text. It does so by assigning all characters directional categories and then resolving these in context. In a number of circumstances this implicit method does not produce satisfactory results and embedding controls are needed to ensure that sender and receiver agree on the display direction for a given text. See Unicode Standard Annex #9, The Bidirectional Algorithm [UAX 9].

Problems when used in markup: These characters duplicate available markup, which is better suited to handle the stateful nature of their effect.

Problems with other uses: The embedding controls introduce a state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF.

Replacement markup: The following table gives the replacement markup:

Unicode	Equivalent markup	Comment
RLO	<xhtml:bdo dir = "rtl">	
LRO	<xhtml:bdo dir = "ltr">	
PDF	</xhtml:bdo>	when used to terminate RLO or LRO only, otherwise ignore
RLE	dir = "rtl"	attribute on block or inline element
LRE	dir = "ltr"	attribute on block or inline element

For details on bidi markup, please see Section 8.2 of HTML [HTML 4.0-8.2]. The text of HTML 4.0 gives this recommendation:

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to insure proper nesting of markup and directional embedding or override, otherwise, rendering results are undefined.

This document goes beyond HTML and recommends that only the markup should be used.

The interpretation of how to handle directionality markup for block-level elements differs between versions of [CSS].

What to do if detected: In a browser context, ignore. When received in an editing context, replace the characters by the appropriate markup.
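
One possible shape for the editing-context conversion is sketched below, under several assumptions: the function name is hypothetical, a span element is chosen as the inline carrier for the dir attribute, element names are written without the xhtml: prefix, and the input is a single paragraph of plain text. A stack keeps the generated markup properly nested. Unlike the table above, the sketch lets a PDF that terminates an LRE or RLE close the generated span, since that is where the element's scope has to end.

```python
from html import escape

# Opening control -> (start tag, end tag). The span element is an assumption.
OPENERS = {
    "\u202A": ('<span dir="ltr">', "</span>"),  # LRE
    "\u202B": ('<span dir="rtl">', "</span>"),  # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>"),    # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>"),    # RLO
}
PDF = "\u202C"


def bidi_controls_to_markup(text):
    """Replace bidi embedding controls in one paragraph with nested markup."""
    out = []
    pending = []  # end tags of the elements that are currently open
    for ch in text:
        if ch in OPENERS:
            start_tag, end_tag = OPENERS[ch]
            out.append(start_tag)
            pending.append(end_tag)
        elif ch == PDF:
            if pending:
                out.append(pending.pop())
            # a PDF with nothing to terminate is dropped
        else:
            out.append(escape(ch))
    while pending:  # embeddings left open end with the paragraph
        out.append(pending.pop())
    return "".join(out)
```

For example, bidi_controls_to_markup("a\u202Eb\u202Cc") yields 'a<bdo dir="rtl">b</bdo>c'.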

Deprecated Formatting Characters, U+206A..U+206F

Short description: These characters are deprecated. They were originally intended to allow explicit activation of contextual shaping, numeric digit rendering and symmetric swapping.

Reason for inclusion: These characters were retained from draft versions of ISO 10646.

Problems when used in markup: The processing model for these characters is not supported in markup.

Problems with other uses: The Unicode Standard requires that symmetric swapping, contextual shaping, and alternate digit shapes be enabled by default and no longer supports inhibiting any of them by use of these character codes. The most likely effect of their occurrence in generated text would be that of a 'garbage' character.

Conversion for use with markup: Apply the appropriate conversion to bring the data stream in line with the Unicode text model for bidirectional text and cursively-connected scripts.

What to do if detected: When received by a browser as part of marked-up text, they may be ignored. When received in an editing context, they may be removed, possibly with a warning. Alternatively, an appropriate conversion from the legacy text model may be provided. This will most likely be limited to applications directly interfacing with and knowledgeable of the particular legacy implementation that inspired these characters.

Byte Order Mark, ZWNBSP, U+FEFF

Short description: U+FEFF has two functions. It is formally known as zero width no-break space (ZWNBSP), and can act as a word joiner, but its primary use is as byte order mark (BOM): placed in a file signature at the start of a file, it indicates that the file is in a particular Unicode encoding form and of a particular byte order. Using U+FEFF as a word joiner in new data is deprecated as of [Unicode3.2] in favor of U+2060 word joiner (WJ). The use as byte order mark remains unaffected.

Reason for inclusion: Originally included in Unicode for the sole purpose of indicating byte order or use in file signatures, the character acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and Unicode. When used as a byte order mark the character is placed at the beginning of a file. If a recipient views it as FEFF then the byte order of sender and receiver matches. If the recipient views it as FFFE (a non-character code point) then the sender used the opposite byte order from the recipient, and the recipient needs to invert the byte order or refuse to read the file. When used as a ZWNBSP the character is intended to prevent breaks between adjacent characters. This function is now provided by U+2060 word joiner (WJ), making it unnecessary to insert U+FEFF in the middle of a file. For more information see Chapter 16 of [Unicode].
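
The FEFF/FFFE check described above can be sketched for 16-bit data as follows. The function name and its return values are hypothetical, and the recipient's byte order is passed in explicitly ("big" or "little") so that the example does not depend on the machine it runs on.

```python
def check_byte_order(data, recipient_order):
    """Interpret the first 16-bit unit of data in the recipient's byte order.

    recipient_order is "big" or "little", as accepted by int.from_bytes.
    """
    first_unit = int.from_bytes(data[:2], recipient_order)
    if first_unit == 0xFEFF:
        return "match"   # sender and recipient use the same byte order
    if first_unit == 0xFFFE:
        return "swap"    # sender used the opposite byte order
    return "no BOM"
```

A recipient that gets "swap" would then invert the byte order of every 16-bit unit, or refuse to read the file.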

Problems when used in markup: Using U+FEFF as ZWNBSP makes it impossible to distinguish it from the case where a byte order mark was left in the middle of a file inadvertently due to incorrect splicing. U+FEFF can and in some cases (XML encoded in UTF-16) must be used at the start of a file containing markup, but as a signature, this is not part of actual markup or marked-up content. Some older versions of browsers and parsers may not correctly recognize U+FEFF at the start of a file encoded in UTF-8. For details of how U+FEFF participates in encoding detection of XML files, see Appendix F of [XML 1.0].

Problems with other uses: The use of byte order mark as ZWNBSP is also problematic when used in plain text, and has been deprecated for that purpose in favor of U+2060 word joiner. The use of U+FEFF in file signatures to indicate byte order is the only recommended use of this character.

Replacement markup: None. In locations other than the beginning of a text file, U+FEFF can be removed or replaced by U+2060 in an editing environment.

What to do if detected: When received by a browser as part of marked-up text, treat depending on location. At the start of an external entity, treat as byte order mark (i.e. as part of the character encoding, not as part of the parsed character stream, see e.g. Section 4.3.3 of [XML 1.0]). Otherwise, assume it is older data using it as ZWNBSP. When receiving plain text in an editing environment, editors may take one or more of several actions: replace ZWNBSP in the middle of a file with WJ or issue a warning to the user.
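
The location-dependent treatment might look like the following in an editing tool. This is a sketch only: the function name normalize_feff is an assumption, and the input is assumed to be already decoded text in which a leading signature, if any, is still present as U+FEFF.

```python
BOM = "\uFEFF"
WORD_JOINER = "\u2060"


def normalize_feff(text):
    """Return (had_signature, body).

    A leading U+FEFF is treated as a signature and removed; any U+FEFF
    elsewhere is assumed to be a legacy ZWNBSP and replaced by U+2060.
    """
    had_signature = text.startswith(BOM)
    body = text[1:] if had_signature else text
    return had_signature, body.replace(BOM, WORD_JOINER)
```

An editor could compare the returned body with the input to decide whether to warn the user that replacements were made.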

Interlinear Annotation Characters, U+FFF9..U+FFFB

Short description: The interlinear annotation characters are used to delimit interlinear annotations in certain circumstances. They are intended to provide text anchors and delimiters for interlinear annotation for in-process use and are not intended for interchange.

Reason for inclusion: The interlinear annotation characters were included in Unicode only in order to reserve code points for very frequent application-internal use. The interlinear annotation characters are used to delimit interlinear annotations in contexts where other delimiters are not available, and where non-textual means exist to carry formatting information. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. This is called out-of-band information. The overall implementation makes sure that these two structures are kept in sync. If the text contains interlinear annotations, it is extremely helpful for implementations to have delimiters in the text itself, even though delimiters are not otherwise used for style markup. With this method, and unlike the case of the object replacement character, all textual information can remain in the standard text stream, but any additional formatting information is kept separately. In addition, the Interlinear Annotation Anchor serves as a placeholder for formatting information for the whole annotation object, the same way a paragraph mark can be a placeholder to attach paragraph formatting information.

Problems when used in markup: Including interlinear annotation characters in marked-up text does not work because the additional formatting information (how to position the annotation, ...) is not available.

Problems with other uses: The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose. In particular, on older display systems that simply ignore or replace the Interlinear Annotation Characters, the meaning of the text may be changed.

Replacement markup: The markup to be used in place of the Interlinear Annotation Characters depends on the formatting and nature of the interlinear annotation in question. For ruby, please see [Ruby].

What to do if detected: When received by a browser as part of marked-up text, they may be ignored. When receiving plain text in an editing environment, editors may take one or more of several actions: remove U+FFF9 together with removing all characters between U+FFFA and following U+FFFB; ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively, or into similar characters; issue a warning to the user; or tentatively convert into appropriate ruby markup for further editing and formatting by the user.
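
Three of these editor actions are sketched below. The function names are hypothetical, nested annotations are not handled, and the ruby/rb/rt element names in the tentative conversion are an assumption that should be checked against [Ruby] before use.

```python
import re
from html import escape

# U+FFF9 base text U+FFFA annotation text U+FFFB (nesting is not handled).
ANNOTATION = re.compile(
    "\uFFF9([^\uFFF9-\uFFFB]*)\uFFFA([^\uFFF9-\uFFFB]*)\uFFFB"
)


def strip_annotations(text):
    """Keep the annotated text, drop the delimiters and the annotation."""
    return ANNOTATION.sub(lambda m: m.group(1), text)


def bracket_annotations(text):
    """Drop U+FFF9 and show the annotation in square brackets."""
    return ANNOTATION.sub(lambda m: f"{m.group(1)}[{m.group(2)}]", text)


def annotations_to_ruby(text):
    """Tentatively convert annotations to ruby markup, escaping all text."""
    out = []
    pos = 0
    for m in ANNOTATION.finditer(text):
        out.append(escape(text[pos:m.start()]))
        out.append(
            f"<ruby><rb>{escape(m.group(1))}</rb>"
            f"<rt>{escape(m.group(2))}</rt></ruby>"
        )
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

The first two functions return plain text, while the third returns markup, which is why only the third escapes its input.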

Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a code point for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include, ...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text, because there is no way in plain text to provide the actual object information or a reference to it.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <xhtml:img src='...' />, <xhtml:object ...>, or <xhtml:applet ...>. These constructs allow providing all additional information needed to identify and use the object in question.

What to do if detected: Browsers may ignore this character. When received in an editing context, editors may replace the character by the appropriate markup for the object if the actual object is accessible, or otherwise remove it, ideally providing a warning.

Musical Controls, U+1D173..U+1D17A

Short description: A series of characters for controlling scope in musical notation.

Reason for inclusion: These characters designate the start and end of common musical constructs. Full musical layout depends on additional information, for example pitch, that cannot be encoded using Unicode. However, many musical symbols may be depicted in isolation (and without assigning pitch) as part of a textual discussion of music. Plain text use of Unicode characters is primarily intended for this latter purpose. The scoping operators can be used to support limited renderings of beams, slurs, phrases, etc. in this context. However, in the context of markup languages, musical scoring calls for a dedicated markup language (analogous to MathML) which would be expected to contain markup for these constructs.

Problems when used in markup: These characters duplicate information that can in principle be expressed in markup.

Problems with other uses: Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters.

Replacement markup: Replace with equivalent markup if available.

What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove or replace them by equivalent markup.

Language Tag Characters, U+E0000..U+E007F

Short description: A series of characters for expressing language tags, based on existing standards for language tags using the rules in Chapter 16 of [Unicode].

Reason for inclusion: These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were included solely for the benefit of those Internet protocols, such as ACAP, which require a standard mechanism for marking language in UTF-8 strings, and at the same time to avoid the use of other tagging schemes that relied on specific details of the encoding form used.

Problems when used in markup: These characters duplicate information that can be expressed in markup.

Problems with other uses: Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters.

Replacement markup: Replace with equivalent language markup. XML and XHTML have the xml:lang attribute. HTML has the lang attribute. These attributes follow different scoping rules from the tag characters; therefore this replacement will generally not be a simple 1:1 substitution.

What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove or replace them by equivalent markup.
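
As a first step towards such a replacement, an editor could split the text into runs with their declared language, as in the simplified sketch below. The function name is hypothetical, and the handling shown (U+E0001 introduces a tag, U+E0020..U+E007E spell it out in ASCII, U+E007F cancels it) is a reduced reading of the rules in [Unicode], not a complete implementation. Mapping the resulting runs onto elements that carry xml:lang or lang is left to the caller, because of the different scoping rules.

```python
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F
TAG_BASE = 0xE0000


def split_language_runs(text):
    """Return a list of (language or None, run of ordinary text)."""
    runs = []
    lang = None
    buf = []

    def flush():
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    i = 0
    while i < len(text):
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            tag = []
            # Tag characters mirror ASCII at an offset of 0xE0000.
            while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
                tag.append(chr(ord(text[i]) - TAG_BASE))
                i += 1
            lang = "".join(tag) or None
            continue
        if cp == CANCEL_TAG:
            flush()
            lang = None
        elif TAG_BASE <= cp <= CANCEL_TAG:
            pass  # stray tag character outside a tag: dropped
        else:
            buf.append(text[i])
        i += 1
    flush()
    return runs
```

For instance, text tagged as "fr" up to a cancel tag and untagged afterwards comes back as two runs, the first labelled "fr" and the second labelled None.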

Other Characters Deprecated in Unicode

Short description: The Unicode Character Database [UnicodeData] lists all characters that have been deprecated in [Unicode]. This list may grow (slowly) over time. Deprecated characters remain valid characters forever, but their use is strongly discouraged. Deprecation of characters is applied only in exceptional circumstances. It is never the result of historical changes of a writing system: characters no longer in current, modern use are retained in Unicode, as they are needed for the representation of historical documents.

Reason for inclusion: Usually, characters that are deprecated were never needed, but were inadvertently added to the Unicode Standard, perhaps based on incomplete information available at the time of encoding.

Problems when used in markup: Except where noted elsewhere in this document, their presence in markup presents the same problems as in plain text, usually that of an unnecessary duplicate encoding.

Problems with other uses: Depends on the character and the reason for its deprecation. For more information see [Unicode].

Conversion for use with markup: For deprecated characters not discussed elsewhere in this document, see the relevant descriptions of those characters in [Unicode] for information on the recommended alternatives.

What to do if detected: Unless a specific recommendation is given elsewhere, deprecated characters should not be ignored; where possible, in an editing environment, a preferred alternate encoding may be substituted.

Format Characters Suitable for Use with Markup

The following table contains format characters that do not exhibit the problems discussed at the start of Section 3. Despite their apparent relation to or similarity with characters in table 3.1, they are considered suitable for use with markup. It is not acceptable for user agents to ignore the characters in table 4.1. For a description of these characters see [Unicode].

Some characters that affect text format but are suitable for use with markup

809 | 810 | 811 | 812 | 815 | 818 | 821 | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 | 833 | 834 | 835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 881 | 882 | 883 | 884 | 885 | 886 | 887 | 888 | 889 | 890 | 891 | 892 | 893 | 894 | 895 | 896 | 897 | 898 | 899 | 900 | 901 | 902 | 903 | 904 | 905 | 906 | 907 | 908 | 909 | 910 | 911 | 912 | 913 | 914 | 915 | 916 | 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 | 933 | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | 943 | 944 | 945 | 946 | 947 | 948 | 949 |

Code 813 \| points 814 \|	Names/Description 817 \|	Short 819 \| Comment 820 \|
U+00A0	No-break Space	Line break control
U+00AD	Soft Hyphen	Line break control
U+034F	Combining Grapheme Joiner	Used in sorting
U+0600	Arabic Number Sign	Subtending mark
U+0601	Arabic Sign Sanah	Subtending mark
U+0602	Arabic Footnote Marker	Subtending mark
U+0603	Arabic Sign Safha	Subtending mark
U+06DD	Arabic End of Ayah	Enclosing mark
U+070F	Syriac Abbreviation Mark (SAM)	Supertending mark
U+0F0C	Tibetan Mark Delimiter Tsheg Bstar	Non-breaking form of 0F0B
U+115F..U+1160	Hangul Jamo Fillers	Filler
U+180B..U+180E	Mongolian Variation Selectors(FVS1..FVS3), Mongolian 880 \| Vowel Separator	Required for Mongolian
U+200B	Zero-width Space	Line break control
U+200C..U+200D	Zero-width Join Controls (ZWJ and ZWNJ)	Required for a.o. Persian and many Indic scripts
U+200E..U+200F	Implicit Directional Marks (LRM and RLM)	LRM and RLM are allowed
U+2011	Non-breaking Hyphen	Line break control
U+202F	Narrow No-break Space	Line break control/Mongolian
U+2044	Fraction Slash	Or use markup (MathML)
U+2060	Word Joiner	Use for that purpose instead of U+FEFF ZWNBSP
U+2061..U+2064	Invisible Mathematical Operators	Mathematical use
U+2FF0..U+2FFB	Ideographic Description Characters	Graphic characters (not controls)
U+303E	Ideographic Variation Indicator	Graphic character (not a control)
U+FFA0	Halfwidth Hangul Filler	Filler, not generally required
U+FE00..U+FE0F	Variation Selectors	Modify graphic characters
U+E0100..U+E01EF	Variation Selectors	Modify graphic characters

The following subsections briefly discuss some of the characters from the above list, particularly those that affect more than their immediately adjacent neighbors. Please see the Unicode Standard [Unicode] for full details.

Subtending Marks

Subtending marks are needed to represent a common feature in the Arabic and Syriac scripts where a mark can be placed below a range of characters, for example below a sequence of digits, to indicate a year. The Syriac abbreviation mark is placed above a series of characters, making it technically a supertending mark, and the ARABIC END OF AYAH is an enclosing mark. In the character stream, a subtending mark precedes the affected characters. The end of the affected range of characters is defined implicitly, usually by the first non-alphanumeric character.

Unlike subtending marks, the scope of combining enclosing marks, such as the combining enclosing circle, is limited to the preceding default grapheme cluster. For details on grapheme clusters see Unicode Standard Annex #29: "Text Boundaries", [UAX 29].

No existing markup can represent the scoping and layout functions defined by these characters, so they cannot be replaced by markup. It is unresolved to what degree intervening markup affects the scope of these marks.

Fraction Slash

The fraction slash is used between sequences of decimal digits to form fractions. Whether the resulting fraction has a horizontal or diagonal fraction line is unspecified. The fallback is to leave the digits unchanged and display a regular slash. In order to separate a digit from a following fraction, as in 1¾, the use of U+2009 THIN SPACE is recommended.

For better control of fractions, the use of [MathML] is suggested where appropriate.

Variation Selectors

A variation selector is intended to cause a specific variant form (or range of variant forms) when applied to a base character. For a variation selector to have an effect it must immediately follow its base character. Only pre-determined combinations of selected base characters and specific variation selectors have a defined effect. All other combinations are ill-formed and are to be ignored. The list of standardized combinations is documented in the Unicode Character Database, see [Variants]. In addition to the 256 generic variation selectors, there are 3 Mongolian free variation selectors. They function in all other ways like variation selectors, except they only apply to base characters from the Mongolian script. Since Mongolian, like Arabic, has positional character shapes, the variations are limited to particular shaping contexts.
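
The adjacency rule can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch only finds selectors that directly follow some other character. It does not consult the list of standardized combinations, which would have to be loaded from the Unicode Character Database, and it does not check that the preceding character is a valid base.

```python
def is_variation_selector(ch: str) -> bool:
    """Return True for the generic and Mongolian free variation selectors."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F        # VS1..VS16
        or 0xE0100 <= cp <= 0xE01EF   # VS17..VS256
        or 0x180B <= cp <= 0x180D     # Mongolian FVS1..FVS3
    )


def variation_sequences(text: str) -> list:
    """List (preceding character, selector) pairs found in text.

    A selector at the start of the text, or one that follows another
    selector, has nothing to apply to and is skipped.
    """
    pairs = []
    for prev, cur in zip(text, text[1:]):
        if is_variation_selector(cur) and not is_variation_selector(prev):
            pairs.append((prev, cur))
    return pairs
```

An application could then look up each returned pair in the standardized list and ignore those that are not found there.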

Ideographic Description Characters

Ideographic Description Characters are included in the Unicode Standard as a means to indicate the composition of ideographs from a combination of pieces (terms), where each piece or term is either a Unicode character or composed. Ordinarily the result would be a human readable description of a character, perhaps one for which a font is not available. However, at least some vendors are interested in automatic conversion of these sequences into single ideographs.
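
The recursive structure of such descriptions can be sketched in Python as follows. The sketch assumes prefix notation in which U+2FF2 and U+2FF3 take three terms and the other characters in U+2FF0..U+2FFB take two. The function names are hypothetical, and any other character is accepted as a term without checking that it is an ideograph or component.

```python
TERNARY = {"\u2FF2", "\u2FF3"}  # operators assumed to take three terms


def is_idc(ch: str) -> bool:
    """True for U+2FF0..U+2FFB, the range covered by this document."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


def parse_term(seq: str, pos: int = 0) -> int:
    """Consume one term starting at pos and return the index after it.

    Raises ValueError if the sequence ends before the term is complete.
    """
    if pos >= len(seq):
        raise ValueError("description sequence is incomplete")
    ch = seq[pos]
    pos += 1
    if is_idc(ch):
        # A description character is followed by two or three nested terms.
        for _ in range(3 if ch in TERNARY else 2):
            pos = parse_term(seq, pos)
    return pos


def is_complete_description(seq: str) -> bool:
    """True if seq consists of exactly one complete term."""
    try:
        return parse_term(seq) == len(seq)
    except ValueError:
        return False
```

For example, the three-character sequence U+2FF0 U+6C35 U+6BCF (a left-to-right composition of two pieces) is accepted, while the same sequence without its last character is not.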

Invisible Mathematical Operators

These characters are needed to convey the intended meaning of a mathematical expression to an automated parser whenever two elements are simply written next to each other. See Unicode Technical Report #25: "Unicode Support for Mathematics" [UTR25] for more details.

Line Break Controls

Most of these characters prevent line breaks adjacent to them, but ZWSP and SHY provide invisible line break opportunities. The detailed function of these characters is described in Unicode Standard Annex #14: "Line Breaking Properties" [UAX14]. While high-end applications may be able to deduce line breaking opportunities automatically, solely with the help of very generic markup or styling properties, the use of these characters currently provides the most reliable and straightforward way to control line breaking and hyphenation. Note that [HTML4.01] also uses U+00A0 NO-BREAK SPACE as a "hard space" (i.e. a space with a fixed width), something that is not part of its character semantics in [Unicode].

U+2011 NON-BREAKING HYPHEN (NBHY) is used to encode a hyphen that does not provide a line break opportunity. In several languages, the sequence <SHY, NBHY> may be used to handle special line breaking behavior for explicit hyphens, see [UAX14].

Hangul Fillers

These should not be needed except for texts that need to have a fixed number of jamos per Korean syllable block. See the description of Korean Syllable Blocks in [Unicode].

Characters with Compatibility Mappings

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on"; in other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to a combination of their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the context in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately. This section provides guidance on when and how to apply compatibility mappings in the case of importing text from non-XML (non-marked-up) sources. The section is organized by the "compatibility tag" associated with each compatibility mapping.

Overview

The following table gives an overview of the various compatibility characters, organized by "compatibility tag". The first column, "Tag value", contains the value of the "compatibility tag" from the Unicode Character Database [UnicodeData]. Although these tags use "<" and ">", they do not appear as such in markup and should not be confused with XML tags. "Code range" indicates a further breakdown by code points. "Action" summarizes the recommended action to be taken whenever markup is first applied to non-XML text. Each entry indicates whether the characters can be substituted using the compatibility equivalent according to Normalization Form KC of [UAX 15], can be replaced by equivalent markup where available, or should be retained. For some cases, instead of or in addition to markup, style information [CSS] is needed. "Description and usage" provides additional information. Sections 5.3 through 5.6 provide additional information for some of these sets of compatibility characters, including detailed recommended actions.

Characters with compatibility mappings

Tag value	Code range	Action	Description and usage
<circled>	all	retain	Circled letters and digits used for list item markers, and in running text
<compat>	2002..200A	retain	Fixed width spaces
	2100..2101	retain	Variant letter forms that are used as symbols
	2105..2106	retain	Variant letter forms that are used as symbols
	2121, 213B	retain	For use as single code point in vertical layout
	2160..217F	retain, or use list item marker style, or normalize	For use as single code point in vertical layout, or as list item marker
	2474..249B	retain, or use list item marker style, or normalize	Parenthesized or dotted number used as list item marker
	249C..24B5	retain, or use list item marker style, or normalize	Parenthesized letters used as list item markers
	3131..318E	retain	Compatibility Hangul Jamo. These do not conjoin
	3200..3229	retain, or use list item marker style, or normalize	Parenthesized characters used as list item markers
	322A..3243	retain	Parenthesized characters used as symbols in vertical layout
	32C0..32CB	retain	String used as single code point in vertical layout
	all other	retain	Maintain, semantic distinctions apply
<final>	all	normalize	Arabic Presentation forms
<font>	all	retain	Variant letter forms that are used as symbols
<fraction>	all	normalize	As long as fraction slash is supported!
<initial>	all	normalize	Arabic Presentation forms
<isolated>	all	normalize	Arabic Presentation forms
<medial>	all	normalize	Arabic Presentation forms
<narrow>	all	retain	Half-width characters
<noBreak>	all	retain	The compatibility mapping merely indicates the equivalent breaking character. The noBreak distinction must be preserved
<small>	all	retain	Precise usage unknown. Maintain, but do not generate
<square>	3300..3357	retain	Single display cell cluster containing multiple lines of kana for vertical layout
	3358..337D	retain	For use as single code point in vertical layout
	33E0..33FE	retain	For use as single code point in vertical layout
	all other	retain	Variant letter form used as symbol in vertical layout
<sub>	2080..208E	retain, or use markup	Subscript digits 0-9, as well as minus, plus, equal and parens
	all other	retain	Subscript characters, usually used as modifier letters in phonetic notation
<super>	00B2..00B3, 00B9, 2070, 2074..207E	retain, or use markup	Superscript digits 0-9, as well as minus, plus, equal and parens
	all other	retain	Superscript characters, usually used as modifier letters in phonetic notation
<vertical>	all	normalize	East Asian Presentation forms
<wide>	all	retain	Full-width characters

Some symbols used in vertical layout exist as single code points in legacy systems, but can also be composed on the fly by more advanced display engines. There are currently no style properties that could be used to express squared Kana clusters (kumimoji) or horizontal in vertical writing mode (tate-chu-yoko).

Generating New Text

Presentation forms and characters for which adequate representation exists as marked up text should never be entered into new data. Many of the characters with <font> tag are however suitable for new data, as long as they are used in the manner they are intended, that is as symbols, with definite semantic differentiation between the different forms. The largest set of these characters exists to carry essential semantic distinctions in mathematical notation, where any loss of markup during text export would compromise the meaning of the text. Most of the characters with <super> and <sub> tag have been encoded for use in phonetic or phonemic transcriptions, where they act as ordinary letters and the use of style markup is therefore deemed inappropriate. However, it is inappropriate to use any of these classes of characters to create the appearance of styled text runs.

When style is applied across entire words, sentences or paragraphs, the use of markup is preferred. When style is applied to individual letters, especially to letters inside a word, giving them a particular interpretation, the use of character codes is preferred. See also Section 5.6.

List Item Marker Characters

Short description: Characters with a <circled> tag or characters with <compat> tag and compatibility mapping to a parenthesized string.

Reason for inclusion: They are most frequently used for marking enumerated list items, but the characters with a <circled> tag often occur as dingbats or footnote markers in tables. The same characters are used in regular text when citing an item from a corresponding ordered list.

Problems when used in markup: These characters do not cause undue interaction with markup.

Problems with other uses: None.

Replacement markup: (in text use) these characters are often used in running text; sometimes, but not exclusively, in situations where the text is to be associated with an item from a nearby numbered list. Replacement markup may not be available, and the support for such markup is much more limited today than was anticipated when this document was first written.

(list item style) When generating marked up text these characters occur only internal to the user agent when list item styles are rendered. When marking up plain text data they could be converted to suitable list item styles, if such use can be properly inferred. The default recommendation is to retain the original character.

(characters with compatibility mappings of the form "(n)" or "n." or roman numerals) Unlike circled characters, these could be rendered by sequences of regular characters. Using a list item marker style would in theory allow the support of longer lists (the Unicode characters are limited to the set (1) to (20) and "1." to "20."). Using regular character sequences would also allow the use of fonts that match the text of the list.

What to do if detected: No action needs to be taken by browsers. When received in an editing context, substitution of a list item marker style may be appropriate. However, the same characters are very often used as dingbat-like symbols in tables, or may appear in general text, whether or not referring to an item from a list. Therefore the user must have the choice of whether to replace the character.

Fractions

Short description: Single character fractions such as ½ or ¼.

Reason for inclusion: Subsets of these occur in practically all legacy character sets.

Problems when used in markup: The character repertoire is limited to a few common fractions. When used with more general methods of generating fractions such as MathML [MathML] the usual problem of dual representation arises.

Problems with other uses: Other than normalization issues, these characters present no undue problems in plain text. Where fraction slash is supported, these can be expressed by substituting their compatibility mappings.

Replacement markup: MathML can represent fractions unambiguously. When using fraction slash, care must be taken such that values like 3½ do not turn into 31/2 (=15.5).

What to do if detected: No action needs to be taken by browsers or editors, except when converting plain text to MathML.
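
The pitfall with values like 3½ can be made concrete with a small Python sketch. It is one possible approach, not a prescribed algorithm: each single-character fraction is replaced by its compatibility mapping (digit, U+2044 FRACTION SLASH, digit), and a U+2009 THIN SPACE is inserted whenever the fraction directly follows a decimal digit. The function name expand_fractions is hypothetical.

```python
import unicodedata

THIN_SPACE = "\u2009"


def expand_fractions(text: str) -> str:
    """Replace single-character fractions by digit, fraction slash, digit.

    A thin space is inserted when the fraction directly follows a digit,
    so that the integer part does not run into the numerator.
    """
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            if out and out[-1][-1].isdecimal():
                out.append(THIN_SPACE)
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

Applying Normalization Form KC to the whole string instead would turn "3½" into "3", "1", fraction slash, "2" with no separator, which is exactly the ambiguity described above.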

Squared or Horizontal

Short description: Characters that are symbols composed of groups of typically kana or Latin letters, digits plus slash for use in a single display cell in vertical display of text.

Reason for inclusion: Many existing character sets contain these as precomposed characters since for simple implementations this is the only way to support the common use of providing metric units and other abbreviations in a single character cell for vertical text layout.

Problems when used in markup: Proposed markup, including CSS styling, would be able to express an unbounded set of these abbreviations, obviating the need to catalogue these in the character encoding standard and making them more directly accessible to text-based processing, for example searching.

Problems with other uses: The repertoire of these legacy characters is limited; many more combinations are in actual use than are accounted for in character sets. Pre-composed symbols do not make their text content available to search engines. They also require re-encoding for text laid out horizontally.

Replacement markup: None available.

What to do if detected: No action required. (Subject to change pending the outcome of current proposals.)

Superscripts and Subscripts

Short description: Mainly super and subscript digits, but also signs, parentheses and a large number of letters.

Reason for inclusion: Super and subscripted letters and digits are quite common in some forms of phonetic or phonemic transcriptions, where the use of styles is both awkward and prone to data integrity issues when exported to plain text. For super or subscripted letters in phonetic transcription in particular, a change from superscript or subscript to regular style would alter the meaning. Note that such use in transcription is not limited to letters: superscripted small digits are often used to indicate tone. When used for these purposes, these characters should be retained and markup should not be used.

A few super and subscript characters, primarily the digits, also occur in many legacy character sets, including Latin-1. Their use in pure plain text is common for databases, e.g. including metric units for part descriptions (viz. cm²) or for (usually simplified) formulae as occur in titles of scientific publications.

When used in mathematical context (MathML) it is recommended to consistently use style markup for superscripts and subscripts. This is because mathematical layout allows not just individual symbols, but entire expressions to be superscripted or subscripted in a regular, nested manner.

Problems when used in markup: Mixing direct use of these characters with the use of style markup provides multiple representations of the same text, leading to potentially different treatment by search and display engines.

However, when super and subscripts are to reflect semantic distinctions, it is easier to work with these meanings encoded in text rather than markup, for example, in phonetic or phonemic transcription. Otherwise, they would require markup in the middle of words, and they may also be inadvertently changed to normal style text when exporting to plain text. This applies to the majority of super and subscripted characters in Unicode. On the other hand, some user agents may support certain superscripted or subscripted characters only when used as marked up text, for example because of lack of font support for them.

Problems with other uses: None.

Replacement markup: Unless used as letters, <xhtml:sup> and <xhtml:sub> or <mathml:msup> and <mathml:msub> may be used.

What to do if detected: Both representations (with or without style markup) should be equivalent for search purposes. Input methods for mathematical texts might enforce the use of styles. If superscript characters are encountered during display of mathematical formulae, it is recommended that they be displayed in a manner indistinguishable from that achieved by using regular characters with corresponding style markup.
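
One way a search engine might make the two representations equivalent is to fold superscript and subscript characters to their compatibility equivalents when building a search key, so that "cm²" matches the text content of marked-up "cm" followed by a superscripted "2". The following Python sketch illustrates the idea. The function name search_key is hypothetical, and the folding is meant for search keys only, not for stored text, since it would destroy the distinctions needed in phonetic transcription.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold superscript and subscript characters to their plain equivalents."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<super>", "<sub>")):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)  # other compatibility characters are left alone
    return "".join(out)
```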

Other Characters Marked <compat>

Short description: The <compat> label was given to a set of compatibility characters whose further classification was not settled at the time the standard was created. The largest components are list item marker characters.

Reason for inclusion: These characters occur in many legacy character sets.

Problems when used in markup: None. There usually is no equivalent markup.

Problems with other uses: None.

Replacement markup: None.

What to do if detected: No action required.

White Space

This section presents common issues with white space characters in markup languages, mostly based on their difference in function as part of the structure of the markup source (syntactic white space) on the one hand and as part of the document content on the other hand.

The set of characters in the Unicode standard that have the property "White_Space" (see 'White Space' in the [UCD]) is quite large. It includes white space characters with different line breaking properties, different ligating properties, and different widths. It is appropriate to use these characters as part of markup content for their very specific purpose. It is preferable to place them in the markup source so that they are surrounded by ordinary characters rather than, for example, line breaks. The set of white space characters defined by typical markup language specifications is a subset of the characters that are considered white space by [Unicode].

Each markup language defines the set of characters that it accepts as part of the markup syntax; this is usually a very small set. The XML [XML1.0] and [XML1.1] specifications define white space as a combination of one or more of the following characters: U+0020 SPACE, carriage return (U+000D), line feed (U+000A), or tab (U+0009). [HTML4.01] adds to these the form feed character (U+000C), but that character cannot be used in any XHTML version.
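
The XML definition is small enough to state directly in code. The following Python sketch, with the hypothetical names XML_WHITE_SPACE and is_xml_white_space, tests whether a string consists only of the four characters listed above. It shows that characters such as U+00A0 or U+3000 are not syntactic white space for XML, even though they have the Unicode "White_Space" property.

```python
# The four characters XML accepts as white space.
XML_WHITE_SPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}


def is_xml_white_space(text: str) -> bool:
    """True if text is non-empty and consists only of XML white space."""
    return bool(text) and all(ch in XML_WHITE_SPACE for ch in text)
```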

In addition, markup languages may use conventions for converting or removing some kinds of white space. XML processors replace some combinations of end-of-line characters by a single line feed character. [XML1.0] normalizes any two-character sequence (U+000D U+000A) or any U+000D not followed by U+000A to a single U+000A. [XML1.1] also normalizes NEL (U+0085) and U+2028 LINE SEPARATOR, but U+2029 PARAGRAPH SEPARATOR is not treated that way. Additional processing of white space before it is handed to an application also occurs for attribute values: line breaks are replaced by spaces and, for attributes whose type is not CDATA, leading and trailing spaces are removed and sequences of spaces are replaced by a single space.
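
The end-of-line handling can be sketched with two regular expressions in Python. The names below are hypothetical, and the sketch covers only the line-end step, not attribute value processing. The pair U+000D U+0085 in the XML 1.1 pattern is taken from the XML 1.1 specification rather than from the description above.

```python
import re

# XML 1.0: CR LF, or a CR not followed by LF, becomes a single LF.
_XML10_EOL = re.compile("\r\n|\r")
# XML 1.1 additionally covers CR NEL, NEL and LINE SEPARATOR.
_XML11_EOL = re.compile("\r\n|\r\u0085|\r|\u0085|\u2028")


def normalize_line_ends(text: str, version: str = "1.0") -> str:
    """Replace end-of-line sequences by U+000A as an XML processor would."""
    pattern = _XML11_EOL if version == "1.1" else _XML10_EOL
    return pattern.sub("\n", text)
```

The two-character alternatives are listed before the single U+000D so that a pair is replaced by one line feed rather than two. U+2029 is left untouched in both versions.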

In XML, white space is purely syntactic inside tags, for example, to separate the element name from attributes, and between elements in element content models (as they are typical for data-oriented applications). White space in element content models is used to lay out the markup source, using line breaks and indentation, to improve readability. The same use of white space is possible in many cases in mixed content (typical for text-oriented applications).

Because XML is used for a very wide range of applications, after the processing steps mentioned above it passes all white space to the application. Some XML applications such as [XHTML] may have their own rules for processing white space characters. Also, applications and software transforming XML (e.g. [XSLT]) have specific conventions for how they handle white space, and specific ways to control this behavior. To use white space characters appropriately, readers are advised to examine all involved standards and software.

If the characters U+2028 and U+2029 appear in text, they may be treated as zero-width characters without semantic meaning (see Section 3.2).

Converting Newline Functions to White Space

White space that is not purely syntactic, including control codes that define a newline function (see Section 5.8, Newline Guidelines, in [Unicode]), can be handled in three main ways.

For data-oriented applications, the textual content of elements is treated according to the needs of the data type in question. In many cases, processing by the application includes aspects similar to those of the processing of attribute values by the XML parser itself. For some types of data, in particular small data items, some applications may also simply prohibit the use of white space.
For running text in text-oriented applications, reflowing is used, i.e. the line breaks in the markup source are removed and the text is reflowed into lines whose length is determined by the output medium and styling properties. In the context of Unicode, this reflowing process requires care; it is described in more detail below.
For preformatted text, such as program source code, line breaks must be preserved. Text-oriented applications usually contain special markup for preformatted text, e.g. <xhtml:pre>. XML itself defines an xml:space attribute that applications may use for a similar purpose.

When reflowing, line breaks and adjacent white space can be treated as space, removed, collapsed with adjacent control characters of the same type, or treated as zero-width space. Which choice is appropriate depends on the script of the surrounding text. The assumption is that line breaks and adjacent white space (in particular following white space, used for indentation) were added to make the markup source more readable, in particular to make each line fit on a line of a plain text editor. For scripts that use spaces, line breaks will have been inserted where there originally was a space; treating them as spaces therefore preserves the intended separation between words. For scripts which do not use spaces, such as ideographic scripts or certain South East Asian scripts, such as Thai, line feeds should be removed, or replaced by U+200B ZERO WIDTH SPACE. The choice of treatment can depend on the script value of the characters preceding and following the line feed character, assuming these characters belong to the same run of text.

The Unicode Standard [Unicode] specifies that the zero width space is considered a valid line-break point, that if two characters with a zero width space in between are placed on the same line they are placed with no space between them, and that if they are placed on two lines no additional glyph area is created at the line-break.
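
The script-dependent treatment described above might be sketched in Python as follows. This is an illustration under simplifying assumptions, not a conforming implementation. The helper uses_spaces is a crude, hypothetical stand-in for a real script lookup: it treats wide and fullwidth East Asian characters and the Thai block as belonging to scripts without spaces. The sketch replaces a line break by U+200B only when the characters on both sides belong to such scripts, and by a space otherwise.

```python
import re
import unicodedata

ZWSP = "\u200B"
# A line break together with any adjacent syntactic white space.
_BREAK = re.compile(r"[ \t]*(?:\r\n|\r|\n)[ \t\r\n]*")


def uses_spaces(ch: str) -> bool:
    """Crude stand-in for a script lookup (an assumption of this sketch)."""
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return False
    return not (0x0E00 <= ord(ch) <= 0x0E7F)  # Thai block


def reflow(text: str) -> str:
    """Replace each line break and adjacent white space by a space or ZWSP."""

    def replace(match: re.Match) -> str:
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before and after and not uses_spaces(before) and not uses_spaces(after):
            return ZWSP
        return " "

    return _BREAK.sub(replace, text)
```

Choosing U+200B rather than removing the line break entirely keeps a line-break opportunity at the original position. An implementation could equally remove it, as the text above allows.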

The details of reflowing are the responsibility of the various markup applications (e.g. [XHTML]). However, there is a tendency to move this functionality from markup applications to styling, so that it can be shared across applications.

Authors should be aware that the script-specific treatment of line breaks described above is not yet available in all implementations (e.g. browsers). For scripts that do not use white space to separate words, it may therefore still be advisable not to split long lines.

Editing tools should try to support the user in the appropriate use of white space. Some white space characters cannot easily be entered via a keyboard, but some others, e.g. U+3000 IDEOGRAPHIC SPACE, can. Editing tools should try to make sure that only line breaks and white space that are accepted as syntactic white space by the relevant markup language are used to improve markup source readability.

While the styling possibilities provided by CSS and its implementations have not reached the level of professional typesetting systems, they offer a wide range of ways to control layout and spacing of text. A very simple example is text centering, which would have been done by inserting an appropriate number of spaces on each line in pure plain text.

This document has been withdrawn

Introduction

Notation