├── .github
    └── FUNDING.yml
├── .gitignore
├── README.md
├── index.bs
└── index.html


/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 | 
3 | github: [Ygg01]
4 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | index.bk


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | Draft for XML5 proposal.
 2 | ==========
 3 | 
 4 | The original version of this proposal is hosted in the [XML5 code repository](https://github.com/annevk/xml5).
 5 | 
 6 | What's XML5
 7 | ==========
 8 | 
 9 | XML5 is, simply put, a more relaxed version of XML syntax: take the best aspects of HTML(5) syntax and apply them to XML. In practice this means the following:
10 | 
11 | * DOCTYPE is simplified and optional, and there are no known "billion laughs" style exploits.
12 | * Detailed rules for state transitions are defined, to simplify implementations.
13 | * Mixed content is allowed everywhere: `<text>A <b>bold</b> new world</text>` becomes:
14 | ```
15 | Tag('text')
16 |  |
17 |  +---Text('A ')
18 |  |
19 |  +---Tag('b')
20 |  |    |
21 |  |    +---Text('bold')
22 |  |
23 |  +---Text(' new world')
24 | ```
25 | * A character that would normally have to be escaped is escaped automatically when written inside a tag: `<tag>Tom & Jerry</tag>`, when parsed, becomes:
26 | ```
27 | Tag('tag')
28 |  |
29 |  +---Text('Tom & Jerry')
30 | ```
31 | * XML comments are a lot more flexible (and a bit more difficult to parse); for example, `--` is allowed in comments. Nested comments are prohibited.
32 | 
33 | XML5 strives to throw the notion of well-formedness out of the window and replace it with HTML-like error handling. An XML5 parser will be able to parse any XML 1.0 or XML 1.1 document, but a valid XML5 document is not guaranteed to be parseable by an XML 1.0 or even an XML 1.1 parser.
34 | The motivation is to make it easier to parse XML that is generated by string concatenation, which is common on the web.
35 | 
36 | How to read this?
37 | =================
38 | 
39 | Here is the [rendered version](https://ygg01.github.io/xml5_draft/).
40 | 
41 | Otherwise you can download and compile the source yourself:
42 | 
43 | 1. [Install bikeshed](https://github.com/tabatkins/bikeshed/blob/master/docs/install.md)
44 | 2. Clone the repository ````$ git clone https://github.com/Ygg01/xml5_draft````
45 | 3. Change to the xml5_draft directory ````$ cd xml5_draft````
46 | 4. Run bikeshed ````$ bikeshed````
47 | 


--------------------------------------------------------------------------------
/index.bs:
--------------------------------------------------------------------------------
   1 |
   2 | Title: XML5 Standard
   3 | H1: XML5
   4 | Status: LS
   5 | Logo: https://resources.whatwg.org/logo.svg
   6 | Shortname: xml5
   7 | Level:1
   8 | Editor: Anne van Kesteren, Mozilla, < annevk@annevk.nl >
   9 | Abstract: XML with well-defined error handling.
  10 | Editor: Daniel Fath, Unaffiliated,  < daniel.fath7@gmail.com >
  11 | Group: WGORWHATEVER
  12 | 
13 | 66 | 67 |

68 | Parsing XML documents 69 |

70 | 71 | This section and its subsections define the XML5 parser. 72 | 73 |

This specification defines the parsing rules for XML documents, whether they are syntactically correct or not. 74 | Certain points in the parsing algorithm are said to be parse errors. The handling for 75 | parse errors is well-defined: user agents must either act as described below when encountering such problems, or 76 | must terminate processing at the first error that they encounter for which they do not wish to apply the rules 77 | described below.

78 | 79 |

80 | Overview 81 |

82 | 83 | The input to the XML parsing process consists of a stream of octets which is converted to a stream of code points, which in turn are tokenized, and finally those tokens are used to construct a tree. 84 | 85 |

86 | Parse Errors 87 |

This specification defines the parsing rules for XML5 documents, whether they are syntactically correct or not.
Certain points in the parsing algorithm are said to be parse errors.
The error handling for parse errors is well-defined (it consists of the processing rules described throughout this specification),
but user agents, while parsing an XML5 document, may abort the parser at the first parse error that they encounter for
which they do not wish to apply the rules described in this specification.

<table>
  <thead>
    <tr>
      <th>Code</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>abrupt-closing-of-empty-comment</td>
      <td>This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (&gt;) code
        point (i.e., &lt;!--&gt; or &lt;!---&gt;). The parser behaves as if the comment is closed correctly.</td>
    </tr>
    <tr>
      <td>abrupt-closing-xml-declaration</td>
      <td>This error occurs if the parser encounters an unclosed quote in an XML declaration
        (e.g., &lt;?xml version="1?&gt;).</td>
    </tr>
    <tr>
      <td>colon-before-attr</td>
      <td>This error occurs if the parser encounters a U+003A COLON (:) in a tag after the tag name but before an
        attribute name (e.g., &lt;tag :attr). Attribute names can contain a namespace prefix followed by a
        U+003A COLON, but the prefix can't be empty.</td>
    </tr>
    <tr>
      <td>eof-in-cdata</td>
      <td>This error occurs if the parser encounters the end of the input stream in a CDATA section.
        The parser treats such CDATA sections as if they are closed immediately before the end of the input stream.</td>
    </tr>
    <tr>
      <td>eof-in-comment</td>
      <td>This error occurs if the parser encounters the end of the input stream in a comment.
        The parser treats such comments as if they are closed immediately before the end of the input stream.</td>
    </tr>
    <tr>
      <td>eof-in-doctype</td>
      <td>This error occurs if the parser encounters the end of the input stream in a DOCTYPE section.</td>
    </tr>
    <tr>
      <td>eof-in-tag</td>
      <td>This error occurs if the parser encounters the end of the input stream in a start tag or an end tag
        (e.g., &lt;div id=). Such a tag is ignored.</td>
    </tr>
    <tr>
      <td>eof-in-xml-declaration</td>
      <td>This error occurs if the parser encounters the end of the input stream in an XML declaration
        (e.g., &lt;?xml).</td>
    </tr>
    <tr>
      <td>incorrectly-opened-comment</td>
      <td>This error occurs if the parser encounters the &lt;! code point sequence that is not immediately
        followed by two U+002D (-) code points and that is not the start of a DOCTYPE or a CDATA section.</td>
    </tr>
    <tr>
      <td>invalid-xml-declaration</td>
      <td>This error occurs if the parser encounters an unexpected code point sequence in an XML declaration
        (e.g., an attribute name other than version, encoding, or standalone). In such a case, the parser
        falls back to treating the declaration as a processing instruction.</td>
    </tr>
    <tr>
      <td>missing-whitespace-before-doctype-name</td>
      <td>This error occurs if the DOCTYPE keyword and name are not separated by ASCII whitespace
        (e.g. &lt;!DOCTYPE). In this case the parser behaves as if ASCII whitespace is present.</td>
    </tr>
    <tr>
      <td>missing-doctype-name</td>
      <td>This error occurs if the parser encounters a DOCTYPE that is missing a name (e.g., &lt;!DOCTYPE&gt;).</td>
    </tr>
  </tbody>
</table>
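The two options left to user agents (apply the recovery rules, or stop at the first parse error) can be sketched as follows. This is a minimal illustration, not part of the specification: the names `ParseError`, `ErrorSink`, `ParserAborted` and `abort_on_error` are hypothetical, and only a few of the error codes listed above are included.

```python
from enum import Enum
from typing import List


class ParseError(Enum):
    # A small subset of the parse error codes; the enum itself is hypothetical.
    ABRUPT_CLOSING_OF_EMPTY_COMMENT = "abrupt-closing-of-empty-comment"
    EOF_IN_COMMENT = "eof-in-comment"
    EOF_IN_TAG = "eof-in-tag"


class ParserAborted(Exception):
    """Raised when a user agent chooses to stop at a parse error."""


class ErrorSink:
    """Collects parse errors and optionally aborts on the first one."""

    def __init__(self, abort_on_error: bool = False) -> None:
        self.abort_on_error = abort_on_error
        self.errors: List[ParseError] = []

    def report(self, error: ParseError) -> None:
        # Always record the error, then either continue (recover) or abort.
        self.errors.append(error)
        if self.abort_on_error:
            raise ParserAborted(error.value)
```

A lenient consumer would create `ErrorSink()` and keep parsing, while a strict one would create `ErrorSink(abort_on_error=True)` and catch `ParserAborted`.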

193 | Input stream 194 |

195 | 196 | The stream of Unicode characters that constitutes the input to the tokenization stage will initially be seen by the user agent as a stream of octets (typically coming over the network or from the local file system). The octets encode Unicode code points according to a particular encoding, which the user agent must use to decode the octets into code points. 197 | 198 |

Define how to find the encoding

199 |

Decide how to deal with null values

200 | 201 | 202 |

203 | Tokenization 204 |

205 | 206 | Implementations must act as if they used the following state machine to tokenize 207 | XML5. The state machine must start in the data state. Most states consume a 208 | single character, which may have various side-effects, and either switch the 209 | state machine to a new state to reconsume the current input character, 210 | switch it to a new state to consume the next character, or stay in the same 211 | state to consume the next character. Some states have more complicated behavior 212 | and can consume several characters before switching to another state. In some 213 | cases, the tokenizer state is also changed by the tree construction stage. 214 | 215 | When a state says to reconsume a matched character in a specified state, that 216 | means to switch to that state, but when it attempts to consume the next input 217 | character, provide it with the current input character instead. 218 | 219 | The next input character is the first character in the input stream that has 220 | not yet been consumed or explicitly ignored by the requirements in this 221 | section. Initially, the next input character is the first character in the 222 | input. The current input character is the last character to have been consumed. 223 | 224 |
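The notions of next input character, current input character and reconsuming can be illustrated with a small input stream wrapper. This is only a sketch under assumptions: the class name `InputStream`, its methods, and the use of `None` as the EOF marker are not defined by this specification.

```python
from typing import Optional

EOF = None  # sentinel for end of input (an assumption of this sketch)


class InputStream:
    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0                # index of the next input character
        self._reconsume = False      # set when a state asks to reconsume
        self.current: Optional[str] = None  # the current input character

    def consume(self) -> Optional[str]:
        """Return the next input character, or the current one if reconsuming."""
        if self._reconsume:
            # Provide the current input character again instead of advancing.
            self._reconsume = False
            return self.current
        if self._pos < len(self._text):
            self.current = self._text[self._pos]
            self._pos += 1
        else:
            self.current = EOF
        return self.current

    def reconsume(self) -> None:
        """Ask that the next call to consume() returns the current character."""
        self._reconsume = True
```

With this shape, a state that says "reconsume in the X state" would call `reconsume()` and then switch to X, so that X sees the same character again.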

Decide how to deal with namespaces

225 | 226 |
227 |

228 | Data state 229 |

230 | 231 |
232 | 233 | Consume the next input character: 234 | 235 |
236 |
U+0026 AMPERSAND (&) 237 |
Switch to character reference in data state.
238 | 239 |
U+003C LESS-THAN SIGN (<)
240 |
Switch to the tag open state.
241 | 242 |
EOF
243 |
Emit an end-of-file token.
244 | 245 |
Anything else
246 |
Emit the current input character as a character token. Stay in this state.
247 |
248 |
249 | 250 |
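As an illustration, the data state can be written as a function that takes one input character and returns the name of the next state. This is a hedged sketch: the function name, the `(kind, data)` token tuples, the `"done"` pseudo-state and the use of `None` for EOF are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (kind, data); a hypothetical token representation


def data_state(ch: Optional[str], out: List[Token]) -> str:
    """Handle one character in the data state and return the next state's name."""
    if ch == "&":
        return "character reference in data"
    if ch == "<":
        return "tag open"
    if ch is None:  # EOF
        out.append(("eof", ""))
        return "done"  # not a spec state; marks the end of tokenization here
    # Anything else: emit the character and stay in the data state.
    out.append(("character", ch))
    return "data"
```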

251 | Character reference in data state 252 |

253 | 254 |
255 | Switch to the data state. 256 | 257 | Attempt to consume a character reference. 258 | 259 | If nothing is returned, emit a U+0026 AMPERSAND (&) character token. 260 | 261 | Otherwise, emit the character tokens that were returned. 262 |
263 | 264 |

265 | Tag open state 266 |

267 | 268 |
269 | Consume the next input character: 270 |
271 |
U+002F SOLIDUS (/)
272 |
Switch to the end tag open state.
273 | 274 |
U+003F QUESTION MARK (?)
275 | 276 |
Switch to the pi state.
277 | 278 |
U+0021 EXCLAMATION MARK (!)
279 |
Switch to the markup declaration state.
280 | 281 |
U+0009 CHARACTER TABULATION (Tab)
282 |
U+000A LINE FEED (LF)
283 |
U+0020 SPACE (Space)
284 |
U+003A COLON (:)
285 |
U+003C LESS-THAN SIGN (<)
286 |
U+003E GREATER-THAN SIGN (>)
287 |
EOF
288 | 289 |
Parse error. Emit a U+003C LESS-THAN SIGN (<) character token. 290 | Reconsume the current input character in the data state. 291 |
292 | 293 |
Anything else
294 | 295 |
Create a new tag token, then reconsume the current input character 296 | in the tag name state. 297 |
298 |
299 |
300 | 301 |
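The dispatch performed by the tag open state can be sketched as a function that returns the next state together with a flag saying whether the current character has to be reconsumed. The function name, the token tuples and the `None` EOF marker are assumptions of this sketch rather than anything the specification prescribes.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (kind, data); a hypothetical token representation


def tag_open_state(ch: Optional[str], out: List[Token]) -> Tuple[str, bool]:
    """Return (next state, reconsume current character?) for the tag open state."""
    if ch == "/":
        return ("end tag open", False)
    if ch == "?":
        return ("pi", False)
    if ch == "!":
        return ("markup declaration", False)
    if ch is None or ch in ("\t", "\n", " ", ":", "<", ">"):
        # Parse error: the "<" was plain text after all.
        out.append(("parse-error", ""))
        out.append(("character", "<"))
        return ("data", True)
    # Anything else starts a tag name.
    out.append(("new-tag", ""))
    return ("tag name", True)
```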

302 | End tag open state 303 |

304 | 305 |
306 | 307 | Consume the next input character: 308 |
309 |
U+003E GREATER-THAN SIGN (>)
310 |
Emit a short end tag token and then switch to the data 311 | state. 312 |
313 | 314 |
U+0009 CHARACTER TABULATION (Tab)
315 |
U+000A LINE FEED (LF)
316 |
U+0020 SPACE (Space)
317 |
U+003C LESS-THAN SIGN (<)
318 |
U+003A COLON (:)
319 |
EOF
320 |
Parse error. Emit a U+003C LESS-THAN SIGN (<) character 321 | token and a U+002F SOLIDUS (/) character token. Reconsume the current 322 | input character in the data state. 323 |
324 | 325 |
Anything else
326 | 327 |
Create an end tag token, then reconsume the current input character in the end tag name 328 | state. 329 |
330 |
331 |
332 | 333 |

334 | End tag name state 335 |

336 | 337 |
338 | 339 | Consume the next input character: 340 |
341 |
U+0009 CHARACTER TABULATION (Tab)
342 |
U+000A LINE FEED (LF)
343 |
U+0020 SPACE (Space)
344 |
Switch to the end tag name after state.
345 | 346 |
U+002F SOLIDUS (/)
347 |
Parse error. Switch to the end tag name after state.
348 | 349 |
EOF
350 |
Parse error. Emit the end tag token and then reprocess the 351 | current input character in the data state. 352 |
353 | 354 |
U+003E GREATER-THAN SIGN (>)
355 |
Emit the end tag token and then switch to the data 356 | state. 357 |
358 | 359 |
Anything else
360 |
Append the current input character to the tag name and stay in the 361 | current state. 362 |
363 |
364 |
365 | 366 |

367 | End tag name after state 368 |

369 | 370 |
371 | 372 | Consume the next input character: 373 |
374 |
U+003E GREATER-THAN SIGN (>)
375 |
Emit the end tag token and then switch to the data state.
376 | 377 |
U+0009 CHARACTER TABULATION (Tab)
378 |
U+000A LINE FEED (LF)
379 |
U+0020 SPACE (Space)
380 |
Stay in the current state.
381 | 382 |
EOF
383 |
Parse error. Emit the current token and then reprocess the 384 | current input character in the data state. 385 |
386 | 387 |
Anything else
388 |
Parse error. Stay in the current state.
389 |
390 |
391 | 392 |

393 | Tag name state 394 |

395 | 396 |
397 | Consume the next input character: 398 |
399 |
U+0009 CHARACTER TABULATION (Tab)
400 |
U+000A LINE FEED (LF)
401 |
U+0020 SPACE (Space)
402 |
Switch to the tag attribute name before state.
403 | 404 |
U+003E GREATER-THAN SIGN (>)
405 |
Emit the start tag token and then switch to the data state.
406 | 407 |
EOF
408 |
This is an eof-in-tag parse error. Emit the current token and then reprocess the 409 | current input character in the data state. 410 |
411 | 412 |
U+002F SOLIDUS (/)
413 |
Set current tag to empty tag. Switch to the empty tag state.
414 | 415 |
Anything else
416 |
Append the current input character to the tag name and stay in the 417 | current state. 418 |
419 |
420 |
421 | 422 | 423 |

424 | Empty tag state 425 |

426 | 427 |
428 | Consume the next input character: 429 | 430 |
431 |
U+003E GREATER-THAN SIGN (>)
432 |
Emit the current tag token as empty tag token and then switch to the 433 | data state. 434 |
435 | 436 |
Anything else
437 |
Reconsume in tag attribute value before state. 438 |
439 |
440 |
441 | 442 | 443 |

444 | Tag attribute name before state 445 |

446 | 447 |
448 | 449 | Consume the next input character: 450 | 451 |
452 |
U+0009 CHARACTER TABULATION (Tab)
453 |
U+000A LINE FEED (LF)
454 |
U+0020 SPACE (Space)
455 | 456 |
Stay in the current state.
457 | 458 |
U+003E GREATER-THAN SIGN (>)
459 |
Emit the current token and then switch to the data state.
460 | 461 |
U+002F SOLIDUS (/)
462 |
Set current tag to empty tag. Switch to the empty tag state.
463 | 464 |
U+003A COLON (:)
465 |
This is a colon-before-attr parse error. Stay in the current state.
466 | 467 |
EOF
468 |
This is an eof-in-tag parse error. Emit the current token and then reprocess the 469 | current input character in the data state. 470 |
471 | 472 |
Anything else
473 |
Start a new attribute in the current tag token. Set that attribute's 474 | name to the current input character and its value to the empty string and 475 | then switch to the tag attribute name state. 476 |
477 |
478 |
479 | 480 | 481 |

482 | Tag attribute name state 483 |

484 | 485 |
486 | 487 | Consume the next input character: 488 | 489 |
490 |
U+003D EQUALS SIGN (=)
491 |
Switch to the tag attribute value before state.
492 | 493 |
U+003E GREATER-THAN SIGN (>)
494 |
Emit the current token as start tag token. Switch to the data 495 | state. 496 |
497 | 498 |
U+0009 CHARACTER TABULATION (Tab)
499 |
U+000A LINE FEED (LF)
500 |
U+0020 SPACE (Space)
501 |
Switch to the tag attribute name after state.
502 | 503 |
U+002F SOLIDUS (/)
504 |
Set current tag to empty tag. Switch to the empty tag state.
505 | 506 |
EOF
507 |
This is an eof-in-tag parse error. Emit the current token as start tag token and 508 | then reprocess the current input character in the data 509 | state. 510 |
511 | 512 |
Anything else
513 |
Append the current input character to the current attribute's name. 514 | Stay in the current state. 515 |
516 |
517 | 518 | When the user agent leaves this state (and before emitting the tag token, 519 | if appropriate), the complete attribute's name must be 520 | compared to the other attributes on the same token; if there is already an 521 | attribute on the token with the exact same name, then this is a parse error 522 | and the new attribute must be dropped, along with the 523 | value that gets associated with it (if any). 524 | 525 |
526 | 527 | 528 |
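The duplicate-attribute rule above can be illustrated with a small helper. This is a simplified sketch: the function name `finish_attribute` and the `"duplicate-attribute"` label are hypothetical, and the check is performed once the value is known rather than at the exact moment the tokenizer leaves the attribute name state.

```python
from typing import List, Tuple

Attribute = Tuple[str, str]  # (name, value)


def finish_attribute(attrs: List[Attribute], name: str, value: str,
                     errors: List[str]) -> None:
    """Add (name, value) to attrs unless an attribute with that exact name exists."""
    if any(existing == name for existing, _ in attrs):
        # Parse error: the new attribute and its value are dropped.
        errors.append("duplicate-attribute")  # hypothetical label
        return
    attrs.append((name, value))
```

Under this rule the first occurrence wins, so `id="a" id="b"` leaves a single `id` attribute with the value `a`.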

529 | Tag attribute name after state 530 |

531 | 532 |
533 | 534 | Consume the next input character: 535 | 536 |
537 |
U+0009 CHARACTER TABULATION (Tab)
538 |
U+000A LINE FEED (LF)
539 |
U+0020 SPACE (Space)
540 |
Stay in the current state.
541 | 542 |
U+003D EQUALS SIGN (=)
543 |
Switch to the tag attribute value before state.
544 | 545 |
U+003E GREATER-THAN SIGN (>)
546 |
Emit the current token and then switch to the data state.
547 | 548 |
U+002F SOLIDUS (/)
549 |
Set current tag to empty tag. Switch to the empty tag state.
550 | 551 |
EOF
552 |
This is an eof-in-tag parse error. Emit the current token and then reprocess the 553 | current input character in the data state. 554 |
555 | 556 |
Anything else
557 |
Start a new attribute in the current tag token. Set that attribute's 558 | name to the current input character and its value to the empty string and 559 | then switch to the tag attribute name state. 560 |
561 |
562 |
563 | 564 | 565 |

566 | Tag attribute value before state 567 |

568 | 569 |
570 | Consume the next input character: 571 |
572 |
U+0009 CHARACTER TABULATION (Tab)
573 |
U+000A LINE FEED (LF)
574 |
U+0020 SPACE (Space)
575 |
Stay in the current state.
576 | 577 |
U+0022 QUOTATION MARK (")
578 |
Switch to the tag attribute value double quoted state.
579 | 580 |
U+0027 APOSTROPHE (')
581 |
Switch to the tag attribute value single quoted state.
582 | 583 |
U+0026 AMPERSAND (&): 584 |
Reprocess the input character in the tag attribute value unquoted 585 | state. 586 |
587 | 588 |
U+003E GREATER-THAN SIGN (>)
589 |
Emit the current token and then switch to the data state.
590 | 591 |
EOF
592 |
This is an eof-in-tag parse error. Emit the current token and then reprocess the 593 | current input character in the data state. 594 |
595 | 596 |
Anything else
597 |
Append the current input character to the current attribute's value and 598 | then switch to the tag attribute value unquoted state. 599 |
600 |
601 |
602 | 603 |

604 | Tag attribute value double quoted state 605 |

606 |
607 | Consume the next input character: 608 | 609 |
610 |
U+0022 QUOTATION MARK (")
611 |
Switch to the tag attribute name before state.
612 | 613 |
U+0026 AMPERSAND (&)
614 |
Switch to character reference in attribute value state, with the 615 | additional allowed character being U+0022 QUOTATION MARK("). 616 |
617 | 618 |
EOF
619 |
This is an eof-in-tag parse error. Emit the current token and then reprocess the 620 | current input character in the data state. 621 |
622 | 623 |
Anything else
624 |
Append the input character to the current attribute's value. Stay in 625 | the current state. 626 |
627 |
628 |
629 | 630 | 631 |

632 | Tag attribute value single quoted state 633 |

634 |
635 | Consume the next input character: 636 |
637 |
U+0027 APOSTROPHE (')
638 |
Switch to the tag attribute name before state.
639 | 640 |
U+0026 AMPERSAND (&)
641 |
Switch to character reference in attribute value state, with the 642 | additional allowed character being U+0027 APOSTROPHE ('). 643 |
644 | 645 |
EOF
646 |
This is an eof-in-tag parse error. Emit the current token and then reprocess the 647 | current input character in the data state. 648 |
649 | 650 |
Anything else
651 |
Append the input character to the current attribute's value. Stay in 652 | the current state. 653 |
654 |
655 |
656 | 657 | 658 |

659 | Tag attribute value unquoted state 660 |

661 |
662 | Consume the next input character: 663 |
664 |
U+0009 CHARACTER TABULATION (Tab)
665 |
U+000A LINE FEED (LF)
666 |
U+0020 SPACE (Space)
667 |
Switch to the tag attribute name before state.
668 | 669 |
U+0026 AMPERSAND (&): 670 |
671 | Switch to character reference in attribute value state, with the 672 | additional allowed character being U+003E GREATER-THAN SIGN(>). 673 |
674 | 675 |
U+003E GREATER-THAN SIGN (>)
676 |
Emit the current token as start tag token and then switch to the 677 | data state. 678 |
679 | 680 |
EOF
681 |
This is an eof-in-tag parse error. Emit the current token as start tag token and 682 | then reprocess the current input character in the 683 | data state. 684 |
685 | 686 |
Anything else
687 |
Append the input character to the current attribute's value. Stay in 688 | the current state. 689 |
690 |
691 |
692 | 693 | 694 |

695 | Pi state 696 |

697 | 698 |
699 | If the next few characters are: 700 |
701 |
Exact match for word "xml".
702 |
703 | Consume those characters and switch to xml declaration state 704 |
705 | 706 |
U+0009 CHARACTER TABULATION (Tab)
707 |
U+000A LINE FEED (LF)
708 |
U+0020 SPACE (Space)
709 |
EOF
710 |
Parse error. Reconsume current input characters in the 711 | bogus comment state. 712 |
713 | 714 |
Anything else
715 |
Create a new processing instruction token. Reconsume current characters in pi 716 | target state. 717 |
718 |
719 |
720 | 721 |

722 | XML declaration state 723 |

724 | 725 |
726 | Consume the next input character: 727 |
728 |
U+0009 CHARACTER TABULATION (Tab)
729 |
U+000A LINE FEED (LF)
730 |
U+0020 SPACE (Space)
731 |
Stay in current state
732 |
U+0076 LATIN SMALL LETTER V (v)
733 |
U+0065 LATIN SMALL LETTER E (e)
734 |
U+0073 LATIN SMALL LETTER S (s)
735 |
Reconsume current character in XML declaration attribute name state
736 |
737 | U+003F QUESTION MARK (?) 738 |
739 |
Switch to XML Declaration after state.
740 |
EOF
741 |
This is an eof-in-xml-declaration parse error. Append the string "xml" 742 | to the processing instruction target, emit the current processing instruction token and emit an end-of-file 743 | token. 744 |
745 |
Anything else
746 |
This is an invalid-xml-declaration parse error. Append string "xml" 747 | to the processing instruction target, then reconsume current character in pi data state
748 |
749 |
750 | 751 |

752 | XML declaration attribute name state 753 |

754 | 755 |
756 | If the next few characters are: 757 |
758 |
Exact match for word "version".
759 |
760 | Set current xml declaration attribute name to version. Switch to XML declaration attribute name 761 | after. 762 |
763 | 764 |
Exact match for word "encoding".
765 |
766 | Set current xml declaration attribute name to encoding. Switch to XML declaration attribute name 767 | after. 768 |
769 | 770 |
Exact match for word "standalone".
771 |
772 | Set current xml declaration attribute name to standalone. Switch to XML declaration attribute name 773 | after. 774 |
775 | 776 |
Anything else
777 |
This is an invalid-xml-declaration parse error. Switch to pi target state
778 |
779 |
780 | 781 |

782 | XML declaration attribute name after 783 |

784 | 785 |
786 | Consume the next input character: 787 |
788 |
U+0009 CHARACTER TABULATION (Tab)
789 |
U+000A LINE FEED (LF)
790 |
U+0020 SPACE (Space)
791 |
Stay in current state.
792 | 793 |
U+003D EQUALS SIGN (=)
794 |
Switch to XML declaration attribute before value state.
795 | 796 |
EOF
797 |
This is an eof-in-xml-declaration parse error. Push to processing instruction target 798 | xml, then push to processing instruction data version=. 799 | Emit processing instruction token. 800 |
801 | 802 |
Anything else
803 |
This is an invalid-xml-declaration parse error. Push to processing instruction target 804 | xml, then push to processing instruction data version=. 805 | Reconsume in pi target state. 806 |
807 |
808 |
809 | 810 | 811 |

812 | XML declaration attribute before value state 813 |

814 | 815 |
816 | Consume the next input character: 817 |
818 |
U+0009 CHARACTER TABULATION (Tab)
819 |
U+000A LINE FEED (LF)
820 |
U+0020 SPACE (Space)
821 |
Stay in current state.
822 | 823 |
U+0027 APOSTROPHE (')
824 |
Switch to XML declaration attribute value (single-quoted) state.
825 | 826 |
U+0022 QUOTATION MARK (")
827 |
Switch to XML declaration attribute value (double-quoted) state.
828 | 829 |
EOF
830 |
This is an eof-in-xml-declaration parse error. Push to processing instruction target 831 | xml, then push to processing instruction data version=. 832 | Emit processing instruction token. 833 |
834 | 835 |
Anything else
836 |
This is an invalid-xml-declaration parse error. Push to processing instruction target 837 | xml, then push to processing instruction data version=. 838 | Reconsume in pi target state. 839 |
840 |
841 | 842 | 843 | 844 | 845 |

846 | XML declaration attribute value (single-quoted) state 847 |

848 | 849 |
850 | Consume the next input character: 851 |
852 |
U+0027 APOSTROPHE (')
853 |
Switch to XML declaration state.
854 | 855 |
U+003F QUESTION MARK (?)
856 |
This is an abrupt-closing-xml-declaration parse error. 857 | Switch to XML Declaration after state. 858 |
859 | 860 |
EOF
861 |
This is an eof-in-xml-declaration parse error. Emit current xml declaration. 862 | Emit end-of-file token. 863 |
864 | 865 |
Anything else
866 |
This is an invalid-xml-declaration parse error. Switch to pi target state 867 |
868 |
869 |
870 | 871 |

872 | XML declaration attribute value (double-quoted) state 873 |

874 | 875 |
876 | Consume the next input character: 877 |
878 |
U+0022 QUOTATION MARK (")
879 |
Switch to XML declaration state.
880 | 881 |
U+003F QUESTION MARK (?)
882 |
This is an abrupt-closing-xml-declaration parse error. 883 | Switch to XML Declaration after state. 884 |
885 | 886 |
EOF
887 |
This is an eof-in-xml-declaration parse error. Emit current xml declaration. 888 | Emit end-of-file token. 889 |
890 | 891 |
Anything else
892 |
This is an invalid-xml-declaration parse error. Switch to pi target state 893 |
894 |
895 |
896 | 897 |

898 | XML declaration after state 899 |

900 | 901 |
902 | Consume the next input character: 903 |
904 |
U+003E GREATER-THAN SIGN (>)
905 |
Emit the xml declaration token and then switch to the data state.
906 | 907 |
U+003F QUESTION MARK (?)
908 |
Append the current input character to the PI's data and stay in the 909 | current state. 910 |
911 | 912 |
Anything else
913 |
Reprocess the current input character in the pi data 914 | state. 915 |
916 |
917 |
918 | 919 |

920 | Pi target state 921 |

922 | 923 |
924 | 925 | Consume the next input character: 926 |
927 |
U+0009 CHARACTER TABULATION (Tab)
928 |
U+000A LINE FEED (LF)
929 |
U+0020 SPACE (Space)
930 |
Switch to the pi target after state.
931 | 932 |
EOF
933 |
Parse error. Emit the current processing instruction token and then reprocess the 934 | current input character in the data state. 935 |
936 | 937 |
U+003F QUESTION MARK (?)
938 |
Switch to the pi after state.
939 | 940 |
Anything else
941 |
Append the current input character to the processing instruction target and stay in the 942 | current state. 943 |
944 |
945 |
946 | 947 |

948 | Pi target after state 949 |

950 | 951 |
952 | 953 | Consume the next input character: 954 |
955 |
U+0009 CHARACTER TABULATION (Tab)
956 |
U+000A LINE FEED (LF)
957 |
U+0020 SPACE (Space)
958 |
Stay in the current state.
959 | 960 |
Anything else
961 |
Reprocess the current input character in the pi data 962 | state. 963 |
964 |
965 |
966 | 967 |

968 | Pi data state 969 |

970 | 971 |
972 | 973 | Consume the next input character: 974 |
975 |
U+003F QUESTION MARK (?)
976 |
Switch to the pi after state.
977 | 978 |
EOF
979 |
This is an eof-in-cdata parse error. Emit the current processing instruction token 980 | and then 981 | reprocess the 982 | current input character in the data state. 983 |
984 | 985 |
Anything else
986 |
Append the current input character to the pi's data and stay in the 987 | current state. 988 |
989 |
990 |
991 | 992 |

993 | Pi after state 994 |

995 | 996 |
997 | 998 | Consume the next input character: 999 |
1000 |
U+003E GREATER-THAN SIGN (>)
1001 |
Emit the current token and then switch to the data state.
1002 | 1003 |
U+003F QUESTION MARK (?)
1004 |
Append the current input character to the PI's data and stay in the 1005 | current state. 1006 |
1007 | 1008 |
Anything else
1009 |
Reprocess the current input character in the pi data 1010 | state. 1011 |
1012 |
1013 |
1014 | 1015 |
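The four processing instruction states above (pi target, pi target after, pi data and pi after) can be combined into one small loop. The following is a sketch that follows the rules literally; the function name `tokenize_pi`, its return shape and the use of `None` for EOF are assumptions, and the input is taken to start right after `<?` with a target other than `xml`.

```python
from typing import List, Optional, Tuple

WHITESPACE = ("\t", "\n", " ")


def tokenize_pi(text: str) -> Tuple[str, str, List[str], int]:
    """Tokenize a processing instruction; `text` starts right after "<?".

    Returns (target, data, parse errors, number of characters consumed).
    """
    target, data, errors = "", "", []
    state, pos = "pi target", 0
    while True:
        ch: Optional[str] = text[pos] if pos < len(text) else None  # None is EOF
        pos += 1
        if state == "pi target":
            if ch in WHITESPACE:
                state = "pi target after"
            elif ch is None:
                errors.append("parse-error")
                return target, data, errors, pos - 1
            elif ch == "?":
                state = "pi after"
            else:
                target += ch
        elif state == "pi target after":
            if ch not in WHITESPACE:
                state, pos = "pi data", pos - 1  # reprocess in the pi data state
        elif state == "pi data":
            if ch == "?":
                state = "pi after"
            elif ch is None:
                errors.append("eof-in-cdata")  # the code named by the pi data state
                return target, data, errors, pos - 1
            else:
                data += ch
        else:  # pi after
            if ch == ">":
                return target, data, errors, pos
            elif ch == "?":
                data += ch
            else:
                state, pos = "pi data", pos - 1  # reprocess in the pi data state
```

For example, `tokenize_pi("target some data?>rest")` yields the target `target` and the data `some data`, leaving `rest` unconsumed. Note that, read literally, the rules do not append a lone `?` that is followed by anything other than `?` or `>`, so in this sketch `a b?c?>` produces the data `bc`.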

1016 | Markup declaration state 1017 |

1018 | 1019 |
1020 | 1021 | If the next few characters are: 1022 | 1023 |
1024 |
Two U+002D HYPHEN-MINUS characters (-)
1025 |
Consume those two characters, create a comment token whose data is the empty string and switch 1026 | to comment 1027 | start state. 1028 |
1029 | 1030 |
Exact match for word "DOCTYPE"
1031 |
Consume those characters and switch to Doctype state
1032 | 1033 |
Exact match for word "[CDATA[" (the five uppercase letters 1034 | "CDATA" with a U+005B LEFT 1035 | SQUARE BRACKET character before and after) 1036 |
1037 |
Consume those characters and switch to CDATA state
1038 | 1039 |
Anything else
1040 |
This is an incorrectly-opened-comment parse error. Create a comment token whose data 1041 | is the 1042 | empty string. 1043 | Switch to the bogus comment state 1044 | (don't consume any characters). 1045 |
1046 |
1047 |
1048 | 1049 |

1050 | Comment start state 1051 |

1052 | 1053 |
1054 | Consume the next input character: 1055 | 1056 |
1057 |
U+002D HYPHEN-MINUS (-)
1058 |
Switch to comment start dash state
1059 | 1060 |
U+003E GREATER-THAN SIGN (>)
1061 |
This is an abrupt-closing-of-empty-comment parse error. Switch to data 1062 | state. 1063 | Emit the current comment token. 1064 |
1065 | 1066 |
Anything else
1067 |
Reconsume in the comment state
1068 |
1069 |
1070 | 1071 |

1072 | Comment start dash state 1073 |

1074 | 1075 |
1076 | Consume the next input character: 1077 | 1078 |
1079 |
U+002D HYPHEN-MINUS (-)
1080 |
Switch to comment end state
1081 | 1082 |
U+003E GREATER-THAN SIGN (>)
1083 |
This is an abrupt-closing-of-empty-comment parse error. Switch to data 1084 | state. 1085 | Emit the current comment token. 1086 |
1087 | 1088 |
EOF
1089 |
This is an eof-in-comment parse error. Emit the comment token. Emit an 1090 | end-of-file token. 1091 |
1092 | 1093 |
Anything else
1094 |
Append a U+002D HYPHEN-MINUS character (-) to the comment token's data. 1095 | Reconsume in the comment state. 1096 |
1097 |
1098 |
1099 | 1100 | 1101 |

1102 | Comment state 1103 |

1104 | 1105 |
1106 | 1107 | Consume the next input character: 1108 | 1109 |
1110 |
U+003C LESS-THAN SIGN (<)
1111 |
Append the current input character to the comment token's data. 1112 | Switch to the comment less-than sign state. 1113 |
1114 | 1115 |
U+002D HYPHEN-MINUS (-)
1116 |
Switch to the comment end dash state.
1117 | 1118 |
EOF
1119 |
This is an eof-in-comment parse error. Emit the current comment token. Emit an 1120 | end-of-file 1121 | token. 1122 |
1123 | 1124 |
Anything else
1125 |
Append the current input character to the comment token's data.
1126 |
1127 |
1128 | 1129 | 1130 |

1131 | Comment less-than sign state 1132 |

1133 | 1134 |
1135 | 1136 | Consume the next input character: 1137 | 1138 |
1139 |
U+0021 EXCLAMATION MARK (!)
1140 |
Append the current input character to the comment token's data. 1141 | Switch to the comment less-than sign bang state. 1142 |
1143 | 1144 |
U+003C LESS-THAN SIGN (<)
1145 |
Append the current input character to the comment token's data. 1146 |
1147 | 1148 |
Anything else
1149 |
Reconsume in the comment state.
1150 |
1151 |
1152 | 1153 |

1154 | Comment less-than sign bang state 1155 |

1156 | 1157 |
1158 | Consume the next input character: 1159 | 1160 |
1161 |
U+002D HYPHEN-MINUS (-)
1162 |
Switch to the comment less-than sign bang dash state.
1163 | 1164 |
Anything else
1165 |
Reconsume in the comment state.
1166 |
1167 |
1168 | 1169 |

1170 | Comment less-than sign bang dash state 1171 |

1172 | 1173 |
1174 | Consume the next input character: 1175 | 1176 |
1177 |
U+002D HYPHEN-MINUS (-)
1178 |
Switch to the comment less-than sign bang dash dash state.
1179 | 1180 |
Anything else
1181 |
Reconsume in the comment end dash state.
1182 |
1183 |
1184 | 1185 |

1186 | Comment less-than sign bang dash dash state 1187 | 1188 |

1189 | 1190 |
1191 | Consume the next input character: 1192 | 1193 |
1194 |
U+003E GREATER-THAN SIGN (>)
1195 |
EOF
1196 |
Reconsume in the comment end state.
1197 | 1198 |
Anything else
1199 |
Parse error. Reconsume in the comment end state. 1200 |
1201 |
1202 |
1203 | 1204 |

1205 | Comment end dash state 1206 |

1207 | 1208 |
1209 | 1210 | Consume the next input character: 1211 | 1212 |
1213 |
U+002D HYPHEN-MINUS (-)
1214 |
Switch to the comment end state.
1215 | 1216 |
EOF
1217 |
Parse error. Emit the comment token. Emit an end-of-file token.
1218 | 1219 |
Anything else
1220 |
Append a U+002D HYPHEN-MINUS (-) to the comment token's 1221 | data. Reconsume in the comment state. 1222 |
1223 |
1224 |
1225 | 1226 |

1227 | Comment end state 1228 |

1229 | 1230 |
1231 | 1232 | Consume the next input character: 1233 | 1234 |
1235 |
U+003E GREATER-THAN SIGN (>)
1236 |
Switch to the data state. Emit the comment token.
1237 | 1238 |
U+0021 EXCLAMATION MARK (!)
1239 |
Switch to the comment end bang state.
1240 | 1241 |
U+002D HYPHEN-MINUS (-)
1242 |
Append a U+002D HYPHEN-MINUS character (-) to the comment 1243 | token's data. 1244 |
1245 | 1246 |
EOF
1247 |
Parse error. Emit the comment token. Emit an end-of-file token. 1248 |
1249 | 1250 |
Anything else
1251 |
Append two U+002D (-) characters to the comment token's data. 1252 | Reconsume in the comment 1253 | state. 1254 |
1255 |
1256 |
1257 | 1258 |

1259 | Comment end bang state 1260 |

1261 | 1262 |
1263 | Consume the next input character: 1264 | 1265 |
1266 |
U+002D HYPHEN-MINUS (-)
1267 |
Append two U+002D HYPHEN-MINUS characters (-) and a U+0021 EXCLAMATION MARK 1268 | character (!) to the comment token's data. Switch to the comment end dash 1269 | state. 1270 |
1271 | 1272 |
U+003E GREATER-THAN SIGN (>)
1273 |
Parse error. Switch to the data state. Emit the comment token.
1274 | 1275 |
EOF
1276 |
Parse error. Emit the comment token. Emit an end-of-file token.
1277 | 1278 |
Anything else
1279 |
Append two U+002D (-) characters and a U+0021 EXCLAMATION MARK 1280 | character (!) to 1281 | the comment token's data. Reconsume in the comment 1282 | state. 1283 |
1284 |
1285 |
1286 | 1287 |
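The comment states above can be read as one small state machine. The following is a minimal sketch in Python, not a reference implementation: the function name `tokenize_comment`, the tuple it returns, and the generic `"parse-error"` label are hypothetical. It assumes the input starts right after an opening `<!--`. Where the wording of these states could be read as appending a character twice or dropping a hyphen, the sketch follows the behaviour of the HTML tokenizer; that choice is an assumption.

```python
EOF = None  # sentinel for the end of the input stream

def tokenize_comment(text):
    """Run the comment states over text that follows an opening '<!--'.

    Returns (comment data, number of characters consumed, parse errors).
    """
    data, errors = [], []
    state, pos = "comment", 0
    while True:
        ch = text[pos] if pos < len(text) else EOF
        pos += 1  # consume the next input character
        if state == "comment":
            if ch == "<":
                data.append(ch)
                state = "less-than sign"
            elif ch == "-":
                state = "end dash"
            elif ch is EOF:
                errors.append("eof-in-comment")
                break
            else:
                data.append(ch)
        elif state == "less-than sign":
            if ch == "!":
                data.append(ch)
                state = "less-than sign bang"
            elif ch == "<":
                data.append(ch)
            else:
                pos -= 1  # reconsume
                state = "comment"
        elif state == "less-than sign bang":
            if ch == "-":
                state = "less-than sign bang dash"
            else:
                pos -= 1
                state = "comment"
        elif state == "less-than sign bang dash":
            if ch == "-":
                state = "less-than sign bang dash dash"
            else:
                pos -= 1
                state = "end dash"
        elif state == "less-than sign bang dash dash":
            if ch != ">" and ch is not EOF:
                errors.append("parse-error")  # a nested comment opener
            pos -= 1
            state = "end"
        elif state == "end dash":
            if ch == "-":
                state = "end"
            elif ch is EOF:
                errors.append("parse-error")
                break
            else:
                data.append("-")
                pos -= 1
                state = "comment"
        elif state == "end":
            if ch == ">":
                break  # emit the comment token
            elif ch == "!":
                state = "end bang"
            elif ch == "-":
                data.append("-")
            elif ch is EOF:
                errors.append("parse-error")
                break
            else:
                data.append("--")  # assumption: the character itself is added on reconsume
                pos -= 1
                state = "comment"
        elif state == "end bang":
            if ch == "-":
                data.append("--!")  # assumption: both pending hyphens are kept
                state = "end dash"
            elif ch == ">" or ch is EOF:
                errors.append("parse-error")
                break
            else:
                data.append("--!")
                pos -= 1
                state = "comment"
    return "".join(data), min(pos, len(text)), errors
```

For example, `tokenize_comment("a--b-->rest")` keeps the inner `--` as comment data and stops after the closing `-->`, which illustrates why `--` is allowed inside XML5 comments.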

1288 | CDATA state 1289 |

1290 | 1291 |
1292 | 1293 | Consume the next input character: 1294 | 1295 |
1296 |
U+005D RIGHT SQUARE BRACKET (])
1297 |
Switch to the CDATA bracket state.
1298 | 1299 |
EOF
1300 |
Parse error. Reprocess the current input character in the 1301 | data state. 1302 |
1303 | 1304 |
Anything else
1305 |
Emit the current input character as a character token. Stay in the 1306 | current state. 1307 |
1308 |
1309 |
1310 | 1311 |

1312 | CDATA bracket state 1313 |

1314 | 1315 |
1316 | 1317 | Consume the next input character: 1318 | 1319 |
1320 |
U+005D RIGHT SQUARE BRACKET (])
1321 |
Switch to the CDATA end state.
1322 | 1323 |
EOF
1324 |
Parse error. Reprocess the current input character in the 1325 | data state. 1326 |
1327 | 1328 |
Anything else
1329 |
Emit a U+005D RIGHT SQUARE BRACKET (]) character as a character token and also 1330 | emit the current input character as a character token. Switch to the CDATA state. 1331 |
1332 |
1333 |
1334 | 1335 | 1336 |

1337 | CDATA end state 1338 |

1339 | 1340 |
1341 | 1342 | Consume the next input character: 1343 | 1344 |
1345 |
U+003E GREATER-THAN SIGN (>)
1346 |
Switch to the data state.
1347 | 1348 |
U+005D RIGHT SQUARE BRACKET (])
1349 |
Emit the current input character as a character token. Stay in the 1350 | current state. 1351 |
1352 | 1353 |
EOF
1354 |
Parse error. Reconsume the current input character in the 1355 | data state. 1356 |
1357 | 1358 |
Anything else
1359 |
Emit two U+005D RIGHT SQUARE BRACKET (]) characters as character tokens and 1360 | also emit the current input character as a character token. Switch to the 1361 | CDATA state. 1362 |
1363 |
1364 |
1365 | 1366 |
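The three CDATA states amount to scanning for `]]>` while passing everything else through. A minimal sketch in Python follows, under stated assumptions: `tokenize_cdata` and its return tuple are hypothetical names, the input starts right after `<![CDATA[`, and a character other than `]` in the CDATA bracket state returns the tokenizer to the CDATA state.

```python
def tokenize_cdata(text):
    """Run the CDATA states over text that follows '<![CDATA['.

    Returns (emitted characters, number of characters consumed, parse errors).
    """
    out, errors = [], []
    state, pos = "cdata", 0
    while True:
        ch = text[pos] if pos < len(text) else None  # None stands for EOF
        pos += 1
        if ch is None:
            # Every CDATA state treats EOF as a parse error and hands over
            # to the data state; pending brackets are not emitted.
            errors.append("parse-error")
            break
        if state == "cdata":
            if ch == "]":
                state = "bracket"
            else:
                out.append(ch)
        elif state == "bracket":
            if ch == "]":
                state = "end"
            else:
                out.append("]" + ch)
                state = "cdata"  # assumption: back to the CDATA state
        elif state == "end":
            if ch == ">":
                break  # switch to the data state
            elif ch == "]":
                out.append("]")  # one bracket is emitted, two stay pending
            else:
                out.append("]]" + ch)
                state = "cdata"
    return "".join(out), min(pos, len(text)), errors
```

Single and double brackets that are not followed by `>` come out unchanged, so `tokenize_cdata("a]b]]c]]>tail")` yields the text `a]b]]c`.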

1367 | Character reference in attribute value state 1368 |

1369 | 1370 |
1371 | 1372 | Attempt to consume a character reference. 1373 | 1374 | If nothing is returned, append a U+0026 AMPERSAND (&) character to the current attribute's value. 1375 | 1376 | Otherwise, append the returned character tokens to the current attribute's value. 1377 | 1378 | Finally, switch back to the attribute value state that switched to this state. 1379 |
1380 | 1381 | 1382 |

1383 | Bogus comment state 1384 |

1385 | 1386 |
1387 | Consume the next input character: 1388 |
1389 |
U+003E GREATER-THAN SIGN (>)
1390 |
Switch to the data state. Emit the current comment token.
1391 | 1392 |
EOF
1393 |
Emit the comment token. Emit an end-of-file token.
1394 | 1395 |
Anything else
1396 |
1397 | Append the current input character to the comment token's data. 1398 |
1399 |
1400 |
1401 | 1402 | 1403 |
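Because the bogus comment state has a single terminator, it reduces to a search for the next U+003E GREATER-THAN SIGN. The sketch below is illustrative only; the function name `tokenize_bogus_comment`, the `data` parameter for already collected comment data, and the returned tuple are hypothetical.

```python
def tokenize_bogus_comment(text, data=""):
    """Consume text in the bogus comment state.

    Returns (comment data, number of characters consumed, reached_eof).
    """
    end = text.find(">")
    if end == -1:
        # EOF: everything left becomes comment data, then end-of-file follows.
        return data + text, len(text), True
    # '>' is consumed but is not part of the comment data.
    return data + text[:end], end + 1, False
```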

1404 | Tokenizing character references 1405 |

1406 | 1407 |
1408 | 1409 | This section defines how to consume a character reference, optionally with an additional 1410 | allowed 1411 | character, which, if specified where the algorithm is invoked, adds a character to the list of 1412 | characters 1413 | that cause there to not be a character reference. 1414 | 1415 | This definition is used when parsing character references in text and in attributes. 1417 | 1418 | The behavior depends on the identity of the next character (the one immediately after the U+0026 AMPERSAND 1419 | character), 1420 | as follows: 1421 | 1422 |
1423 |
U+0009 CHARACTER TABULATION (Tab)
1424 |
U+000A LINE FEED (LF)
1425 |
U+0020 SPACE (Space)
1426 |
U+003C LESS-THAN SIGN (<)
1427 |
U+0025 PERCENT SIGN (%)
1428 |
U+0026 AMPERSAND (&)
1429 |
EOF
1430 |
The additional allowed character if there is one
1431 |
Not a character reference. No characters are consumed and nothing is returned (This is not an 1432 | error, 1433 | either). 1434 |
1435 |
U+0023 NUMBER SIGN (#) 1436 |
1437 | 1438 | Consume the U+0023 NUMBER SIGN. 1439 | 1440 | The behaviour further depends on the character after the U+0023 NUMBER SIGN. 1441 | 1442 |
1443 |
U+0078 LATIN SMALL LETTER X
1444 |
U+0058 LATIN CAPITAL LETTER X
1445 |
1446 |

Consume the X.

1447 |

Follow the steps below, but using ASCII hex digits.

1448 |

When it comes to interpreting the number, interpret it as a hexadecimal number.

1449 |
1450 |
Anything else
1451 |
1452 | Follow the steps below, but using ASCII digits. 1453 | 1454 | When it comes to interpreting the number, interpret it as a decimal number. 1455 |
1456 |
1457 | 1458 | Consume as many characters as match the range of characters given above (ASCII hex digits 1459 | or ASCII 1460 | digits). 1461 | 1462 | If no characters match the range, then don't consume any characters. This is a parse 1463 | error; 1464 | return the U+0023 NUMBER SIGN character and, if appropriate, the X character as a string of text. 1465 | 1466 | Otherwise, if the next character is a U+003B SEMICOLON, consume that too. If it isn't, there is 1467 | a parse 1468 | error. 1469 | 1470 | If one or more characters match the range, then take them all and interpret the string of 1471 | characters as 1472 | a number (either hexadecimal or decimal as appropriate). 1473 | 1474 |

Should we do HTML like replacement? At least for null?

1475 | 1476 | Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this 1477 | is a 1478 | parse error. Return a U+FFFD REPLACEMENT CHARACTER character token. 1479 | 1480 |
Should we refuse Unicode from ranges listed (0x0001 to 0x0008, 0x000D to 1481 | 0x001F, 1482 | 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 1483 | 0x2FFFE, 1484 | 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 1485 | 0x7FFFF, 1486 | 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 1487 | 0xDFFFE, 1488 | 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF)?
1489 | 1490 | I've noted that the JavaScript implementation of XML5 has to work around some characters in 1491 | its 1492 | version. 1493 |
1494 |
1495 |
Anything else
1496 |
1497 | 1498 | Consume characters until you reach a U+003B SEMICOLON character (;). 1499 | 1500 |

What happens if there is no semicolon? Does it read the rest of the file? Maybe a 1501 | better 1502 | solution is to read all characters that are part of name char according to the XML 1.1 spec.

1504 | 1505 | Otherwise, a character reference is parsed. If the last character matched is not a U+003B 1506 | SEMICOLON 1507 | character (;), there is a parse error. 1508 | 1509 | If there was a parse error, the consumed characters are interpreted as part of a string and are 1510 | returned. 1511 | 1512 | If there wasn't a parse error, return a reference with a name equal to the consumed characters, 1513 | omitting the 1514 | U+003B SEMICOLON character (;). 1515 | 1516 |
1517 | If the markup contains the following attribute This is a &ref;, the character 1518 | tokenizer 1519 | should return this as a reference named ref. However, if the attribute is defined as 1520 | This 1521 | is &notref, then the tokenizer will interpret this as the text This is 1522 | &notref, while emitting a parse error. 1523 |
1524 | 1525 |
1526 |
1527 |
1528 | 1529 |
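The numeric branch of the algorithm above can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: `consume_numeric_reference`, its return tuple, and the generic `"parse-error"` label are hypothetical, and the sketch does no replacement for U+0000 or for the other code points that the open issues above discuss.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
DEC_DIGITS = "0123456789"

def consume_numeric_reference(text, pos):
    """Consume a numeric reference; text[pos] must be the '#' after '&'.

    Returns (returned text, position after the reference, parse errors).
    """
    assert text[pos] == "#"
    errors = []
    start = pos
    pos += 1  # consume the U+0023 NUMBER SIGN
    digits, base = DEC_DIGITS, 10
    if pos < len(text) and text[pos] in "xX":
        pos += 1  # consume the X
        digits, base = HEX_DIGITS, 16
    prefix_end = pos
    while pos < len(text) and text[pos] in digits:
        pos += 1
    if pos == prefix_end:
        # No digits: return '#' (and the X, if any) as plain text.
        errors.append("parse-error")
        return text[start:prefix_end], prefix_end, errors
    number = int(text[prefix_end:pos], base)
    if pos < len(text) and text[pos] == ";":
        pos += 1  # consume the U+003B SEMICOLON
    else:
        errors.append("parse-error")
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        errors.append("parse-error")
        return "\uFFFD", pos, errors
    return chr(number), pos, errors
```

A caller would invoke this only after the dispatch on the character following U+0026 AMPERSAND has selected the U+0023 NUMBER SIGN case.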

1530 | DOCTYPE state 1531 |

1532 |
1533 | Consume the next input character: 1534 |
1535 |
U+0009 CHARACTER TABULATION (Tab)
1536 |
U+000A LINE FEED (LF)
1537 |
U+0020 SPACE (Space)
1538 |
Switch to the before DOCTYPE name state.
1539 | 1540 |
EOF
1541 |
This is an eof-in-doctype parse error. Switch to the data state.
1542 | Create a new DOCTYPE token. Emit the DOCTYPE token. Emit an end-of-file token. 1543 | 1544 |
Anything else
1545 |
This is a missing-whitespace-before-doctype-name parse error. 1546 | Reconsume the character in the before DOCTYPE name state. 1547 |
1548 |
1549 |
1550 | 1551 |

1552 | Before DOCTYPE name state 1553 |

1554 | 1555 |
1556 | Consume the next input character: 1557 |
1558 |
U+0009 CHARACTER TABULATION (Tab)
1559 |
U+000A LINE FEED (LF)
1560 |
U+0020 SPACE (Space)
1561 |
Ignore the character.
1562 | 1563 |
Uppercase ASCII letter
1564 |
Create a new DOCTYPE token. Set the token's name to the lowercase version 1565 | of the current input 1566 | character. 1567 | Switch to the DOCTYPE name state. 1568 |
1569 | 1570 |
U+003E GREATER-THAN SIGN (>)
1571 |
This is a missing-doctype-name parse error. Create a new DOCTYPE token. 1572 | Emit the DOCTYPE token. Switch to the data state. 1573 |
1574 | 1575 |
EOF
1576 |
This is an eof-in-doctype parse error. Switch to the data state.
1577 | Create a new DOCTYPE token. Emit the DOCTYPE token. Emit an end-of-file token. 1578 | 1579 |
Anything else
1580 |
Create a new DOCTYPE token. Set the token's name to the current input character. Switch to the DOCTYPE 1581 | name state. 1582 |
1583 |
1584 |
1585 | 1586 |

1587 | DOCTYPE name state 1588 |

1589 | 1590 |
1591 | Consume the next input character: 1592 | 1593 |
1594 |
U+0009 CHARACTER TABULATION (Tab)
1595 |
U+000A LINE FEED (LF)
1596 |
U+0020 SPACE (Space)
1597 |
Set doctype depth to 0. Switch to the after DOCTYPE name state.
1598 | 1599 |
Uppercase ASCII letter
1600 |
Append the lowercase version of the current input character to the current DOCTYPE 1601 | token's name. 1602 |
1603 | 1604 |
U+003E GREATER-THAN SIGN (>)
1605 |
Switch to the data state. Emit the current DOCTYPE token.
1606 | 1607 |
EOF
1608 |
This is an eof-in-doctype parse error. Emit the current DOCTYPE token. 1609 | Emit an end-of-file token. 1610 |
1611 | 1612 |
Anything else
1613 |
Append the current input character to the current DOCTYPE token's name. 1614 | 1615 |
1616 |
1617 |
1618 | 1619 |

1620 | After DOCTYPE name state 1621 |

1622 | 1623 |
1624 | Consume the next input character: 1625 | 1626 |
1627 |
U+005B LEFT SQUARE BRACKET ([)
1628 |
Increase doctype depth by 1. Remain in current state.
1629 | 1630 |
U+005D RIGHT SQUARE BRACKET (])
1631 |
If the current doctype depth is 0, switch to the bogus DOCTYPE state. 1632 | Otherwise, decrease doctype depth by 1 and remain in the current state. 1633 |
1634 | 1635 |
U+003E GREATER-THAN SIGN (>)
1636 |
If the current doctype depth is 0, switch to the data state and emit the current DOCTYPE token. Otherwise, remain in the current state.
1637 | 1638 |
EOF
1639 |
This is an eof-in-doctype parse error. Switch to the data state. Emit the DOCTYPE 1640 | token. 1641 | Emit an end-of-file token. 1642 |
1643 | 1644 |
Anything else
1645 |
Remain in the current state.
1646 |
1647 |
1648 | 1649 |

1650 | Bogus DOCTYPE state 1651 |

1652 |
1653 | Consume the next input character: 1654 |
1655 |
U+003E GREATER-THAN SIGN (>)
1656 |
Switch to the data state. Emit the DOCTYPE token.
1657 | 1658 |
EOF
1659 |
Emit DOCTYPE token. Emit the end-of-file token.
1660 | 1661 |
Anything else
1662 |
Ignore character.
1663 |
1664 |
1665 | 1666 | 1667 |
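The DOCTYPE states keep only a name and skip the rest of the declaration while tracking square bracket depth. The following Python sketch illustrates this under several assumptions: `tokenize_doctype` and its return tuple are hypothetical, the input starts right after `<!DOCTYPE`, a missing name is represented as `None`, a U+003E GREATER-THAN SIGN in the DOCTYPE name state emits the current token, and a U+003E GREATER-THAN SIGN at non-zero depth is ignored.

```python
WHITESPACE = "\t\n "

def tokenize_doctype(text):
    """Run the DOCTYPE states over text that follows '<!DOCTYPE'.

    Returns (DOCTYPE name or None, characters consumed, parse errors).
    """
    name, errors = None, []
    state, pos, depth = "doctype", 0, 0
    while True:
        ch = text[pos] if pos < len(text) else None  # None stands for EOF
        pos += 1
        if ch is None:
            # Every state emits the token and then end-of-file on EOF;
            # only the bogus DOCTYPE state does so without a parse error.
            if state != "bogus":
                errors.append("eof-in-doctype")
            break
        if state == "doctype":
            if ch not in WHITESPACE:
                errors.append("missing-whitespace-before-doctype-name")
                pos -= 1  # reconsume
            state = "before name"
        elif state == "before name":
            if ch in WHITESPACE:
                continue  # ignore the character
            if ch == ">":
                errors.append("missing-doctype-name")
                break
            name = ch.lower() if "A" <= ch <= "Z" else ch
            state = "name"
        elif state == "name":
            if ch in WHITESPACE:
                depth = 0
                state = "after name"
            elif ch == ">":
                break  # assumption: the current token is emitted
            else:
                name += ch.lower() if "A" <= ch <= "Z" else ch
        elif state == "after name":
            if ch == "[":
                depth += 1
            elif ch == "]":
                if depth == 0:
                    state = "bogus"
                else:
                    depth -= 1
            elif ch == ">" and depth == 0:
                break
        elif state == "bogus":
            if ch == ">":
                break
    return name, min(pos, len(text)), errors
```

Since nothing inside the brackets is interpreted, entity declarations in an internal subset are skipped rather than expanded, which is consistent with the goal of avoiding entity expansion attacks.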

1668 | Tree construction 1669 |

1670 | 1671 | The input to the tree construction stage is a sequence of tokens from the 1672 | tokenization stage. The output of this stage is a tree model 1673 | represented by a Document object. 1674 | 1675 | The tree construction stage passes through several phases. The initial 1676 | phase is the start phase. 1677 | 1678 | The stack of open elements contains all elements of which the 1679 | closing tag has not yet been encountered. Once the first start tag token in 1680 | the start phase is encountered it will contain one open element. 1681 | The rest of the elements are added during the main phase. 1682 | 1683 | The current element is the bottommost node in this stack. 1684 | 1685 | The stack of open elements is said to have an element in 1686 | scope if the target element is in the stack of open elements. 1687 | 1688 | When the steps below require the user agent to append a 1689 | character to a node, the user agent must collect it 1690 | and all subsequent consecutive characters that would be appended to that node 1691 | and insert one Text node whose data is the concatenation of all 1692 | those characters. 1693 | 1694 | Need to define create an element for the token... 1695 | 1696 | When the steps below require the user agent to insert an element 1697 | for a token the user agent must create an element 1698 | for the token and then append it to the current element 1699 | and push it into the stack of open elements so 1700 | that it becomes the new current element. 1701 | 1702 |
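The two helper operations described above, appending a character and inserting an element, might look roughly like this in Python. The `Element` class, the use of plain `str` values in place of Text nodes, and the function names are all hypothetical simplifications; merging into a trailing text child one character at a time gives the same result as collecting consecutive characters up front.

```python
class Element:
    """A bare-bones element: a name and an ordered list of children."""

    def __init__(self, name):
        self.name = name
        self.children = []  # Element nodes, or str standing in for Text nodes

def append_character(node, ch):
    """Extend the trailing text child if there is one, else start a new one."""
    if node.children and isinstance(node.children[-1], str):
        node.children[-1] += ch
    else:
        node.children.append(ch)

def insert_element(stack, name):
    """Create an element, append it to the current element and push it."""
    element = Element(name)
    stack[-1].children.append(element)  # the current element is stack[-1]
    stack.append(element)
    return element
```

Here the stack of open elements is a Python list whose last entry is the current element, so consecutive characters always end up in a single text child of that element.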
1703 | 1704 |
Start phase
1705 | 1706 |

1707 | Each token emitted from the tokenization stage must be 1708 | processed as follows until the algorithm below switches to a different 1709 | phase: 1710 | 1711 |

1712 |
A start tag token
1713 |
1714 | 1715 | Create an element for the token and then append it to the 1716 | Document node and push it into the stack of open elements. 1717 | 1718 | This element is the root element and the first current 1719 | element. Then switch to the main phase. 1720 |
1721 | 1722 |
An empty tag token
1723 |
1724 | 1725 | Create an element for the token and append it to the 1726 | Document node. Then switch to the end phase. 1727 | 1728 |
1729 | 1730 |
A comment token
1731 |
1732 | 1733 | Append a Comment node to the Document node 1734 | with the data attribute set to the data given in the 1735 | token. 1736 | 1737 |
1738 | 1739 |
A processing instruction token
1740 |
1741 | 1742 | Append a ProcessingInstruction node to the 1743 | Document node with the target and data 1744 | attributes set to the target and data given in the token. 1745 | 1746 |
1747 | 1748 |
An end-of-file token
1749 |
1750 | 1751 | Parse error. Reprocess the token in the end 1752 | phase. 1753 | 1754 |
1755 | 1756 |
Anything else
1757 |
1758 | Parse error. Ignore the token. 1759 |
1760 |
1761 | 1762 |
Main phase
1763 | 1764 |

1765 | Once a start tag token has been encountered (as detailed in the previous 1766 | phase) each token must be processed using the following 1767 | steps until further notice: 1768 | 1769 |

1770 |
A character token
1771 |
1772 | 1773 | Append a character to the current 1774 | element. 1775 | 1776 |
1777 | 1778 |
A start tag token
1779 |

Insert an element for the token.

1780 | 1781 |
An empty tag token
1782 |

Create an element for the token and append it to the 1783 | current element.

1784 | 1785 |
An end tag token
1786 |
1787 | 1788 | If the tag name of the current node does not match the tag 1789 | name of the end tag token this is a parse error. 1790 | 1791 | If there is an element in scope with the same tag name as 1792 | that of the token pop nodes from the stack of open elements 1793 | until the first such element has been popped from the stack. 1794 | 1795 | If there are no more elements on the stack of open elements at this point 1796 | switch to the end phase. 1797 |
1798 | 1799 |
A short end tag token
1800 |
1801 | 1802 | Pop an element from the stack of open elements. If there 1803 | are no more elements on the stack of open elements switch to the end 1804 | phase. 1805 | 1806 |
1807 | 1808 |
A comment token
1809 |
1810 | 1811 | Append a Comment node to the current element 1812 | with the data attribute set to the data given in the 1813 | token. 1814 | 1815 |
1816 | 1817 |
A processing instruction token
1818 |
1819 | Append a ProcessingInstruction node to the current 1820 | element with the target and data attributes 1821 | set to the target and data given in the token. 1822 | 1823 |
1824 | 1825 |
An end-of-file token
1826 |
1827 | Parse error. Reprocess the token in the end phase. 1828 | 1829 |
1830 |
1831 | 1832 |
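The end tag and short end tag rules of the main phase are where HTML-like error recovery replaces well-formedness checks. A minimal sketch in Python, assuming the stack of open elements is represented as a list of tag names with the current element last; the function names and return values are hypothetical.

```python
def process_end_tag(stack, tag_name):
    """Apply the end tag rule to a non-empty stack of open element names.

    Returns (parse_error, switch_to_end_phase).
    """
    parse_error = stack[-1] != tag_name
    if tag_name in stack:  # there is an element in scope with that name
        # Pop until the nearest matching element has been popped.
        while stack.pop() != tag_name:
            pass
    return parse_error, not stack

def process_short_end_tag(stack):
    """Apply the short end tag rule; returns switch_to_end_phase."""
    stack.pop()
    return not stack
```

With the stack `["a", "b", "c"]`, an end tag for `b` is a parse error that closes both `c` and `b`, while an end tag for a name that is not on the stack is a parse error that leaves the stack untouched.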
End phase
1833 | before 1834 |

1835 | Tokens in end phase must be handled as follows: 1836 | 1837 |

1838 |
A comment token
1839 |
Append a Comment node to the Document node 1840 | with the data attribute set to the data given in the 1841 | token. 1842 |
1843 | 1844 |
A processing instruction token
1845 |
1846 | 1847 | Append a ProcessingInstruction node to the 1848 | Document node with the target and data 1849 | attributes set to the target and data given in the token. 1850 | 1851 |
1852 | 1853 |
An end-of-file token
1854 |
1855 | 1856 | Stop parsing. 1857 |
1858 | 1859 |
Anything else
1860 |
1861 | 1862 | Parse error. Ignore the token. 1863 |
1864 |
1865 | 1866 |
1867 | 1868 |

Once the user agent stops parsing the 1869 | document, it must follow these steps: 1870 | 1871 |

TODO

1872 | 1873 |

1874 | Writing XML documents 1875 |

1876 | 1877 |

1878 | Common parser idioms 1879 |

1880 | 1881 | The ASCII digits are the characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT 1882 | NINE ( 1883 | 9). 1884 | 1885 | The ASCII hex digits are the characters in the ranges U+0030 DIGIT ZERO ( 1886 | 0) to U+0039 DIGIT NINE ( 1887 | 9), U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F, and U+0061 LATIN SMALL LETTER A 1888 | to U+0066 LATIN SMALL LETTER F. 1889 | 1890 | The lowercase ASCII 1891 | letters are characters in the range between U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z. 1892 | 1893 | The uppercase ASCII 1894 | letters are characters in the range between U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER 1895 | Z. 1896 | 1897 | Comparing two strings in an ASCII case-insensitive manner means comparing them exactly, code point for code 1898 | point, except that the characters in the range U+0041 to U+005A (i.e. LATIN CAPITAL LETTER A to LATIN CAPITAL LETTER Z) 1899 | and the corresponding characters in the range U+0061 to U+007A (i.e. LATIN SMALL LETTER A to LATIN SMALL LETTER Z) are 1900 | considered to also match. 1901 | --------------------------------------------------------------------------------