├── .gitignore ├── 1.1 ├── spec.md └── spec_zh_CN.md ├── 1.2 ├── defs.yml ├── index.html ├── spec.after.html ├── spec.before.html ├── spec.md └── templates │ ├── element │ └── property ├── Makefile ├── README.md ├── biblio.json ├── gen-defs.py ├── hocr-spec.md ├── images ├── baseline.png ├── bbox-crop.png └── bbox.odg ├── index.html └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | 1.2/include/defs 2 | -------------------------------------------------------------------------------- /1.1/spec.md: -------------------------------------------------------------------------------- 1 | # The hOCR Embedded OCR Workflow and Output Format, version 1.1 2 | 3 | **OBSOLETE**: [Version 1.2](http://kba.github.io/hocr-spec/1.2/) supersedes this document. 4 | 5 | The purpose of this document is to define an open standard for representing OCR 6 | results. The goal is to reuse as much existing technology as possible, and to 7 | arrive at a representation that makes it easy to reuse OCR results. 8 | 9 | This is the english translation, a [chinese translation is available as well](./spec_zh_CN.md). 
10 | 11 | 12 | ## Table of Contents 13 | 14 | 15 | * [Table of Contents](#table-of-contents) 16 | * [Revision History](#revision-history) 17 | * [1 Rationale](#1-rationale) 18 | * [2 Getting Started](#2-getting-started) 19 | * [3 Terminology and Representation](#3-terminology-and-representation) 20 | * [General Properties](#general-properties) 21 | * [`bbox`](#bbox) 22 | * [`textangle`](#textangle) 23 | * [Non-recommended general properties](#non-recommended-general-properties) 24 | * [`poly`](#poly) 25 | * [`order`](#order) 26 | * [`presence`](#presence) 27 | * [`cflow`](#cflow) 28 | * [`baseline`](#baseline) 29 | * [4 Logical Structuring Elements](#4-logical-structuring-elements) 30 | * [`ocr_document`](#ocr_document) 31 | * [`ocr_title`](#ocr_title) 32 | * [`ocr_author`](#ocr_author) 33 | * [`ocr_abstract`](#ocr_abstract) 34 | * [`ocr_part`](#ocr_part) 35 | * [`ocr_chapter`](#ocr_chapter) 36 | * [`ocr_section`](#ocr_section) 37 | * [`ocr_subsubsection`](#ocr_subsubsection) 38 | * [`ocr_display`](#ocr_display) 39 | * [`ocr_blockquote`](#ocr_blockquote) 40 | * [`ocr_par`](#ocr_par) 41 | * [`ocr_linear`](#ocr_linear) 42 | * [`ocr_caption`](#ocr_caption) 43 | * [5 Typesetting Related Elements](#5-typesetting-related-elements) 44 | * [Classes for typesetting elements](#classes-for-typesetting-elements) 45 | * [`ocr_page`](#ocr_page) 46 | * [`ocr_column`](#ocr_column) 47 | * [`ocr_carea`](#ocr_carea) 48 | * [`ocr_line`](#ocr_line) 49 | * [`ocr_separator`](#ocr_separator) 50 | * [`ocr_noise`](#ocr_noise) 51 | * [Recommended Properties for typesetting elements](#recommended-properties-for-typesetting-elements) 52 | * [`bbox (typesetting)`](#bbox-typesetting) 53 | * [`image`](#image) 54 | * [`imagemd5`](#imagemd5) 55 | * [`ppageno`](#ppageno) 56 | * [`lpageno`](#lpageno) 57 | * [Optional Properties for typesetting elements](#optional-properties-for-typesetting-elements) 58 | * [`scan_res`](#scan_res) 59 | * [`x_scanner`](#x_scanner) 60 | * [`x_source`](#x_source) 61 | * 
[`hardbreak`](#hardbreak) 62 | * [Classes for floats](#classes-for-floats) 63 | * [`ocr_float`](#ocr_float) 64 | * [`ocr_separator`](#ocr_separator-1) 65 | * [`ocr_textfloat`](#ocr_textfloat) 66 | * [`ocr_textimage`](#ocr_textimage) 67 | * [`ocr_image`](#ocr_image) 68 | * [`ocr_linedrawing`](#ocr_linedrawing) 69 | * [`ocr_photo`](#ocr_photo) 70 | * [`ocr_header`](#ocr_header) 71 | * [`ocr_footer`](#ocr_footer) 72 | * [`ocr_pageno`](#ocr_pageno) 73 | * [`ocr_table`](#ocr_table) 74 | * [6 Inline Representations](#6-inline-representations) 75 | * [Classes for Inline Representation](#classes-for-inline-representation) 76 | * [`ocr_glyph`](#ocr_glyph) 77 | * [`ocr_glyphs`](#ocr_glyphs) 78 | * [`ocr_dropcap`](#ocr_dropcap) 79 | * [`ocr_chem`](#ocr_chem) 80 | * [`ocr_math`](#ocr_math) 81 | * [Non-breaking space](#non-breaking-space) 82 | * [Non-default spaces](#non-default-spaces) 83 | * [Hyphenation](#hyphenation) 84 | * [Superscript and Subscript](#superscript-and-subscript) 85 | * [Ruby characters](#ruby-characters) 86 | * [7 Character Information](#7-character-information) 87 | * [Classes for Character Information](#classes-for-character-information) 88 | * [`ocr_cinfo`](#ocr_cinfo) 89 | * [Properties for Character Information](#properties-for-character-information) 90 | * [`cuts`](#cuts) 91 | * [`nlp`](#nlp) 92 | * [8 OCR Engine-Specific Markup](#8-ocr-engine-specific-markup) 93 | * [Classes for engine specific markup](#classes-for-engine-specific-markup) 94 | * [`ocrx_block`](#ocrx_block) 95 | * [`ocrx_line`](#ocrx_line) 96 | * [`ocrx_word`](#ocrx_word) 97 | * [Properties for engine-specific markup](#properties-for-engine-specific-markup) 98 | * [`x_font`](#x_font) 99 | * [`x_fsize`](#x_fsize) 100 | * [`x_boxes`](#x_boxes) 101 | * [`x_confs`](#x_confs) 102 | * [`x_wconf`](#x_wconf) 103 | * [9 Font, Text Color, Language, Direction](#9-font-text-color-language-direction) 104 | * [10 Alternative Segmentations / Readings](#10-alternative-segmentations--readings) 105 | * 
[11 Grouped Elements and Multiple Hierarchies](#11-grouped-elements-and-multiple-hierarchies) 106 | * [12 Capabilities](#12-capabilities) 107 | * [`ocrp_lang`](#ocrp_lang) 108 | * [`ocrp_dir`](#ocrp_dir) 109 | * [`ocrp_poly`](#ocrp_poly) 110 | * [`ocrp_font`](#ocrp_font) 111 | * [`ocrp_nlp`](#ocrp_nlp) 112 | * [`ocr_embeddedformat_`](#ocr_embeddedformat_) 113 | * [`ocr__unordered`](#ocr__unordered) 114 | * [13 Profiles](#13-profiles) 115 | * [14 Required Meta Information](#14-required-meta-information) 116 | * [15 HTML Markup](#15-html-markup) 117 | * [`html_none`](#html_none) 118 | * [`html_ocr`](#html_ocr) 119 | * [`html_absolute`](#html_absolute) 120 | * [`html_xytable`](#html_xytable) 121 | * [`html_simpl`](#html_simpl) 122 | * [15.1 Restrictions on HTML Content](#151-restrictions-on-html-content) 123 | * [15.2 Recommendations for Mappings](#152-recommendations-for-mappings) 124 | * [15.2.1 html_none](#1521-html_none) 125 | * [15.2.2 html_simple](#1522-html_simple) 126 | * [15.2.3 html_ocr_](#1523-html_ocr_) 127 | * [15.2.4 html_absolute_](#1524-html_absolute_) 128 | * [15.2.5 html_xytable_absolute](#1525-html_xytable_absolute) 129 | * [15.2.6 html_xytable_relative](#1526-html_xytable_relative) 130 | * [15.2.7 html_](#1527-html_) 131 | * [`html_latex2html`](#html_latex2html) 132 | * [`html_msword`](#html_msword) 133 | * [`html_ooffice`](#html_ooffice) 134 | * [`html_docbook_xsl`](#html_docbook_xsl) 135 | * [16 Document Meta Information](#16-document-meta-information) 136 | * [17 Sample Usage](#17-sample-usage) 137 | 138 | 139 | 140 | ## Revision History 141 | 142 | hOCR has been originally developed by Thomas Breuel. 143 | 144 | See the [releases](https://github.com/kba/hocr-spec/releases/) and full [commit 145 | history](https://github.com/kba/hocr-spec/commits/) for a revision history. 146 | 147 | ## 1 Rationale 148 | 149 | The purpose of this document is to define an open standard for representing OCR 150 | results. 
The goal is to reuse as much existing technology as possible, and to 151 | arrive at a representation that makes it easy to reuse OCR results. 152 | 153 | 154 | ## 2 Getting Started 155 | 156 | This document describes many tags and a lot of information that can be output. 157 | However, getting started with hOCR is easy: you only need to output the tags 158 | and information you actually want to. For example, just outputting `ocr_line` 159 | tags with bounding boxes is already very useful for many applications. Just 160 | start simple and add more output information as the need arises. 161 | 162 | 163 | ## 3 Terminology and Representation 164 | 165 | This document describes a representation of various aspects of OCR output in an 166 | XML-like format. That is, we define a set of tags containing text and other 167 | tags, together with attributes of those tags. However, since the content we are 168 | representing is formatted text, 169 | 170 | However, we are not actually using a new XML format for the representation; instead, 171 | we embed the representation in XHTML (or HTML) because XHTML and XHTML processing 172 | already define many aspects of OCR output representation that would otherwise 173 | need additional, separate and ad-hoc definitions.
These aspects include: 174 | 175 | * standard representations for common logical structuring elements, including 176 | section headings, citations, tables, emphasis, line breaks, quotations, 177 | and preformatted text 178 | * standard representations for fonts, embedded images, embedded vector 179 | graphics, tables, languages, writing direction, colors 180 | * standard representations for geometric layout and positioning 181 | * output files that are understood without any further modification by widely 182 | used viewers (browsers), editors, conversion tools, and indexing tools 183 | * libraries for parsing and generating the content 184 | * support for document metadata 185 | 186 | We are embedding this information inside HTML by encoding it within valid tags 187 | and attributes inside HTML. We use the terms "elements" and 188 | "properties" to refer to embedded markup. 189 | 190 | Elements are defined by the `class=` attribute on an arbitrary HTML tag. All 191 | elements in this format have a class name of the form `ocr..._...`. 192 | 193 | Properties are defined by putting information into the `title=` attribute of an 194 | HTML tag. Properties in title attributes are of the form “name values...”, and 195 | multiple properties are separated by semicolons. 196 | 197 | Here is an example: 198 | 199 | ```html 200 | <div class="ocr_page" id="page_1"> 201 |   <div class="ocr_carea" id="column_1" title="bbox 313 324 733 1922"> 202 |     <div class="ocr_par" id="par_7"> ... </div> 203 |     <div class="ocr_par" id="par_19"> ... </div> 204 |   </div> 205 | </div>
206 | ``` 207 | 208 | ### General Properties 209 | 210 | The following properties can apply to most elements (where it makes sense): 211 | 212 | #### `bbox` 213 | 214 | * `bbox x0 y0 x1 y1` – the bounding box of the element relative to the 215 | binarized document image 216 | * use `x_bboxes` below for character bounding boxes 217 | * do not use `bbox` unless the bounding box of the layout component is, in 218 | fact, rectangular 219 | * some non-rectangular layout components may have rectangular bounding boxes 220 | if the non-rectangularity is caused by floating elements around which text flows 221 | 222 | See also the section [`bbox (typesetting)`](#bbox-typesetting). 223 | 224 | #### `textangle` 225 | 226 | * `textangle alpha` - the angle in degrees by which textual content has been 227 | rotated relative to the rest of the page (if not present, the angle is assumed 228 | to be zero); rotations are counter-clockwise, so an angle of 90 degrees is 229 | vertical text running from bottom to top in Latin script; note that this is 230 | different from reading order, which should be indicated using standard HTML 231 | properties 232 | 233 | ### Non-recommended general properties 234 | 235 | The following properties can apply to most elements but should not be used 236 | unless there is no alternative: 237 | 238 | #### `poly` 239 | 240 | * `poly x0 y0 x1 y1 ...` - a closed polygon for elements with non-rectangular bounds 241 | * this property must not be used unless there is no other way of 242 | representing the layout of the page using rectangular bounding boxes, 243 | since most tools will simply not have the capability of dealing with 244 | non-rectangular layouts 245 | * note that the natural and correct representation of many non-rectangular 246 | layouts is in terms of rectangular content areas and rectangular floats 247 | * documents using polygonal borders anywhere must indicate this in the 248 | metadata 249 | * documents should attempt to provide a
reasonable bbox equivalent as well 250 | 251 | #### `order` 252 | 253 | * `order n` – the reading order of the element (an integer) 254 | * this property must not be used unless there is no other way of representing 255 | the reading order of the page by element ordering within the page, since 256 | many tools will not be able to deal with content that is not in reading order 257 | 258 | #### `presence` 259 | 260 | * `presence` presence must be declared in the document meta data 261 | 262 | The following property relates the flow between multiple `ocr_carea` elements, 263 | and between `ocr_carea` and `ocr_linear` elements. 264 | 265 | #### `cflow` 266 | 267 | * `cflow s` – the content flow on the page that this element is a part of 268 | * s must be a unique string for each content flow 269 | * must be present on ocr_carea and ocrx_block tags when reading order is 270 | attempted and multiple content flows are present 271 | * presence must be declared in the document meta data 272 | 273 | This property applies primarily to textlines 274 | 275 | #### `baseline` 276 | 277 | * `baseline pn pn-1 ... p0` - a polynomial describing the baseline of a line of 278 | text 279 | * the polynomial is in the coordinate system of the line, with the bottom 280 | left of the bounding box as the origin 281 | 282 | ## 4 Logical Structuring Elements 283 | 284 | We recognize the following logical structuring elements: 285 | 286 | * `ocr_document` 287 | * `ocr_linear` 288 | * `ocr_title` 289 | * `ocr_author` 290 | * `ocr_abstract` 291 | * `ocr_part` [`
<h1>`] 292 | * `ocr_chapter` [`<h1>`] 293 | * `ocr_section` [`<h2>`] 294 | * `ocr_sub*section` [`<h3>`,`<h4>`] 295 | * `ocr_display` 296 | * `ocr_blockquote` [`<blockquote>`] 297 | * `ocr_par` [`<p>
`] 298 | 299 | ### `ocr_document` 300 | ### `ocr_title` 301 | ### `ocr_author` 302 | ### `ocr_abstract` 303 | ### `ocr_part` 304 | ### `ocr_chapter` 305 | ### `ocr_section` 306 | ### `ocr_subsubsection` 307 | ### `ocr_display` 308 | ### `ocr_blockquote` 309 | ### `ocr_par` 310 | 311 | These logical tags have their standard meaning as used in the publishing 312 | industry and tools like LaTeX, MS Word, and others. 313 | 314 | The standard HTML tags given in brackets specify the preferred HTML tags to use 315 | with those logical structuring elements, but it may not be possible or 316 | desirable to actually choose those tags (e.g., when adding hOCR information to 317 | an existing HTML output routine). 318 | 319 | ### `ocr_linear` 320 | 321 | For all of these elements except `ocr_linear`, there exists a natural linear 322 | ordering defined by reading order (`ocr_linear` indicates that the elements 323 | contained in it have a linear ordering). At the level of `ocr_linear`, there 324 | may not be a single distinguished order. A common example of `ocr_linear` is a 325 | newspaper, in which a single newspaper may contain many `ocr_linear` streams, but there is no 326 | unique reading order among the different streams. OCR evaluation tools should 327 | therefore be sensitive to the order of all elements other than `ocr_linear`. 328 | 329 | Tags must be nested as indicated by the nesting above, but not all tags within the 330 | hierarchy need to be present. 331 | 332 | Textual information like section numbers and bullets must be represented as 333 | text inside the containing element. 334 | 335 | Documents whose logical structure does not map naturally onto these logical 336 | structuring elements must not use them for other purposes.
337 | 338 | ### `ocr_caption` 339 | 340 | Image captions may be indicated using the `ocr_caption` element; such an 341 | element refers to the image(s) contained within the same float, or the 342 | immediately adjacent image if both the image and the `ocr_caption` element are 343 | in running text. 344 | 345 | 346 | ## 5 Typesetting Related Elements 347 | 348 | The following typesetting related elements are based on a typesetting model as 349 | found in most typesetting systems, including 350 | [XSL:FO](https://www.w3.org/TR/xsl11/#fo-section), 351 | [(La)TeX](https://latex-project.org/guides/usrguide.pdf), 352 | [LibreOffice](https://wiki.documentfoundation.org/images/e/e6/WG42-WriterGuideLO.pdf), 353 | and Microsoft Word. 354 | 355 | In those systems, each page is divided into a number of areas. Each area can 356 | either be a part of the body text (or multiple body texts, in the case of 357 | newspaper layouts). The content of the areas derives from a linear stream of 358 | textual content, which flows into the areas, filling them linewise in their 359 | preferred directions. 360 | 361 | Overlaid onto the page is a set of floating elements; floating elements exist 362 | outside the normal reading order. Floating elements may be introduced by the 363 | textual content, or they may be related to the page itself (anchoring is a 364 | logical property). In typesetting systems, floating elements may be anchored to 365 | the page, to paragraphs, or to the content stream. Floating elements can 366 | overlap content areas and render on top of or under content, or they can force 367 | content to flow around them. The default for floating elements in this spec is 368 | that their anchor is undefined (it is a logical property, not a typesetting 369 | property), and that text flows around them. Note that with rectangular content 370 | areas and rectangular floats, already a wide variety of non-rectangular text 371 | shapes can be realized. 
372 | 373 | **Issue: there is currently no way of indicating anchoring or flow-around 374 | properties for floating elements; properties need to be defined for this.** 375 | 376 | ### Classes for typesetting elements 377 | 378 | The following classes, as well as [floats](#classes-for-floats) are used for type-setting 379 | elements. 380 | 381 | #### `ocr_page` 382 | 383 | * `ocr_page` 384 | 385 | The `ocr_page` element must be present in all hOCR documents. 386 | 387 | #### `ocr_column` 388 | 389 | **DEPRECATED**: Please use [`ocr_carea`](#ocr_carea) instead 390 | 391 | #### `ocr_carea` 392 | 393 | * `ocr_carea` 394 | 395 | "ocr content area" or "body area" 396 | 397 | Used to be called ~~ocr_column~~ 398 | 399 | #### `ocr_line` 400 | 401 | Should be in a `` 402 | 403 | #### `ocr_separator` 404 | 405 | * `ocr_separator` (any separator or similar element) 406 | 407 | #### `ocr_noise` 408 | 409 | * `ocr_noise` (any noise element that isn't part of typesetting) 410 | 411 | ### Recommended Properties for typesetting elements 412 | 413 | The following properties should be present: 414 | 415 | #### `bbox (typesetting)` 416 | 417 | * `bbox` 418 | * the bounding box of the page; for pages, the top left corner must be at 419 | `(0,0)`, so a typical page bounding box will look like `bbox 0 0 2300 3200` 420 | 421 | #### `image` 422 | 423 | * `image imagefile` 424 | * image file name used as input 425 | * syntactically, must be a UNIX-like pathname or http URL (no Windows pathnames) 426 | * may be relative 427 | * cannot be resolved to the actual file in general (e.g., if the hOCR file 428 | becomes separated from the image file) 429 | * if the hOCR file is present in a directory hierarchy or file archive, should 430 | resolve to the corresponding image file 431 | 432 | #### `imagemd5` 433 | 434 | * `imagemd5 checksum` 435 | * MD5 fingerprint of the image file that this page was derived from 436 | * allows re-associating pages with source images 437 | 438 | #### `ppageno` 439 
| 440 | * `ppageno n` 441 | * the physical page number 442 | * the front cover is page number 0 443 | * should be unique 444 | * must not be present unless the pages in the document have a physical ordering 445 | * must not be present unless it is well defined and unique 446 | 447 | #### `lpageno` 448 | 449 | * `lpageno string` 450 | * the logical page number expressed on the page 451 | * need not be numerical (e.g., Roman numerals) 452 | * is usually unique 453 | * must not be present unless it has been recognized from the page and is unambiguous 454 | 455 | ### Optional Properties for typesetting elements 456 | 457 | The following properties MAY be present: 458 | 459 | #### `scan_res` 460 | 461 | * `scan_res x_res y_res` 462 | * scanning resolution in DPI 463 | #### `x_scanner` 464 | 465 | * `x_scanner string` 466 | * a representation of the scanner 467 | 468 | #### `x_source` 469 | 470 | * `x_source string` 471 | * an implementation-dependent representation of the document source 472 | * could be a URL or a /gfs/ path 473 | * offsets within a multipage format (e.g., TIFF) may be represented using 474 | additional strings or using URL parameters or fragments 475 | * examples 476 | * `x_source /gfs/cc/clean/012345678911 17` 477 | * `x_source http://pageserver/012345678911&page=17` 478 | 479 | The `ocr_carea` elements should appear in reading order unless this is impossible 480 | because of some other structuring requirement. If the document contains multiple 481 | `ocr_linear` streams, then each `ocr_carea` must indicate which stream it belongs 482 | to. 483 | 484 | In typesetting systems, content areas are filled with “blocks”, but most of 485 | those blocks are not recoverable or semantically meaningful. However, one type 486 | of block is visible and very important for OCR engines: the line. Lines are 487 | typesetting blocks that only contain glyphs (“inlines” in XSL terminology). 488 | 489 | They are represented by the `ocr_line` area.
In addition to the standard 490 | properties, the `ocr_line` area supports the following additional properties: 491 | 492 | #### `hardbreak` 493 | 494 | * `hardbreak n` 495 | * a zero (default) indicates that the end of the line is not a hard 496 | (explicit) line break, but a break due to text flow 497 | * a one indicates that the line is a hard (explicit) line break 498 | 499 | Any special characters representing the desired end-of-line processing must be 500 | present inside the `ocr_line` element. Examples of such special characters are a 501 | soft hyphen ("­", `U+00AD`), a hard line break (`
`), or whitespace (` `) for soft 502 | line breaks. 503 | 504 | Note that for many documents, the actual ground truth careas are well-defined 505 | by the document style of the original document before printing and scanning. 506 | From a single page, the `careas` of the original document style cannot be 507 | recovered exactly. However, the partition of a document by `ocr_carea` for an 508 | individual page shall be considered correct relative to ground truth if 509 | 510 | 1. all the text contained in a ground truth carea is fully contained within a 511 | single `ocr_carea`, 512 | 2. no text outside a ground truth `carea` is contained within an 513 | `ocr_carea`, and 514 | 3. the `ocr_careas` appear in the same order as the text flow 515 | relationships between the ground truth careas. 516 | 517 | ### Classes for floats 518 | 519 | Floats should not be nested. 520 | 521 | The following floats are defined: 522 | 523 | #### `ocr_float` 524 | 525 | * `ocr_float` 526 | 527 | #### `ocr_separator` 528 | 529 | * `ocr_separator` 530 | 531 | #### `ocr_textfloat` 532 | 533 | * `ocr_textfloat` 534 | 535 | #### `ocr_textimage` 536 | 537 | * `ocr_textimage` 538 | 539 | #### `ocr_image` 540 | 541 | * `ocr_image` 542 | 543 | #### `ocr_linedrawing` 544 | 545 | * `ocr_linedrawing` – something that could be represented well and naturally in 546 | a vector graphics format like SVG (even if it is actually represented as PNG) 547 | 548 | #### `ocr_photo` 549 | 550 | * `ocr_photo` – something that requires JPEG or PNG to be represented well 551 | 552 | #### `ocr_header` 553 | 554 | * `ocr_header` 555 | 556 | #### `ocr_footer` 557 | 558 | * `ocr_footer` 559 | 560 | #### `ocr_pageno` 561 | 562 | * `ocr_pageno` 563 | 564 | #### `ocr_table` 565 | 566 | * `ocr_table` 567 | 568 | ## 6 Inline Representations 569 | 570 | There is some content that should behave and flow like text 571 | 572 | ### Classes for Inline Representation 573 | 574 | #### `ocr_glyph` 575 | 576 | * `ocr_glyph` – an 
individual glyph represented as an image (e.g., an unrecognized character) 577 | * must contain a single `<img>` tag, or be present on one 578 | 579 | #### `ocr_glyphs` 580 | 581 | * `ocr_glyphs` – multiple glyphs represented as an image (e.g., an unrecognized word) 582 | * must contain a single `<img>` tag, or be present on one 583 | 584 | #### `ocr_dropcap` 585 | 586 | * `ocr_dropcap` – an individual glyph representing a dropcap 587 | * may contain text or an `<img>` tag; the `alt` of the image tag should 588 | contain the corresponding text 589 | 590 | #### `ocr_chem` 591 | 592 | * `ocr_chem` – a chemical formula 593 | * must contain either a single `<img>` tag or 594 | [ChemML](http://www.xml-cml.org/) markup, or be present on one 595 | 596 | #### `ocr_math` 597 | 598 | * `ocr_math` – a mathematical formula 599 | * must contain either a single `<img>` tag or 600 | [MathML](https://www.w3.org/Math/) markup, or be present on one 601 | 602 | Mathematical and chemical formulas that float must be put into an `ocr_float` 603 | section. 604 | 605 | Mathematical and chemical formulas that are in “display” mode should be put into 606 | an `ocr_display` section. 607 | 608 | #### Non-breaking space 609 | 610 | Non-breaking spaces must be represented using the HTML `&nbsp;` entity. 611 | 612 | #### Non-default spaces 613 | 614 | Different space widths should be indicated using the HTML entities `&ensp;`, `&emsp;`, 615 | `&thinsp;`, `&zwnj;`, `&zwj;`. 616 | 617 | #### Hyphenation 618 | 619 | Soft hyphens must be represented using the HTML `&shy;` entity. 620 | 621 | The HTML `&lrm;` and `&rlm;` entities (indicating writing direction) must not 622 | be used; all writing direction changes must be indicated with tags. 623 | 624 | #### Superscript and Subscript 625 | 626 | Other superscripts and subscripts must be represented using the HTML `<sup>` and 627 | `<sub>` tags, even if special Unicode characters are available.
628 | 629 | #### Ruby characters 630 | 631 | [Furigana and similar constructs](https://en.wikipedia.org/wiki/Ruby_character) 632 | must be represented using their correct Unicode encoding. 633 | 634 | ## 7 Character Information 635 | 636 | ### Classes for Character Information 637 | 638 | Character-level information may be put on any element that contains only a 639 | single "line" of text. 640 | 641 | #### `ocr_cinfo` 642 | 643 | If no other layout element applies, the `ocr_cinfo` element may be used. 644 | 645 | ### Properties for Character Information 646 | 647 | #### `cuts` 648 | 649 | * `cuts c1 c2 c3 ...` 650 | * character segmentation cuts (see below) 651 | * there must be a `bbox` property relative to which the cuts can be interpreted 652 | 653 | #### `nlp` 654 | 655 | * `nlp c1 c2 c3 ...` 656 | * estimate of the negative log probabilities of each character by the recognizer 657 | 658 | For left-to-right writing directions, cuts are sequences of deltas in the x and 659 | y direction; the first delta in each path is an offset in the x direction 660 | relative to the last x position of the previous path. The subsequent deltas 661 | alternate between up and right moves. 662 | 663 | Assume a bounding box of `(0,0,300,100)`; then 664 | 665 | ```python 666 | cuts("10 11 7 19") = 667 | [ [(10,0),(10,100)], [(21,0),(21,100)], [(28,0),(28,100)], [(47,0),(47,100)] ] 668 | cuts("10,50,3 11,30,-3") = 669 | [ [(10,0),(10,50),(13,50),(13,100)], [(21,0),(21,30),(18,30),(18,100)] ] 670 | ``` 671 | 672 | Here is an example: 673 | 674 | ```html 675 | hello 676 | ``` 677 | 678 | 679 | Cuts are between all codepoints contained within the element, including any 680 | whitespace and control characters. Simply use a delta of 0 (zero) for 681 | invisible codepoints.
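
The cut encoding described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper names `parse_title` and `parse_cuts` and the sample `title` string are hypothetical. The sketch assumes, as the worked `cuts(...)` example implies, that the first delta of each path is added to the starting x position of the previous path, and that every path is closed at the top of the bounding box.

```python
def parse_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:  # skip empty chunks, e.g. after a trailing semicolon
            props[parts[0]] = parts[1:]
    return props


def parse_cuts(values, height):
    """Turn cut tokens such as ["10", "11,30,-3"] into lists of (x, y) points.

    Points are relative to the origin of the element's bounding box.
    """
    paths = []
    start_x = 0  # assumption: offsets accumulate from each path's starting x
    for token in values:
        deltas = [int(d) for d in token.split(",")]
        start_x += deltas[0]
        x, y = start_x, 0
        path = [(x, y)]
        # Remaining deltas alternate between "up" (y) and "right" (x) moves.
        for i, delta in enumerate(deltas[1:]):
            if i % 2 == 0:
                y += delta
            else:
                x += delta
            path.append((x, y))
        path.append((x, height))  # close the path at the top of the box
        paths.append(path)
    return paths


# Hypothetical title string, made up for this illustration.
props = parse_title("bbox 0 0 300 100; cuts 10,50,3 11,30,-3")
x0, y0, x1, y1 = (int(v) for v in props["bbox"])
paths = parse_cuts(props["cuts"], y1 - y0)
print(paths)
# [[(10, 0), (10, 50), (13, 50), (13, 100)], [(21, 0), (21, 30), (18, 30), (18, 100)]]
```

A token without commas, such as `7`, yields a straight vertical cut, and a delta of 0 repeats the previous x position, which matches the rule for invisible codepoints.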
682 | 683 | Writing directions other than left-to-right specify cuts as if the bounding box 684 | for the element had been rotated by a multiple of 90 degrees such that the 685 | writing direction is left to right, then rotated back. 686 | 687 | It is undefined what happens when cut paths intersect, with the exception that 688 | a delta of 0 always corresponds to an invisible codepoint. 689 | 690 | 691 | ## 8 OCR Engine-Specific Markup 692 | 693 | A few abstractions are used as intermediate abstractions in OCR engines, 694 | although they do not have a meaning that can be defined either in terms of 695 | typesetting or logical function. Representing them may be useful to represent 696 | existing OCR output, say for workflow abstractions. 697 | 698 | Common suggested engine-specific markup are: 699 | 700 | ### Classes for engine specific markup 701 | 702 | #### `ocrx_block` 703 | 704 | * `ocrx_block` 705 | * any kind of "block" returned by an OCR system 706 | * engine-specific because the definition of a "block" depends on the engine 707 | 708 | #### `ocrx_line` 709 | 710 | * `ocrx_line` 711 | * any kind of "line" returned by an OCR system that differs from the standard ocr_line above 712 | * might be some kind of "logical" line 713 | 714 | #### `ocrx_word` 715 | 716 | * `ocrx_word` 717 | * any kind of "word" returned by an OCR system 718 | * engine specific because the definition of a "word" depends on the engine 719 | 720 | The meaning of these tags is OCR engine specific. 
However, generators should 721 | attempt to ensure the following properties: 722 | 723 | * an `ocrx_block` should not contain content from multiple ocr_careas 724 | * the union of all `ocrx_blocks` should approximately cover all `ocr_careas` 725 | * an `ocrx_block` should contain either a float or body text, but not both 726 | * an `ocrx_block` should contain either an image or text, but not both 727 | * an `ocrx_line` should correspond as closely as possible to an `ocr_line` 728 | * `ocrx_cinfo` should nest inside `ocrx_line` 729 | * `ocrx_cinfo` should contain only `x_conf`, `x_bboxes`, and `cuts` attributes 730 | 731 | ### Properties for engine-specific markup 732 | 733 | The following properties are defined: 734 | 735 | #### `x_font` 736 | 737 | * `x_font s` 738 | * OCR-engine specific font names 739 | 740 | #### `x_fsize` 741 | 742 | * `x_fsize n` 743 | * OCR-engine specific font size 744 | 745 | #### `x_boxes` 746 | 747 | * `x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 b2y0 b2x1 b2y1 ...` 748 | * OCR-engine specific boxes associated with each codepoint contained in the 749 | element 750 | * note that the bbox property is a property for the bounding box of a layout 751 | element, not of individual characters 752 | * in particular, use ``, not 753 | `` 754 | 755 | #### `x_confs` 756 | 757 | * `x_confs c1 c2 c3 ...` 758 | * OCR-engine specific character confidences 759 | * `c1` etc. 
must be numbers 760 | * higher values should express higher confidences 761 | * if possible, convert character confidences to values between 0 and 100 and 762 | have them approximate posterior probabilities (expressed in %) 763 | 764 | #### `x_wconf` 765 | 766 | * `x_wconf n` 767 | * OCR-engine specific confidence for the entire contained substring 768 | * n must be a number 769 | * higher values should express higher confidences 770 | * if possible, convert word confidences to values between 0 and 100 and have 771 | them approximate posterior probabilities (expressed in %) 772 | 773 | 774 | ## 9 Font, Text Color, Language, Direction 775 | 776 | OCR-generated font and text color information is encoded using standard HTML 777 | and CSS attributes on elements with a class of `ocr_...` or `ocrx_...`. 778 | Language and writing direction should be indicated using the HTML standard 779 | attributes `lang=` and `dir=`, or alternatively can be indicated as properties on 780 | elements. 781 | 782 | OCR information and presentation information can be separated by putting the 783 | CSS info related to the CSS in an outer element with an `ocr_` or `ocrx_` class, 784 | and then overriding it for the presentation by nesting another `` with the 785 | actual presentation information inside that: 786 | 787 | ``` 788 | ... 789 | ``` 790 | 791 | The CSS3 text layout attributes can be used when necessary. For example, CSS 792 | supports writing-mode, direction, glyph-orientation [ISO-15924-based 793 | script](http://www.unicode.org/iso15924/codelists.html), text-indent, etc. 794 | 795 | 796 | ## 10 Alternative Segmentations / Readings 797 | 798 | Alternative segmentations and readings are indicated by a `` with 799 | `class="alternatives"`. It must contains `` and `` elements. The first 800 | contained element should be `` and represent the most probable interpretation, 801 | the subsequent ones ``. 
Each `<ins>` and `<del>` element should have `class="alt"` and a
property of either `nlp` or `x_cost`. These `<span>`, `<ins>`, and `<del>` tags can nest
arbitrarily.

Example:

```html
<span class="alternatives">
  <ins class="alt" title="nlp ...">hello</ins>
  <del class="alt" title="nlp ...">hallo</del>
</span>
```

Whitespace within the `<span>` but outside the contained `<ins>`/`<del>`
elements is ignored and should be inserted to improve readability of the HTML
when viewed in a browser.


## 11 Grouped Elements and Multiple Hierarchies

The different levels of layout information (logical, physical, engine-specific)
each form hierarchies, but those hierarchies may not be mutually compatible;
for example, a single `ocr_page` may contain information from multiple sections
or chapters. To represent both hierarchies within a single document, elements
may be grouped together. That is, two elements with the same class may be
treated as one element by adding a "groupid identifier" property to them and
using the same identifier.

Grouped elements should be logically consistent with the markup they represent;
for example, it is probably not sensible to use grouped elements to interleave
parts of two different chapters. Therefore, grouped elements should usually be
adjacent in the markup.

Applications using hOCR may choose to manipulate grouped elements directly, but
the simplest way of dealing with them is to transform a document with grouped
elements into one without grouped elements prior to further processing by first
removing tags that are not of interest for the subsequent processing step, and
then collapsing grouped elements into single elements.
For example, output
that contains both logical and physical layout information, where the logical
layout information uses grouped elements, can be transformed by removing all
the physical layout information, and then collapsing all split `ocr_chapter`
elements into single `ocr_chapter` elements based on the groupid. The result is
a simple DOM tree. This transformation can be provided generically as a
pre-processor or in JavaScript.

The presence of grouped elements does not need to be indicated in the header;
when it affects their operations, hOCR processors should check for the presence
of grouped elements in the output and fail with an error message if they cannot
correctly process the hOCR information.


## 12 Capabilities

Any program generating files in this output format must indicate in the
document metadata what kind of markup it is capable of generating. This
includes listing the exact set of markup sections that the system could have
generated, even if it did not actually generate them for the particular
document.

The capability to generate specific properties is given by the prefix `ocrp_...`;
the important properties are:

### `ocrp_lang`

* `ocrp_lang` – capable of generating `lang=` attributes

### `ocrp_dir`

* `ocrp_dir` – capable of generating `dir=` attributes

### `ocrp_poly`

* `ocrp_poly` – capable of generating [polygonal bounds](#poly)

### `ocrp_font`

* `ocrp_font` – capable of generating font information (standard font information)

### `ocrp_nlp`

* `ocrp_nlp` – capable of generating [nlp confidences](#nlp)

### `ocr_embeddedformat_`

The capability to generate other specific embedded formats is given by the
prefix `ocr_embeddedformat_`.
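
A consumer can sort declared capabilities by the two prefixes described so far. The following is a minimal sketch, assuming the capabilities arrive as one whitespace-separated string; the function name `classify_capabilities` and the catch-all `other` bucket are illustrative and not defined by hOCR.

```python
def classify_capabilities(content):
    """Group capability tokens by the prefixes named in this section."""
    groups = {"properties": [], "embedded_formats": [], "other": []}
    for token in content.split():
        if token.startswith("ocr_embeddedformat_"):
            # Capability to generate a specific embedded format.
            groups["embedded_formats"].append(token)
        elif token.startswith("ocrp_"):
            # Capability to generate a specific property, e.g. ocrp_lang.
            groups["properties"].append(token)
        else:
            # Assumption: anything else (such as element class names) is
            # kept in a single bucket without further interpretation.
            groups["other"].append(token)
    return groups


caps = classify_capabilities("ocr_page ocr_line ocrx_word ocrp_lang ocrp_font")
print(caps["properties"])  # ['ocrp_lang', 'ocrp_font']
```

A processor that needs, say, `lang=` attributes could check for `ocrp_lang` in `caps["properties"]` before starting work.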

### `ocr_<tag>_unordered`

If an OCR engine represents a particular tag but cannot determine reading order
for that tag, it must specify a capability of `ocr_<tag>_unordered`.

If a document lists a certain capability but no element or attribute is found
that corresponds to that capability, users of the document may infer that the
content is absent in the source document. If a capability is not listed, the
corresponding element or attribute must not be present in the document.


## 13 Profiles

hOCR provides standard means of marking up information, but it does not mandate
the presence or absence of particular kinds of information. For example, an
hOCR file may contain only logical markup, only physical markup, or only
engine-specific markup. As a result, merely knowing that OCR output is
hOCR-compliant doesn't tell us whether that file is actually useful for
subsequent processing.

OCR systems can use hOCR in various different ways internally, but we will
eventually define some common profiles that mandate what kinds of information
need to be present in particular kinds of output.

Of particular importance are:

* physical layout profile: OCR output in XHTML format with a defined set of
  common physical layout markup capabilities (page, carea, floats, line).
  Logical layout may be present as well, but the document tree structure must
  represent the physical layout structure, with logical layout elements split
  and grouped as needed.

* logical layout profile: OCR output in XHTML format with a defined set of
  common logical layout markup capabilities (linear, chapter, section,
  subsection).
  Physical layout may be present as well, but the document tree
  structure must represent the logical layout structure, with physical layout
  elements split and grouped as needed.

Other possible profiles might be defined for specific engines or specific document classes:

* common commercial OCR output (e.g., Abbyy)
  * ocr_page
  * ocrx_block, ocrx_line, ocrx_word
  * ocrp_lang
  * ocrp_font
* book target
  * all logical structuring elements (as applicable), except ocr_linear
  * ocr_page
* newspaper target
  * all logical structuring elements (as applicable)
  * articles map onto ocr_linear
  * ocr_page


## 14 Required Meta Information

The OCR system is required to indicate the following using meta tags in the header:

* `<meta name="ocr-system" content="...">`
* `<meta name="ocr-capabilities" content="...">`
  * see the capabilities defined above

The OCR system should indicate the following information:

* `<meta name="ocr-number-of-pages" content="...">`
* `<meta name="ocr-langs" content="...">`
  * use [ISO 639-1](https://www.loc.gov/standards/iso639-2/php/code_list.php) codes
  * value may be `unknown`
* `<meta name="ocr-scripts" content="...">`
  * use [ISO 15924](http://www.unicode.org/iso15924/codelists.html) letter codes
  * value may be `unknown`


## 15 HTML Markup

The HTML-based markup is orthogonal to the hOCR-based markup; that is, both can
be chosen independently of one another. The only thing that needs to be
consistent between the two markups is the text contained within the tags. hOCR
and other embedded format tags can be put on HTML tags, or they can be put on
their own `<div>`/`<span>` tags.

There are many different choices possible and reasonable for the HTML markup,
depending on the use and further processing of the document. Each such choice
must be indicated in the metadata for the document.

Many mappings derived from existing tools are quite similar, and most follow
the restrictions and recommendations below already without further
modifications.

Depending on the particular HTML markup used in the document, the document is
suitable for different kinds of processing and use. The formats have the
following intents:

### `html_none`

* `html_none`: straightforward equivalent of Goodoc or [XDOC](http://www.vividata.com/manuals/core12xdc.pdf)

### `html_ocr`

* `html_ocr`: straightforward recording of commercial OCR system output

### `html_absolute`

* `html_absolute`: target format for services like Google's View as HTML

### `html_xytable`

* `html_xytable`: target format for layout-preserving on-screen document viewing

### `html_simple`

* `html_simple`: target format for convenient on-line viewing and intermediate format for indexing

As long as a format contains the hOCR information, it can be reprocessed by
layout analysis software and converted into one of the other formats. In
particular, we envision layout analysis tools for converting any hOCR document
into `html_absolute`, `html_xytable`, and `html_simple`. Furthermore,
internally, a layout analysis system might use `html_xytable` as an
intermediate format for converting hOCR into `html_simple`.
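
Because the hOCR markup is orthogonal to the HTML markup, a reprocessing tool can recover it the same way regardless of which HTML tags carry it. The following is a minimal sketch using Python's standard `html.parser`, assuming properties are stored in the `title` attribute as semicolon-separated `name value ...` entries; the names `HocrCollector`, `parse_title`, and `collect_hocr` are illustrative only, and values that themselves contain semicolons are not handled.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split 'name v1 v2; other v3' into {'name': ['v1', 'v2'], 'other': ['v3']}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


class HocrCollector(HTMLParser):
    """Record (tag, hOCR class, properties) for each hOCR-annotated element."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Only classes with an ocr_ or ocrx_ prefix are treated as hOCR markup.
        hocr = [c for c in classes if c.startswith(("ocr_", "ocrx_"))]
        if hocr:
            props = parse_title(attr.get("title") or "")
            self.elements.append((tag, hocr[0], props))


def collect_hocr(html):
    collector = HocrCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


on_html_tag = collect_hocr('<p class="ocr_par" title="bbox 10 20 300 60">Hello</p>')
on_own_span = collect_hocr('<p><span class="ocr_par" title="bbox 10 20 300 60">Hello</span></p>')
print(on_html_tag)  # [('p', 'ocr_par', {'bbox': ['10', '20', '300', '60']})]
```

Both inputs yield the same hOCR class and properties; only the carrying HTML tag (`p` versus `span`) differs, which is the part a layout converter is free to change.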


### 15.1 Restrictions on HTML Content

To avoid problems, any use of HTML markup must follow these rules:

* HTML content must not use class names that conflict with any of those defined in this document (`ocr_*`)
* HTML content must not use the `title=` attribute on any element with an `ocr_*` class for any purposes other than encoding OCR-related properties as described in this document


### 15.2 Recommendations for Mappings

When possible, any mapping of logical structure onto HTML should try to follow these rules:

* the mapping should be "natural" -- similar to what an author of the document
  might have entered into a WYSIWYG content creation tool
* text should be in reading order
* all tags should be used for the intended purpose (and only for the intended
  purpose) as defined in the [HTML 4 spec](https://www.w3.org/TR/html4/).
* floats are contained in `<div>` elements with a `style` that includes a float attribute
* repeating floating page elements (header/footer) should be repeated and occur
  in their natural location in reading order (e.g., between pages)
* embedded images and SVG should be contained in files in the same directory
  (no `/` in the URL) and embedded with `<img>` and `<embed>` tags, respectively

Specifically:

* `<em>` and `<strong>` should represent emphasis, and are preferred to `<b>`, `<i>`, and `<u>`
* `<b>`, `<i>`, and `<u>` should represent a change in the corresponding
  attribute for the current font (but an OCR font specification must still be
  given)
* `<p>` should represent paragraph breaks
* `<br>` should represent explicit line breaks (not line breaks that happen because of text flow)
* `<h1>`, ..., `<h6>` should represent the logical nesting structure (if any) of the document
* `<a>` should represent hyperlinks and references within the document
* `<blockquote>` should represent indented quotations, but not other uses of indented text.
* `