├── README.md
├── v1
    ├── alto-1-0.xsd
    ├── alto-1-1.xsd
    ├── alto-1-2.xsd
    ├── alto-1-3.xsd
    └── alto-1-4.xsd
├── v2
    ├── alto-2-0.xsd
    ├── alto-2-1-draft-tagsample.xml
    ├── alto-2-1.xsd
    └── alto-2-2-draft.xsd
├── v3
    ├── ALTO-language support discussion so far-20150601.pdf
    ├── Comparison of text direction elements.pdf
    ├── alto-3-0.xsd
    ├── alto-3-1.xsd
    ├── alto-3-2-draft.xsd
    └── discussion of ALTO language support.pdf
└── v4
    ├── alto-4-0.xsd
    ├── alto-4-1.xsd
    ├── alto-4-2.xsd
    ├── alto-4-3.xsd
    └── alto-4-4.xsd


/README.md:
--------------------------------------------------------------------------------
1 | ## [ALTO XML schema](https://github.com/altoxml/schema/wiki)
2 | This repository contains ALTO schema versions - drafts and final released ones.
3 |
4 | All open issues and discussions about changes to the ALTO standard can be found and tracked in the [issues](https://github.com/altoxml/schema/issues) repository
5 |
6 | Latest official schema version is 4.4.
7 | Primary source for the schema is (http://www.loc.gov/standards/alto/v4/alto-4-4.xsd)
8 | Alternate source for the schema is (https://cdn.rawgit.com/altoxml/schema/master/v4/alto-4-4.xsd)
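
As an illustration of how an ALTO document might point at the primary schema source listed above, here is a minimal sketch. The namespace URI `http://www.loc.gov/standards/alto/ns-v4#` and the `ID` and `PHYSICAL_IMG_NR` values are assumptions that do not appear in this README, and the skeleton has not been validated here; check it against `alto-4-4.xsd` before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed ALTO v4 namespace; schemaLocation pairs it with the primary 4.4 schema URL -->
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v4# http://www.loc.gov/standards/alto/v4/alto-4-4.xsd">
  <Layout>
    <!-- Hypothetical page; the attribute values are placeholders -->
    <Page ID="page_1" PHYSICAL_IMG_NR="1"/>
  </Layout>
</alto>
```

Swapping the second URL in `xsi:schemaLocation` for the alternate source should work the same way, since the namespace URI stays unchanged.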
9 |
10 | Summary of proposed changes
11 |
12 | * Change schema version to 4.4
13 | * Add LANG attribute on PageType level to describe the default language used in the document
14 | * Add ROTATION attribute on PageType level to describe the default rotation used in the document
15 | * Add OTHERLANGS attribute on PageType to summarize all the languages present in a particular document
16 | * Adapt "PointsType" documentation
17 | * Adapt xLink attribute group documentation on "BlockType"
18 |
19 | Details about the changes of the version and further documentation can be found in the ALTO
20 | [documentation](https://github.com/altoxml/documentation/wiki) repository.
21 |
22 |

--------------------------------------------------------------------------------
/v1/alto-1-0.xsd:
--------------------------------------------------------------------------------
1 | 2 | 3 | 4 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | alto (analyzed layout and text object) stores layout information and OCR recognized text of books and journals. 16 | 17 | 18 | 19 | 20 | 21 | Styles define properties of layout elements. A style defined in a parent element is used as the default for all its children. 22 | 23 | 24 | 25 | 26 | 27 | A text style defines font properties of text. 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | A paragraph style defines formatting properties of text blocks. 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | The root layout element. 69 | 70 | 71 | 72 | 73 | 74 | One page of a book or journal. 75 | 76 | 77 | 78 | 79 | 80 | The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 81 | 82 | 83 | 84 | 85 | That margin of a page adjacent to the binding edge of a book. 86 | 87 | 88 | 89 | 90 | The space between the text and the outer extremity of the leaf of a book. May contain margin notes. 
91 | 92 | 93 | 94 | 95 | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. 96 | 97 | 98 | 99 | 100 | Rectangle surrounding the printed area of a page. Page number and running title are not part of the print space. 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | Group of available block types 132 | 133 | 134 | 135 | 136 | A block of text. 137 | 138 | 139 | 140 | 141 | A picture or image. 142 | 143 | 144 | 145 | 146 | A graphic used to separate blocks. Usually a line or rectangle. 147 | 148 | 149 | 150 | 151 | A block that consists of other blocks 152 | 153 | 154 | 155 | 156 | 157 | 158 | Base type for any kind of block on the page. 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | Tells the rotation of the block e.g. text or illustration. The value is in degree counterclockwise. 172 | 173 | 174 | 175 | 176 | The reading sequence of blocks on the page. 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | Type of the substitution (if any). May be something like hyphenation, or OCR correction 199 | 200 | 201 | 202 | 203 | Content of the substitution. Something like the corrected OCR text or the unhyphenated word 204 | 205 | 206 | 207 | 208 | Word Confidence: Confidence level of the OCR for this string. A value between 0 and 9 209 | 210 | 211 | 212 | 213 | Confidence level of each character in that string. 
A list of numbers, one number between 0 and 9 for each character 214 | 215 | 216 | 217 | 218 | 219 | A region on a page 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | A list of points 234 | 235 | 236 | 237 | 238 | 239 | Describes the bounding shape of a block, if it is not rectangular. 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | A polygon shape. 250 | 251 | 252 | 253 | 254 | 255 | An ellipse shape. 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | A circle shape. 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | A block that consists of other blocks 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | A user defined string to identify the type of composed block (e.g. table, advertisement, ...) 282 | 283 | 284 | 285 | 286 | A link to an image which contains only the composed block. 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | A picture or image. 295 | 296 | 297 | 298 | 299 | 300 | A user defined string to identify the type of illustration like photo, map, drawing, chart, ... 301 | 302 | 303 | 304 | 305 | A link to an image which contains only the illustration. 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | A graphic used to separate blocks. Usually a line or rectangle. 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | A block of text. 322 | 323 | 324 | 325 | 326 | 327 | 328 | A single line of text. 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | A white space. 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | A hyphenation char. Can appear only at the end of a line. 
349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | -------------------------------------------------------------------------------- /v1/alto-1-1.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 7 | 8 | 9 | 10 | 11 | 12 | 22 | 26 | 31 | 32 | 33 | 34 | 35 | ALTO (analyzed layout and text object) stores layout information and 36 | OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. 37 | ALTO is a standardized XML format to store layout and content information. 38 | It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), 39 | where METS provides metadata and structural information while ALTO contains content and physical information. 40 | 41 | 42 | 43 | 44 | 45 | 46 | Describes general settings of the alto file like measurement units and metadata 47 | 48 | 49 | 50 | 51 | 52 | All measurement values inside the alto file except fontsize are related to this unit. The default is 1/10 of mm 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. 79 | 80 | 81 | 82 | 83 | 84 | A text style defines font properties of text. 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | A paragraph style defines formatting properties of text blocks. 94 | 95 | 96 | 97 | 98 | 99 | Indicates the alignment of the paragraph. Could be left, right, center or justify. 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | Left indent of the paragraph in relation to the column. 113 | 114 | 115 | 116 | 117 | Right indent of the paragraph in relation to the column. 
118 | 119 | 120 | 121 | 122 | Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. 123 | 124 | 125 | 126 | 127 | Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | The root layout element. 138 | 139 | 140 | 141 | 142 | 143 | One page of a book or journal. 144 | 145 | 146 | 147 | 148 | 149 | The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 150 | 151 | 152 | 153 | 154 | The area between the printspace and the left border of a page. May contain margin notes. 155 | 156 | 157 | 158 | 159 | The area between the printspace and the right border of a page. May contain margin notes. 160 | 161 | 162 | 163 | 164 | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. 165 | 166 | 167 | 168 | 169 | Rectangle covering the printed area of a page. Page number and running title are not part of the print space. 170 | 171 | 172 | 173 | 174 | 175 | 176 | Any user-defined class like title page. 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | The number of the page within the document. 185 | 186 | 187 | 188 | 189 | The page number that is printed on the page. 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | Position of the page. Could be lefthanded, righthanded, foldout or single if it has no special position. 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | A link to the processing description that has been used for this page. 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | Group of available block types 231 | 232 | 233 | 234 | 235 | A block of text. 
236 | 237 | 238 | 239 | 240 | A picture or image. 241 | 242 | 243 | 244 | 245 | A graphic used to separate blocks. Usually a line or rectangle. 246 | 247 | 248 | 249 | 250 | A block that consists of other blocks 251 | 252 | 253 | 254 | 255 | 256 | 257 | Base type for any kind of block on the page. 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | Tells the rotation of the block e.g. text or illustration. The value is in degree counterclockwise. 271 | 272 | 273 | 274 | 275 | The next block in reading sequence on the page. 276 | 277 | 278 | 279 | 280 | 281 | 282 | A sequence of chars. Strings are separated by white spaces or hyphenation chars. 283 | 284 | 285 | 286 | 287 | Any alternative for the word. 288 | 289 | 290 | 291 | 292 | 293 | 294 | Identifies the purpose of the alternative. 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | Type of the substitution (if any). 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | Content of the substitution. 331 | 332 | 333 | 334 | 335 | Word Confidence: Confidence level of the OCR for this string. A value between 0 and 1. 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | Confidence level of each character in that string. A list of numbers, one number between 0 and 9 for each character. 347 | 348 | 349 | 350 | 351 | 352 | A region on a page 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | A list of points 367 | 368 | 369 | 370 | 371 | 372 | Describes the bounding shape of a block, if it is not rectangular. 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | A polygon shape. 383 | 384 | 385 | 386 | 387 | 388 | An ellipse shape. 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | A circle shape. 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | Formatting attributes. 
Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. 406 | 407 | 408 | 409 | The font name. 410 | 411 | 412 | 413 | 414 | 415 | 416 | The font size, in points (1/72 of an inch). 417 | 418 | 419 | 420 | 421 | Font color as RGB value 422 | 423 | 424 | 425 | 426 | 427 | 428 | Serif or Sans-Serif 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | fixed or proportional 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | Information to identify the image file from which the OCR text was created. 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | A unique identifier for the image file. This is drawn from MIX. 456 | This identifier must be unique within the local system. To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. 457 | 458 | 459 | 460 | 461 | 462 | A location qualifier, i.e., a namespace. 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. 471 | Where possible, this draws from MIX's change history. 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | A processing step. 482 | 483 | 484 | 485 | 486 | Date or DateTime the image was processed. 487 | 488 | 489 | 490 | 491 | Identifies the organization-level producer(s) of the processed image. 492 | 493 | 494 | 495 | 496 | An ordinal listing of the image processing steps performed. For example, "image despeckling." 497 | 498 | 499 | 500 | 501 | A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | Information about a software application. 
Where applicable, the preferred method for determining this information is by selecting Help --> About. 510 | 511 | 512 | 513 | 514 | The name of the organization or company that created the application. 515 | 516 | 517 | 518 | 519 | The name of the application. 520 | 521 | 522 | 523 | 524 | The version of the application. 525 | 526 | 527 | 528 | 529 | A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | List of any combination of font styles 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 | 561 | A block that consists of other blocks 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | 570 | A user defined string to identify the type of composed block (e.g. table, advertisement, ...) 571 | 572 | 573 | 574 | 575 | An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | A picture or image. 584 | 585 | 586 | 587 | 588 | 589 | A user defined string to identify the type of illustration like photo, map, drawing, chart, ... 590 | 591 | 592 | 593 | 594 | A link to an image which contains only the illustration. 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | A graphic used to separate blocks. Usually a line or rectangle. 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | A block of text. 611 | 612 | 613 | 614 | 615 | 616 | 617 | A single line of text. 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | A white space. 626 | 627 | 628 | 629 | 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | A hyphenation char. Can appear only at the end of a line. 
638 | 639 | 640 | 641 | 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | -------------------------------------------------------------------------------- /v1/alto-1-2.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 7 | 8 | 9 | 10 | 11 | 12 | 22 | 26 | 31 | 35 | 38 | 39 | 40 | 41 | 42 | ALTO (analyzed layout and text object) stores layout information and 43 | OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. 44 | ALTO is a standardized XML format to store layout and content information. 45 | It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), 46 | where METS provides metadata and structural information while ALTO contains content and physical information. 47 | 48 | 49 | 50 | 51 | 52 | 53 | Describes general settings of the alto file like measurement units and metadata 54 | 55 | 56 | 57 | 58 | 59 | All measurement values inside the alto file except fontsize are related to this unit. The default is 1/10 of mm 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. 85 | 86 | 87 | 88 | 89 | 90 | A text style defines font properties of text. 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | A paragraph style defines formatting properties of text blocks. 100 | 101 | 102 | 103 | 104 | 105 | Indicates the alignment of the paragraph. Could be left, right, center or justify. 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | Left indent of the paragraph in relation to the column. 119 | 120 | 121 | 122 | 123 | Right indent of the paragraph in relation to the column. 
124 | 125 | 126 | 127 | 128 | Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. 129 | 130 | 131 | 132 | 133 | Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | The root layout element. 144 | 145 | 146 | 147 | 148 | 149 | One page of a book or journal. 150 | 151 | 152 | 153 | 154 | 155 | The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 156 | 157 | 158 | 159 | 160 | The area between the printspace and the left border of a page. May contain margin notes. 161 | 162 | 163 | 164 | 165 | The area between the printspace and the right border of a page. May contain margin notes. 166 | 167 | 168 | 169 | 170 | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. 171 | 172 | 173 | 174 | 175 | Rectangle covering the printed area of a page. Page number and running title are not part of the print space. 176 | 177 | 178 | 179 | 180 | 181 | 182 | Any user-defined class like title page. 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | The number of the page within the document. 191 | 192 | 193 | 194 | 195 | The page number that is printed on the page. 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | Position of the page. Could be lefthanded, righthanded, foldout or single if it has no special position. 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | A link to the processing description that has been used for this page. 228 | 229 | 230 | 231 | 232 | Estimated percentage of OCR Accuracy 233 | 234 | 235 | 236 | 237 | Page Confidence: Confidence level of the ocr for this page. 
A value between 0 and 1. 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | Group of available block types 258 | 259 | 260 | 261 | 262 | A block of text. 263 | 264 | 265 | 266 | 267 | A picture or image. 268 | 269 | 270 | 271 | 272 | A graphic used to separate blocks. Usually a line or rectangle. 273 | 274 | 275 | 276 | 277 | A block that consists of other blocks 278 | 279 | 280 | 281 | 282 | 283 | 284 | Base type for any kind of block on the page. 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | Tells the rotation of the block e.g. text or illustration. The value is in degree counterclockwise. 298 | 299 | 300 | 301 | 302 | The next block in reading sequence on the page. 303 | 304 | 305 | 306 | 307 | 308 | 309 | A sequence of chars. Strings are separated by white spaces or hyphenation chars. 310 | 311 | 312 | 313 | 314 | Any alternative for the word. 315 | 316 | 317 | 318 | 319 | 320 | 321 | Identifies the purpose of the alternative. 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | Type of the substitution (if any). 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | Content of the substitution. 358 | 359 | 360 | 361 | 362 | Word Confidence: Confidence level of the OCR for this string. A value between 0 and 1. 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | Confidence level of each character in that string. A list of numbers, one number between 0 and 9 for each character. 374 | 375 | 376 | 377 | 378 | 379 | A region on a page 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | A list of points 394 | 395 | 396 | 397 | 398 | 399 | Describes the bounding shape of a block, if it is not rectangular. 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | A polygon shape. 
410 | 411 | 412 | 413 | 414 | 415 | An ellipse shape. 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | A circle shape. 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. 433 | 434 | 435 | 436 | The font name. 437 | 438 | 439 | 440 | 441 | 442 | 443 | The font size, in points (1/72 of an inch). 444 | 445 | 446 | 447 | 448 | Font color as RGB value 449 | 450 | 451 | 452 | 453 | 454 | 455 | Serif or Sans-Serif 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | fixed or proportional 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | Information to identify the image file from which the OCR text was created. 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | A unique identifier for the image file. This is drawn from MIX. 483 | This identifier must be unique within the local system. To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. 484 | 485 | 486 | 487 | 488 | 489 | A location qualifier, i.e., a namespace. 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. 498 | Where possible, this draws from MIX's change history. 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | A processing step. 509 | 510 | 511 | 512 | 513 | Date or DateTime the image was processed. 514 | 515 | 516 | 517 | 518 | Identifies the organization-level producer(s) of the processed image. 519 | 520 | 521 | 522 | 523 | An ordinal listing of the image processing steps performed. For example, "image despeckling." 524 | 525 | 526 | 527 | 528 | A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. 
Ideally, this description should be adequate so that someone else using the same application can produce identical results. 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help --> About. 537 | 538 | 539 | 540 | 541 | The name of the organization or company that created the application. 542 | 543 | 544 | 545 | 546 | The name of the application. 547 | 548 | 549 | 550 | 551 | The version of the application. 552 | 553 | 554 | 555 | 556 | A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. 557 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | List of any combination of font styles 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | A block that consists of other blocks 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | A user defined string to identify the type of composed block (e.g. table, advertisement, ...) 598 | 599 | 600 | 601 | 602 | An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | A picture or image. 611 | 612 | 613 | 614 | 615 | 616 | A user defined string to identify the type of illustration like photo, map, drawing, chart, ... 617 | 618 | 619 | 620 | 621 | A link to an image which contains only the illustration. 622 | 623 | 624 | 625 | 626 | 627 | 628 | 629 | A graphic used to separate blocks. Usually a line or rectangle. 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | A block of text. 638 | 639 | 640 | 641 | 642 | 643 | 644 | A single line of text. 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | A white space. 
653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | A hyphenation char. Can appear only at the end of a line. 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | Correction Status. Indicates whether manual correction has been done or not. 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | -------------------------------------------------------------------------------- /v1/alto-1-3.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 7 | 8 | 9 | 10 | 11 | 12 | 22 | 26 | 31 | 35 | 38 | 39 | 40 | 41 | 42 | 43 | ALTO (analyzed layout and text object) stores layout information and 44 | OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. 45 | ALTO is a standardized XML format to store layout and content information. 46 | It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), 47 | where METS provides metadata and structural information while ALTO contains content and physical information. 48 | 49 | 50 | 51 | 52 | 53 | 54 | Describes general settings of the alto file like measurement units and metadata 55 | 56 | 57 | 58 | 59 | 60 | All measurement values inside the alto file except fontsize are related to this unit. The default is 1/10 of mm 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. 86 | 87 | 88 | 89 | 90 | 91 | A text style defines font properties of text. 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | A paragraph style defines formatting properties of text blocks. 101 | 102 | 103 | 104 | 105 | 106 | Indicates the alignment of the paragraph. Could be left, right, center or justify. 
107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | Left indent of the paragraph in relation to the column. 120 | 121 | 122 | 123 | 124 | Right indent of the paragraph in relation to the column. 125 | 126 | 127 | 128 | 129 | Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. 130 | 131 | 132 | 133 | 134 | Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | The root layout element. 145 | 146 | 147 | 148 | 149 | 150 | One page of a book or journal. 151 | 152 | 153 | 154 | 155 | 156 | The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 157 | 158 | 159 | 160 | 161 | The area between the printspace and the left border of a page. May contain margin notes. 162 | 163 | 164 | 165 | 166 | The area between the printspace and the right border of a page. May contain margin notes. 167 | 168 | 169 | 170 | 171 | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. 172 | 173 | 174 | 175 | 176 | Rectangle covering the printed area of a page. Page number and running title are not part of the print space. 177 | 178 | 179 | 180 | 181 | 182 | 183 | Any user-defined class like title page. 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | The number of the page within the document. 192 | 193 | 194 | 195 | 196 | The page number that is printed on the page. 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | Position of the page. Could be lefthanded, righthanded, foldout or single if it has no special position. 
215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | A link to the processing description that has been used for this page. 228 | 229 | 230 | 231 | 232 | Estimated percentage of OCR Accuracy 233 | 234 | 235 | 236 | 237 | Page Confidence: Confidence level of the OCR for this page. A value between 0 and 1. 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | Group of available block types 258 | 259 | 260 | 261 | 262 | A block of text. 263 | 264 | 265 | 266 | 267 | A picture or image. 268 | 269 | 270 | 271 | 272 | A graphic used to separate blocks. Usually a line or rectangle. 273 | 274 | 275 | 276 | 277 | A block that consists of other blocks 278 | 279 | 280 | 281 | 282 | 283 | 284 | Base type for any kind of block on the page. 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | Tells the rotation of the block e.g. text or illustration. The value is in degree counterclockwise. 298 | 299 | 300 | 301 | 302 | The next block in reading sequence on the page. 303 | 304 | 305 | 306 | 307 | 308 | 309 | A sequence of chars. Strings are separated by white spaces or hyphenation chars. 310 | 311 | 312 | 313 | 314 | Any alternative for the word. 315 | 316 | 317 | 318 | 319 | 320 | 321 | Identifies the purpose of the alternative. 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | Type of the substitution (if any). 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | Content of the substitution. 358 | 359 | 360 | 361 | 362 | Word Confidence: Confidence level of the OCR for this string. A value between 0 and 1. 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | Confidence level of each character in that string. A list of numbers, one number between 0 and 9 for each character. 
374 | 375 | 376 | 377 | 378 | 379 | A region on a page 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | A list of points 394 | 395 | 396 | 397 | 398 | 399 | Describes the bounding shape of a block, if it is not rectangular. 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | A polygon shape. 410 | 411 | 412 | 413 | 414 | 415 | An ellipse shape. 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | A circle shape. 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. 433 | 434 | 435 | 436 | The font name. 437 | 438 | 439 | 440 | 441 | 442 | 443 | The font size, in points (1/72 of an inch). 444 | 445 | 446 | 447 | 448 | Font color as RGB value 449 | 450 | 451 | 452 | 453 | 454 | 455 | Serif or Sans-Serif 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | fixed or proportional 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | Information to identify the image file from which the OCR text was created. 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | A unique identifier for the image file. This is drawn from MIX. 483 | This identifier must be unique within the local system. To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. 484 | 485 | 486 | 487 | 488 | 489 | A location qualifier, i.e., a namespace. 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. 498 | Where possible, this draws from MIX's change history. 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | A processing step. 509 | 510 | 511 | 512 | 513 | Date or DateTime the image was processed. 514 | 515 | 516 | 517 | 518 | Identifies the organization-level producer(s) of the processed image. 
519 | 520 | 521 | 522 | 523 | An ordinal listing of the image processing steps performed. For example, "image despeckling." 524 | 525 | 526 | 527 | 528 | A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help --> About. 537 | 538 | 539 | 540 | 541 | The name of the organization or company that created the application. 542 | 543 | 544 | 545 | 546 | The name of the application. 547 | 548 | 549 | 550 | 551 | The version of the application. 552 | 553 | 554 | 555 | 556 | A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. 557 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | List of any combination of font styles 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | A block that consists of other blocks 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | A user defined string to identify the type of composed block (e.g. table, advertisement, ...) 598 | 599 | 600 | 601 | 602 | An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | A picture or image. 611 | 612 | 613 | 614 | 615 | 616 | A user defined string to identify the type of illustration like photo, map, drawing, chart, ... 
617 | 618 | 619 | 620 | 621 | A link to an image which contains only the illustration. 622 | 623 | 624 | 625 | 626 | 627 | 628 | 629 | A graphic used to separate blocks. Usually a line or rectangle. 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | A block of text. 638 | 639 | 640 | 641 | 642 | 643 | 644 | A single line of text. 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | A white space. 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | A hyphenation char. Can appear only at the end of a line. 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | Correction Status. Indicates whether manual correction has been done or not. 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | -------------------------------------------------------------------------------- /v1/alto-1-4.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 6 | 8 | 10 | 13 | 14 | 15 | 16 | 17 | 18 | 28 | 32 | 37 | 41 | 44 | 49 | 50 | 51 | 52 | 53 | 54 | ALTO (analyzed layout and text object) stores layout information and 55 | OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. 56 | ALTO is a standardized XML format to store layout and content information. 57 | It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), 58 | where METS provides metadata and structural information while ALTO contains content and physical information. 59 | 60 | 61 | 62 | 63 | 64 | 65 | Describes general settings of the alto file like measurement units and metadata 66 | 67 | 68 | 69 | 70 | 71 | All measurement values inside the alto file except fontsize are related to this unit. The default is 1/10 of mm 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | Styles define properties of layout elements. 
A style defined in a parent element is used as default style for all related children elements. 97 | 98 | 99 | 100 | 101 | 102 | A text style defines font properties of text. 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | A paragraph style defines formatting properties of text blocks. 112 | 113 | 114 | 115 | 116 | 117 | Indicates the alignment of the paragraph. Could be left, right, center or justify. 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | Left indent of the paragraph in relation to the column. 131 | 132 | 133 | 134 | 135 | Right indent of the paragraph in relation to the column. 136 | 137 | 138 | 139 | 140 | Line spacing between two lines of the paragraph. Measurement calculated from baseline to baseline. 141 | 142 | 143 | 144 | 145 | Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | The root layout element. 156 | 157 | 158 | 159 | 160 | 161 | One page of a book or journal. 162 | 163 | 164 | 165 | 166 | 167 | The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 168 | 169 | 170 | 171 | 172 | The area between the printspace and the left border of a page. May contain margin notes. 173 | 174 | 175 | 176 | 177 | The area between the printspace and the right border of a page. May contain margin notes. 178 | 179 | 180 | 181 | 182 | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. 183 | 184 | 185 | 186 | 187 | Rectangle covering the printed area of a page. Page number and running title are not part of the print space. 188 | 189 | 190 | 191 | 192 | 193 | 194 | Any user-defined class like title page. 
195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | The number of the page within the document. 203 | 204 | 205 | 206 | 207 | The page number that is printed on the page. 208 | 209 | 210 | 211 | 212 | Gives brief information about original page quality 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information 229 | 230 | 231 | 232 | 233 | Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | A link to the processing description that has been used for this page. 248 | 249 | 250 | 251 | 252 | Estimated percentage of OCR Accuracy in range from 0 to 100 253 | 254 | 255 | 256 | 257 | Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | Group of available block types 278 | 279 | 280 | 281 | 282 | A block of text. 283 | 284 | 285 | 286 | 287 | A picture or image. 288 | 289 | 290 | 291 | 292 | A graphic used to separate blocks. Usually a line or rectangle. 293 | 294 | 295 | 296 | 297 | A block that consists of other blocks 298 | 299 | 300 | 301 | 302 | 303 | 304 | Base type for any kind of block on the page. 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | Tells the rotation of the block e.g. text or illustration. The value is in degree counterclockwise. 318 | 319 | 320 | 321 | 322 | The next block in reading sequence on the page. 323 | 324 | 325 | 326 | 327 | 328 | 329 | A sequence of chars. Strings are separated by white spaces or hyphenation chars. 330 | 331 | 332 | 333 | 334 | Any alternative for the word. 
335 | 336 | 337 | 338 | 339 | 340 | 341 | Identifies the purpose of the alternative. 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | Type of the substitution (if any). 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | Content of the substitution. 378 | 379 | 380 | 381 | 382 | Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. 394 | 395 | 396 | 397 | 398 | 399 | A region on a page 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | A list of points 414 | 415 | 416 | 417 | 418 | 419 | Describes the bounding shape of a block, if it is not rectangular. 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | A polygon shape. 430 | 431 | 432 | 433 | 434 | 435 | An ellipse shape. 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | A circle shape. 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. 453 | 454 | 455 | 456 | The font name. 457 | 458 | 459 | 460 | 461 | 462 | 463 | The font size, in points (1/72 of an inch). 464 | 465 | 466 | 467 | 468 | Font color as RGB value 469 | 470 | 471 | 472 | 473 | 474 | 475 | Serif or Sans-Serif 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | fixed or proportional 485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | Information to identify the image file from which the OCR text was created. 494 | 495 | 496 | 497 | 498 | 499 | 500 | 501 | 502 | A unique identifier for the image file. This is drawn from MIX. 503 | This identifier must be unique within the local system. 
To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. 504 | 505 | 506 | 507 | 508 | 509 | A location qualifier, i.e., a namespace. 510 | 511 | 512 | 513 | 514 | 515 | 516 | 517 | Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. 518 | Where possible, this draws from MIX's change history. 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | A processing step. 529 | 530 | 531 | 532 | 533 | Date or DateTime the image was processed. 534 | 535 | 536 | 537 | 538 | Identifies the organization-level producer(s) of the processed image. 539 | 540 | 541 | 542 | 543 | An ordinal listing of the image processing steps performed. For example, "image despeckling." 544 | 545 | 546 | 547 | 548 | A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help --> About. 557 | 558 | 559 | 560 | 561 | The name of the organization or company that created the application. 562 | 563 | 564 | 565 | 566 | The name of the application. 567 | 568 | 569 | 570 | 571 | The version of the application. 572 | 573 | 574 | 575 | 576 | A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. 
577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | List of any combination of font styles 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | A block that consists of other blocks 609 | 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | A user defined string to identify the type of composed block (e.g. table, advertisement, ...) 618 | 619 | 620 | 621 | 622 | An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. 623 | 624 | 625 | 626 | 627 | 628 | 629 | 630 | A picture or image. 631 | 632 | 633 | 634 | 635 | 636 | A user defined string to identify the type of illustration like photo, map, drawing, chart, ... 637 | 638 | 639 | 640 | 641 | A link to an image which contains only the illustration. 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | A graphic used to separate blocks. Usually a line or rectangle. 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | A block of text. 658 | 659 | 660 | 661 | 662 | 663 | 664 | A single line of text. 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | A white space. 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | 684 | A hyphenation char. Can appear only at the end of a line. 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 | 702 | 703 | Correction Status. Indicates whether manual correction has been done or not. 
704 | 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | -------------------------------------------------------------------------------- /v2/alto-2-0.xsd: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 20 | 24 | 29 | 33 | 36 | 41 | 48 | 51 | 52 | 53 | 54 | 55 | ALTO (analyzed layout and text object) stores layout information and 56 | OCR recognized text of pages of any kind of printed documents like books, journals and newspapers. 57 | ALTO is a standardized XML format to store layout and content information. 58 | It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard), 59 | where METS provides metadata and structural information while ALTO contains content and physical information. 60 | 61 | 62 | 63 | 64 | 65 | 66 | Describes general settings of the alto file like measurement units and metadata 67 | 68 | 69 | 70 | 71 | 72 | All measurement values inside the alto file except fontsize are related to this unit. The default is 1/10 of mm 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | Styles define properties of layout elements. A style defined in a parent element is used as default style for all related children elements. 98 | 99 | 100 | 101 | 102 | 103 | A text style defines font properties of text. 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | A paragraph style defines formatting properties of text blocks. 113 | 114 | 115 | 116 | 117 | 118 | Indicates the alignment of the paragraph. Could be left, right, center or justify. 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | Left indent of the paragraph in relation to the column. 132 | 133 | 134 | 135 | 136 | Right indent of the paragraph in relation to the column. 137 | 138 | 139 | 140 | 141 | Line spacing between two lines of the paragraph. 
Measurement calculated from baseline to baseline. 142 | 143 | 144 | 145 | 146 | Indent of the first line of the paragraph if this is different from the other lines. A negative value indicates an indent to the left, a positive value indicates an indent to the right. 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | The root layout element. 157 | 158 | 159 | 160 | 161 | 162 | One page of a book or journal. 163 | 164 | 165 | 166 | 167 | 168 | The area between the top line of print and the upper edge of the leaf. It may contain page number or running title. 169 | 170 | 171 | 172 | 173 | The area between the printspace and the left border of a page. May contain margin notes. 174 | 175 | 176 | 177 | 178 | The area between the printspace and the right border of a page. May contain margin notes. 179 | 180 | 181 | 182 | 183 | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word. 184 | 185 | 186 | 187 | 188 | Rectangle covering the printed area of a page. Page number and running title are not part of the print space. 189 | 190 | 191 | 192 | 193 | 194 | 195 | Any user-defined class like title page. 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | The number of the page within the document. 204 | 205 | 206 | 207 | 208 | The page number that is printed on the page. 209 | 210 | 211 | 212 | 213 | Gives brief information about original page quality 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | Gives more details about the original page quality, since QUALITY attribute gives only brief and restrictive information 230 | 231 | 232 | 233 | 234 | Position of the page. Could be lefthanded, righthanded, cover, foldout or single if it has no special position. 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | A link to the processing description that has been used for this page. 
249 | 250 | 251 | 252 | 253 | Estimated percentage of OCR Accuracy in range from 0 to 100 254 | 255 | 256 | 257 | 258 | Page Confidence: Confidence level of the ocr for this page. A value between 0 (unsure) and 1 (sure). 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | Group of available block types 279 | 280 | 281 | 282 | 283 | A block of text. 284 | 285 | 286 | 287 | 288 | A picture or image. 289 | 290 | 291 | 292 | 293 | A graphic used to separate blocks. Usually a line or rectangle. 294 | 295 | 296 | 297 | 298 | A block that consists of other blocks 299 | 300 | 301 | 302 | 303 | 304 | 305 | Base type for any kind of block on the page. 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | Tells the rotation of the block e.g. text or illustration. The value is in degree counterclockwise. 319 | 320 | 321 | 322 | 323 | The next block in reading sequence on the page. 324 | 325 | 326 | 327 | 328 | 329 | 330 | A sequence of chars. Strings are separated by white spaces or hyphenation chars. 331 | 332 | 333 | 334 | 335 | Any alternative for the word. 336 | 337 | 338 | 339 | 340 | 341 | 342 | Identifies the purpose of the alternative. 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | Type of the substitution (if any). 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | Content of the substitution. 379 | 380 | 381 | 382 | 383 | Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure). 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character. 
395 | 396 | 397 | 398 | 399 | 400 | A region on a page 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | A list of points 415 | 416 | 417 | 418 | 419 | 420 | Describes the bounding shape of a block, if it is not rectangular. 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | A polygon shape. 431 | 432 | 433 | 434 | 435 | 436 | An ellipse shape. 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | A circle shape. 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | Formatting attributes. Note that these attributes are assumed to be inherited from ancestor elements of the document hierarchy. 454 | 455 | 456 | 457 | The font name. 458 | 459 | 460 | 461 | 462 | 463 | 464 | The font size, in points (1/72 of an inch). 465 | 466 | 467 | 468 | 469 | Font color as RGB value 470 | 471 | 472 | 473 | 474 | 475 | 476 | Serif or Sans-Serif 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | 485 | fixed or proportional 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | Information to identify the image file from which the OCR text was created. 495 | 496 | 497 | 498 | 499 | 500 | 501 | 502 | 503 | A unique identifier for the image file. This is drawn from MIX. 504 | This identifier must be unique within the local system. To facilitate file sharing or interoperability with other systems, fileIdentifierLocation may be added to designate the system or application where the identifier is unique. 505 | 506 | 507 | 508 | 509 | 510 | A location qualifier, i.e., a namespace. 511 | 512 | 513 | 514 | 515 | 516 | 517 | 518 | Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. 519 | Where possible, this draws from MIX's change history. 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | A processing step. 530 | 531 | 532 | 533 | 534 | Date or DateTime the image was processed. 535 | 536 | 537 | 538 | 539 | Identifies the organization-level producer(s) of the processed image. 
540 | 541 | 542 | 543 | 544 | An ordinal listing of the image processing steps performed. For example, "image despeckling." 545 | 546 | 547 | 548 | 549 | A description of any setting of the processing application. For example, for a multi-engine OCR application this might include the engines which were used. Ideally, this description should be adequate so that someone else using the same application can produce identical results. 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | Information about a software application. Where applicable, the preferred method for determining this information is by selecting Help --> About. 558 | 559 | 560 | 561 | 562 | The name of the organization or company that created the application. 563 | 564 | 565 | 566 | 567 | The name of the application. 568 | 569 | 570 | 571 | 572 | The version of the application. 573 | 574 | 575 | 576 | 577 | A description of any important characteristics of the application, especially for non-commercial applications. For example, if a non-commercial application is built using commercial components, e.g., an OCR engine SDK. Those components should be mentioned here. 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | List of any combination of font styles 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | A block that consists of other blocks 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | 618 | A user defined string to identify the type of composed block (e.g. table, advertisement, ...) 619 | 620 | 621 | 622 | 623 | An ID to link to an image which contains only the composed block. The ID and the file link is defined in the related METS file. 624 | 625 | 626 | 627 | 628 | 629 | 630 | 631 | A picture or image. 632 | 633 | 634 | 635 | 636 | 637 | A user defined string to identify the type of illustration like photo, map, drawing, chart, ... 
638 | 639 | 640 | 641 | 642 | A link to an image which contains only the illustration. 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | A graphic used to separate blocks. Usually a line or rectangle. 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | A block of text. 659 | 660 | 661 | 662 | 663 | 664 | 665 | A single line of text. 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | A white space. 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | 684 | 685 | A hyphenation char. Can appear only at the end of a line. 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 | 702 | 703 | 704 | Correction Status. Indicates whether manual correction has been done or not. 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | -------------------------------------------------------------------------------- /v3/ALTO-language support discussion so far-20150601.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altoxml/schema/1a67f01c3689e5ff4b1714c4395b9fdcf668d93b/v3/ALTO-language support discussion so far-20150601.pdf -------------------------------------------------------------------------------- /v3/Comparison of text direction elements.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altoxml/schema/1a67f01c3689e5ff4b1714c4395b9fdcf668d93b/v3/Comparison of text direction elements.pdf -------------------------------------------------------------------------------- /v3/discussion of ALTO language support.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/altoxml/schema/1a67f01c3689e5ff4b1714c4395b9fdcf668d93b/v3/discussion of ALTO language support.pdf --------------------------------------------------------------------------------
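For orientation, the following is a hypothetical ALTO instance fragment, not one of the schema files in this repository. It sketches how the annotations in the v1.4 and v2.0 schemas above fit together for a word hyphenated across two lines: the first part carries the substitution type and the full word, a hyphenation char closes the line, and the second part repeats the full word. All IDs, coordinates, and confidence values are invented for illustration, and writing the per-character confidence list as space-separated digits is an assumption about serialization. Note the opposite scales: word confidence (`WC`) runs from 0 (unsure) to 1 (sure), while each per-character value (`CC`) runs from 0 (sure) to 9 (unsure).

```xml
<!-- Illustrative fragment only; every ID, coordinate, and confidence value is invented. -->
<TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="600" HEIGHT="90">
  <TextLine ID="TL1" HPOS="100" VPOS="200" WIDTH="600" HEIGHT="40">
    <String ID="S1" HPOS="100" VPOS="200" WIDTH="90" HEIGHT="40"
            CONTENT="text" WC="0.98" CC="0 0 0 0"/>
    <!-- A white space separating two strings. -->
    <SP ID="SP1" HPOS="190" VPOS="200" WIDTH="15"/>
    <!-- First part of a hyphenated word; SUBS_CONTENT holds the full word. -->
    <String ID="S2" HPOS="205" VPOS="200" WIDTH="140" HEIGHT="40"
            CONTENT="hyphen" SUBS_TYPE="HypPart1" SUBS_CONTENT="hyphenation"
            WC="0.91" CC="0 0 1 0 0 2"/>
    <!-- The hyphenation char may appear only at the end of a line. -->
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine ID="TL2" HPOS="100" VPOS="250" WIDTH="600" HEIGHT="40">
    <!-- Second part of the same word, with the same SUBS_CONTENT. -->
    <String ID="S3" HPOS="100" VPOS="250" WIDTH="110" HEIGHT="40"
            CONTENT="ation" SUBS_TYPE="HypPart2" SUBS_CONTENT="hyphenation"
            WC="0.95" CC="0 0 0 0 0"/>
  </TextLine>
</TextBlock>
```

A consumer that wants running text would typically use `SUBS_CONTENT` once for the pair of strings and skip the `HYP` element, rather than concatenating the two `CONTENT` values.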