├── README.md ├── images ├── dla_examples_1.png ├── dla_examples_2.png ├── dla_examples_3.png ├── dla_examples_4.png ├── dqa_example_1.gif ├── dqa_example_2.png ├── du_example.png ├── kie_examples_1.png ├── kie_examples_2.png ├── kie_examples_3.png ├── kie_examples_4.png ├── kie_examples_5.png ├── vrd_examples_1.png ├── vrd_examples_2.png └── vrd_examples_2v2.png └── topics ├── dla └── README.md ├── dqa └── README.md ├── kie └── README.md ├── ocr └── README.md ├── related └── README.md └── sdu └── README.md /README.md: -------------------------------------------------------------------------------- 1 | # Awesome Document Understanding [![Awesome](https://awesome.re/badge-flat.svg)](https://awesome.re) 2 | 3 | A curated list of resources for Document Understanding (DU), a topic related to Intelligent Document Processing (IDP), which in turn relates to Robotic Process Automation (RPA) over unstructured data, especially from Visually Rich Documents (VRDs). 4 | 5 | **Note 1: bolded entries are more important than others.** 6 | 7 | **Note 2: due to the novelty of the field, this list is under construction - contributions are welcome (thank you in advance!).** Please remember to use the following convention: 8 | * [Title of a publication / dataset / resource title](https://arxiv.org), \[[code/data/Website](https://github.com/example/test) ![](https://img.shields.io/github/stars/example/test.svg?style=social)\] 9 |
10 | List of authors Conference/Journal name Year 11 | Dataset size: Train(no of examples), Dev(no of examples), Test(no of examples) [Optional for dataset papers/resources]; Abstract/short description ... 12 |
13 |

14 | 15 | 16 |

17 | 18 | 19 | 20 |

21 |

22 | 23 | # Table of contents 24 | 25 | 1. [Introduction](#introduction) 26 | 1. [Research topics](#research-topics) 27 | 1. [Key Information Extraction (KIE)](topics/kie/README.md) 28 | 1. [Document Layout Analysis (DLA)](topics/dla/README.md) 29 | 1. [Document Question Answering (DQA)](topics/dqa/README.md) 30 | 1. [Scientific Document Understanding (SDU)](topics/sdu/README.md) 31 | 1. [Optical Character Recognition (OCR)](topics/ocr/README.md) 32 | 1. [Related](topics/related/README.md) 33 | 1. [General](topics/related/README.md#general) 34 | 1. [Tabular Data Comprehension (TDC)](topics/related/README.md#tabular-data-comprehension) 35 | 1. [Robotic Process Automation (RPA)](topics/related/README.md#robotic-process-automation) 36 | 1. [Others](#others) 37 | 1. [Resources](#resources) 38 | 1. [Datasets for Pre-training Language Models](#datasets-for-pre-training-language-models) 39 | 1. [PDF processing tools](#pdf-processing-tools) 40 | 1. [Conferences / workshops](#conferences-workshops) 41 | 1. [Blogs](#blogs) 42 | 1. [Solutions](#solutions) 43 | 1. [Examples](#examples) 44 | 1. [Visually Rich Documents (VRDs)](#visually-rich-documents) 45 | 1. [Key Information Extraction (KIE)](#key-information-extraction) 46 | 1. [Document Layout Analysis (DLA)](#document-layout-analysis) 47 | 1. [Document Question Answering (DQA)](#document-question-answering) 48 | 1. [Inspirations](#inspirations) 49 | 50 | 51 | # Introduction 52 | 53 | Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. [source](https://arxiv.org/abs/2011.13534) 54 | 55 | 56 | ### Papers 57 | 58 | #### 2023 59 | 60 | * [DocILE Benchmark for Document Information Localization and Extraction](https://arxiv.org/abs/2302.05658), \[[Website](https://docile.rossum.ai)\] \[[benchmark](https://rrc.cvc.uab.es/?ch=26)\] \[[code](https://github.com/rossumai/docile) ![](https://img.shields.io/github/stars/rossumai/docile.svg?style=social)\] 61 |
62 | Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas arxiv pre-print 2023 63 | This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available in the linked repository. 64 |
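For readers new to these two tasks, here is a minimal, self-contained illustration of the kind of output a KILE and LIR system produces; the field names, box format and values below are invented for illustration and are not taken from the DocILE annotation schema:

```python
# Hypothetical predictions for a single page of a business document.
# KILE: every key field is localized with a bounding box (x0, y0, x1, y1, page-relative)
# and assigned a field type.
kile_predictions = [
    {"fieldtype": "invoice_id", "bbox": (0.12, 0.08, 0.31, 0.10), "text": "INV-2023-0042"},
    {"fieldtype": "total_amount", "bbox": (0.70, 0.85, 0.82, 0.88), "text": "1,250.00"},
]

# LIR: key information is additionally grouped into line items (rows of the items table).
line_items = [
    [
        {"fieldtype": "item_description", "text": "Paper A4, 500 sheets"},
        {"fieldtype": "item_quantity", "text": "10"},
        {"fieldtype": "item_amount", "text": "45.00"},
    ],
]

for field in kile_predictions:
    print(field["fieldtype"], "->", field["text"])
```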
65 | 66 | #### 2022 67 | 68 | * [Business Document Information Extraction: Towards Practical Benchmarks](https://arxiv.org/abs/2206.11229) 69 |
70 | Matyáš Skalický, Štěpán Šimsa, Michal Uřičář, Milan Šulc CLEF 2022 71 | Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data. 72 |
73 | 74 | * [Doc2Graph: A Task Agnostic Document Understanding Framework Based on Graph Neural Networks](https://link.springer.com/chapter/10.1007/978-3-031-25069-9_22), \[[code](https://github.com/andreagemelli/doc2graph) ![](https://img.shields.io/github/stars/andreagemelli/doc2graph.svg?style=social)\] 75 |
76 | Andrea Gemelli, Sanket Biswas, Enrico Civitelli, Josep Lladós, Simone Marinai TiE Workshop @ ECCV 2022 77 | Geometric Deep Learning has recently attracted significant interest in a wide range of machine learning fields, including document analysis. The application of Graph Neural Networks (GNNs) has become crucial in various document-related tasks since they can unravel important structural patterns, fundamental in key information extraction processes. Previous works in the literature propose task-driven models and do not take into account the full power of graphs. We propose Doc2Graph, a task-agnostic document understanding framework based on a GNN model, to solve different tasks given different types of documents. We evaluated our approach on two challenging datasets for key information extraction in form understanding, invoice layout analysis and table detection 78 |
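To make the graph view of a page concrete, here is a small, library-agnostic sketch that turns OCR boxes into a graph with k-nearest-neighbour edges, the kind of structure a document GNN operates on; the boxes and the choice of k are illustrative assumptions, not Doc2Graph's actual pipeline:

```python
import math
import networkx as nx

# Illustrative OCR output: one (id, text, bounding box) tuple per text segment.
boxes = [
    (0, "Invoice", (50, 40, 130, 60)),
    (1, "No.", (140, 40, 170, 60)),
    (2, "2023-0042", (180, 40, 280, 60)),
    (3, "Total", (50, 400, 100, 420)),
    (4, "1,250.00", (420, 400, 500, 420)),
]

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

graph = nx.Graph()
for node_id, text, box in boxes:
    graph.add_node(node_id, text=text, bbox=box)

# Connect each box to its k nearest neighbours in 2D; a GNN would then
# propagate information along these edges to classify nodes or edges.
k = 2
for node_id, _, box in boxes:
    cx, cy = center(box)
    distances = sorted(
        (math.dist((cx, cy), center(other_box)), other_id)
        for other_id, _, other_box in boxes
        if other_id != node_id
    )
    for dist, other_id in distances[:k]:
        graph.add_edge(node_id, other_id, distance=dist)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```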
79 | 80 | #### 2021 81 | 82 | * [Document AI: Benchmarks, Models and Applications](https://arxiv.org/abs/2111.08609) 83 |
84 | Lei Cui, Yiheng Xu, Tengchao Lv, Furu Wei arxiv 2021 85 | Document AI, or Document Intelligence, is a relatively new research topic that refers to the techniques for automatically reading, understanding, and analyzing business documents. It is an important research direction for natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI, such as document layout analysis, visual information extraction, document visual question answering, document image classification, etc. This paper briefly reviews some of the representative models, tasks, and benchmark datasets. Furthermore, we also introduce early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches especially pre-training methods. Finally, we look into future directions for Document AI research. 86 |
87 | 88 | * **[Efficient Automated Processing of the Unstructured Documents using Artificial Intelligence: A Systematic Literature Review and Future Directions](https://ieeexplore.ieee.org/abstract/document/9402739)** 89 |
90 | Dipali Baviskar, Swati Ahirrao, Vidyasagar Potdar, Ketan Kotecha IEEE Access 2021 91 | The unstructured data impacts 95% of the organizations and costs them millions of dollars annually. If managed well, it can significantly improve business productivity. The traditional information extraction techniques are limited in their functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing in the literature. The purpose of this Systematic Literature Review (SLR) is to recognize, and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. The SLR guidelines proposed by Kitchenham and Charters were adhered to conduct a literature search on various databases between 2010 and 2020. We found that: 1. The existing information extraction techniques are template-based or rule-based, 2. The existing methods lack the capability to tackle complex document layouts in real-time situations such as invoices and purchase orders, 3.The datasets available publicly are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR discovered that AI-based approaches have a strong potential to extract useful information from unstructured documents automatically. However, they face certain challenges in processing multiple layouts of the unstructured documents. Our SLR brings out conceptualization of a framework for construction of high-quality unstructured documents dataset with strong data validation techniques for automated information extraction. Our SLR also reveals a need for a close association between the businesses and researchers to handle various challenges of the unstructured data analysis. 92 |
93 | 94 | #### 2020 95 | 96 | * **[A Survey of Deep Learning Approaches for OCR and Document Understanding](https://arxiv.org/abs/2011.13534)** 97 |
98 | Nishant Subramani, Alexandre Matton, Malcolm Greaves, Adrian Lam ML-RSA Workshop at NeurIPS 2020 99 | Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. In this survey paper, we review different techniques for document understanding for documents written in English and consolidate methodologies present in literature to act as a jumping-off point for researchers exploring this area. 100 |
101 | 102 | * **[Conversations with Documents. An Exploration of Document-Centered Assistance](https://arxiv.org/pdf/2002.00747.pdf)** 103 |
104 | Maartje ter Hoeve, Robert Sim, Elnaz Nouri, Adam Fourney, Maarten de Rijke, Ryen W. White CHIIR 2020 105 | The role of conversational assistants has become more prevalent in helping people increase their productivity. Document-centered assistance, for example to help an individual quickly review a document, has seen less significant progress, even though it has the potential to tremendously increase a user's productivity. This type of document-centered assistance is the focus of this paper. Our contributions are three-fold: (1) We first present a survey to understand the space of document-centered assistance and the capabilities people expect in this scenario. (2) We investigate the types of queries that users will pose while seeking assistance with documents, and show that document-centered questions form the majority of these queries. (3) We present a set of initial machine learned models that show that (a) we can accurately detect document-centered questions, and (b) we can build reasonably accurate models for answering such questions. These positive results are encouraging, and suggest that even greater results may be attained with continued study of this interesting and novel problem space. Our findings have implications for the design of intelligent systems to support task completion via natural interactions with documents. 106 |
107 | 108 | #### 2018 109 | 110 | * [Future paradigms of automated processing of business documents](https://www.sciencedirect.com/science/article/pii/S0268401217309994) 111 |
112 | Matteo Cristani, Andrea Bertolaso, Simone Scannapieco, Claudio Tomazzoli International Journal of Information Management 2018 113 | In this paper we summarize the results obtained so far in the communities interested in the development of automated processing techniques as applied to business documents, and devise a few evolutions that are demanded by the current stage of either those techniques by themselves or by collateral sector advancements. It emerges a clear picture of a field that has put an enormous effort in solving problems that changed a lot during the last 30 years, and is now rapidly evolving to incorporate document processing into workflow management systems on one side and to include features derived by the introduction of cloud computing technologies on the other side. We propose an architectural schema for business document processing that comes from the two above evolution lines. 114 |
115 | 116 | #### Older 117 | 118 | * [Machine Learning for Intelligent Processing of Printed Documents](https://www.semanticscholar.org/paper/Machine-Learning-for-Intelligent-Processing-of-Esposito-Malerba/1f23b61f04d450ffc49ec6371bb5b30d198cdc5b) 119 |
120 | F. Esposito, D. Malerba, F. Lisi - 2004 121 | A paper document processing system is an information system component which transforms information on printed or handwritten documents into a computer-revisable form. In intelligent systems for paper document processing this information capture process is based on knowledge of the specific layout and logical structures of the documents. This article proposes the application of machine learning techniques to acquire the specific knowledge required by an intelligent document processing system, named WISDOM++, that manages printed documents, such as letters and journals. Knowledge is represented by means of decision trees and first-order rules automatically generated from a set of training documents. In particular, an incremental decision tree learning system is applied for the acquisition of decision trees used for the classification of segmented blocks, while a first-order learning system is applied for the induction of rules used for the layout-based classification and understanding of documents. Issues concerning the incremental induction of decision trees and the handling of both numeric and symbolic data in first-order rule learning are discussed, and the validity of the proposed solutions is empirically evaluated by processing a set of real printed documents. 122 |
123 | 124 | 125 | * [Document Understanding: Research Directions](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.9880&rep=rep1&type=pdf) 126 |
127 | S. Srihari, S. Lam, V. Govindaraju, R. Srihari, J. Hull - 1994 128 | A document image is a visual representation of a printed page such as a journal article page, a facsimile cover page, a technical document, an office letter, etc. Document understanding as a research endeavor consists of studying all processes involved in taking a document through various representations: from a scanned physical document to high-level semantic descriptions of the document. Some of the types of representation that are useful are: editable descriptions, descriptions that enable exact reproductions and high-level semantic descriptions about document content. This report is a definition of five research subdomains within document understanding as pertaining to predominantly printed documents. The topics described are: modular architectures for document understanding; decomposition and structural analysis of documents; model-based OCR; table, diagram and image understanding; and performance evaluation under distortion and noise. 129 |
130 | 131 | 132 | 133 | # Research topics 134 | 135 | * [Key Information Extraction (KIE)](topics/kie/README.md) 136 | * [Document Layout Analysis (DLA)](topics/dla/README.md) 137 | * [Document Question Answering (DQA)](topics/dqa/README.md) 138 | * [Scientific Document Understanding (SDU)](topics/sdu/README.md) 139 | * [Optical Character Recognition (OCR)](topics/ocr/README.md) 140 | * [Related](topics/related/README.md) 141 | * [General](topics/related/README.md#general) 142 | * [Tabular Data Comprehension (TDC)](topics/related/README.md#tabular-data-comprehension) 143 | * [Robotic Process Automation (RPA)](topics/related/README.md#robotic-process-automation) 144 | 145 | # Others 146 | 147 | ## Resources 148 | 149 | [Back to top](#table-of-contents) 150 | 151 | #### Datasets for Pre-training Language Models 152 | 153 | 1. [The RVL-CDIP Dataset](https://adamharley.com/rvl-cdip/) - dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class 154 | 1. [The Industry Documents Library](https://www.industrydocuments.ucsf.edu/) - a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library 155 | 1. [Color Document Dataset](https://ivi.fnwi.uva.nl/isis/UvA-CDD/) - from the Intelligent Sensory Information Systems, University of Amsterdam 156 | 1. [The IIT CDIP Collection](https://data.nist.gov/od/id/mds2-2531) - dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s and contains around 7 million documents 157 | 158 | 159 | #### PDF processing tools 160 | 161 | 1. [borb](https://github.com/jorisschellekens/borb) ![](https://img.shields.io/github/stars/jorisschellekens/borb.svg?style=social) - a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.). 162 | 1. [pawls](https://github.com/allenai/pawls) ![](https://img.shields.io/github/stars/allenai/pawls.svg?style=social) - PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document 163 | 1. [pdfplumber](https://github.com/jsvine/pdfplumber) ![](https://img.shields.io/github/stars/jsvine/pdfplumber.svg?style=social) - Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging 164 | 1. [Pdfminer.six](https://github.com/pdfminer/pdfminer.six) ![](https://img.shields.io/github/stars/pdfminer/pdfminer.six.svg?style=social) - Pdfminer.six is a community maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data 165 | 1. [Layout Parser](https://github.com/Layout-Parser/layout-parser) ![](https://img.shields.io/github/stars/Layout-Parser/layout-parser.svg?style=social) - Layout Parser is a deep learning based tool for document image layout analysis tasks 166 | 1. [Tabulo](https://github.com/interviewBubble/Tabulo) ![](https://img.shields.io/github/stars/interviewBubble/Tabulo.svg?style=social) - Table extraction from images 167 | 1. [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) ![](https://img.shields.io/github/stars/jbarlow83/OCRmyPDF.svg?style=social) - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted 168 | 1. 
[PDFBox](https://github.com/apache/pdfbox) ![](https://img.shields.io/github/stars/apache/pdfbox.svg?style=social) - The Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents 169 | 1. [PdfPig](https://github.com/UglyToad/PdfPig) ![](https://img.shields.io/github/stars/UglyToad/PdfPig.svg?style=social) - This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C# 170 | 1. [parsing-prickly-pdfs](https://porter.io/github.com/jsfenfen/parsing-prickly-pdfs) ![](https://img.shields.io/github/stars/jsfenfen/parsing-prickly-pdfs.svg?style=social) - Resources and worksheet for the NICAR 2016 workshop of the same name 171 | 1. [pdf-text-extraction-benchmark](https://github.com/ckorzen/pdf-text-extraction-benchmark) ![](https://img.shields.io/github/stars/ckorzen/pdf-text-extraction-benchmark.svg?style=social) - PDF tools benchmark 172 | 1. [Born digital pdf scanner](https://github.com/applicaai/digital-born-pdf-scanner) ![](https://img.shields.io/github/stars/applicaai/digital-born-pdf-scanner.svg?style=social) - checks whether a PDF is born-digital 173 | 1. [OpenContracts](https://github.com/JSv4/OpenContracts) ![](https://img.shields.io/github/stars/JSv4/OpenContracts?style=social) Apache2-licensed PDF annotation platform for visually-rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLs, but with a Python-based backend and readily deployable on your local machine, company intranet or the web via Docker Compose. 174 | 1. [deepdoctection](https://github.com/deepdoctection/deepdoctection) ![](https://img.shields.io/github/stars/deepdoctection/deepdoctection?style=social) **deep**doctection is a Python library that orchestrates document extraction and document layout analysis tasks for images and pdf documents using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks and provides an integrated framework for fine-tuning, evaluating and running models. 175 | 1. [pydoxtools](https://github.com/xyntopia/pydoxtools) ![](https://img.shields.io/github/stars/xyntopia/pydoxtools.svg?style=social) Pydoxtools is an AI-composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keyword extraction, summarization and question answering out of the box, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy. 176 | 177 | ## Conferences, workshops 178 | 179 | [Back to top](#table-of-contents) 180 | 181 | #### General / Business / Finance 182 | 183 | 1. **International Conference on Document Analysis and Recognition (ICDAR)** [[2021](https://icdar2021.org/), [2019](http://icdar2019.org/), [2017](http://u-pat.org/ICDAR2017/index.php)] 184 | 1. Workshop on Document Intelligence (DI) [[2021](https://document-intelligence.github.io/DI-2021/), [2019](https://sites.google.com/view/di2019)] 185 | 1. 
Financial Narrative Processing Workshop (FNP) [[2021](http://wp.lancs.ac.uk/cfie/fnp2021/), [2020](http://wp.lancs.ac.uk/cfie/fincausal2020/), [2019](https://www.aclweb.org/anthology/volumes/W19-64/) ] 186 | 1. Workshop on Economics and Natural Language Processing (ECONLP) [[2021](https://julielab.de/econlp/2021/), [2019](https://sites.google.com/view/econlp-2019), [2018](https://www.aclweb.org/anthology/W18-31.pdf) ] 187 | 1. INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS) [[2020](https://www.vlrlab.net/das2020/), [2018](https://das2018.cvl.tuwien.ac.at/en/), [2016](https://www.primaresearch.org/das2016/)] 188 | 1. [ACM International Conference on AI in Finance (ICAIF)](https://ai-finance.org/) 189 | 1. [The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services](https://aaai-kdf.github.io/kdf2021/) 190 | 1. [CVPR 2020 Workshop on Text and Documents in the Deep Learning Era](https://cvpr2020text.wordpress.com/accepted-papers/) 191 | 1. [KDD Workshop on Machine Learning in Finance (KDD MLF 2020)](https://sites.google.com/view/kdd-mlf-2020) 192 | 1. [FinIR 2020: The First Workshop on Information Retrieval in Finance](https://finir2020.github.io/) 193 | 1. [2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)](https://sites.google.com/view/kdd-adf-2019) 194 | 1. [Document Understanding Conference (DUC 2007)](https://duc.nist.gov/pubs.html) 195 | 196 | #### Scientific Document Understanding 197 | 198 | 1. [The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)](https://sites.google.com/view/sdu-aaai21/home) 199 | 1. [First Workshop on Scholarly Document Processing (SDProc 2020)](https://ornlcda.github.io/SDProc/) 200 | 1. International Workshop on SCIentific DOCument Analysis (SCIDOCA) [[2020](http://research.nii.ac.jp/SCIDOCA2020/), [2018](http://www.jaist.ac.jp/event/SCIDOCA/2018/), [2017](https://aclweb.org/portal/content/second-international-workshop-scientific-document-analysis) ] 201 | 202 | ## Blogs 203 | 204 | [Back to top](#table-of-contents) 205 | 206 | 1. [A Survey of Document Understanding Models](https://www.pragmatic.ml/a-survey-of-document-understanding-models/), 2021 207 | 1. [Document Form Extraction](https://www.crosstab.io/product-comparisons/document-form-extraction), 2021 208 | 1. [How to automate processes with unstructured data](https://levity.ai/blog/automate-processes-with-unstructured-data), 2021 209 | 1. [A Comprehensive Guide to OCR with RPA and Document Understanding](https://nanonets.com/blog/ocr-with-rpa-and-document-understanding-uipath/), 2021 210 | 1. [Information Extraction from Receipts with Graph Convolutional Networks](https://nanonets.com/blog/information-extraction-graph-convolutional-networks/), 2021 211 | 1. [How to extract structured data from invoices](https://nanonets.com/blog/extract-structured-data-from-invoice/), 2021 212 | 1. [Extracting Structured Data from Templatic Documents](https://ai.googleblog.com/2020/06/extracting-structured-data-from.html), 2020 213 | 1. [To apply AI for good, think form extraction](http://jonathanstray.com/to-apply-ai-for-good-think-form-extraction), 2020 214 | 1. [UiPath Document Understanding Solution Architecture and Approach](https://medium.com/@lahirufernando90/uipath-document-understanding-solution-architecture-and-approach-934a9a26630a), 2020 215 | 1. [How Can I Automate Data Extraction from Complex Documents?](https://www.infrrd.ai/blog/how-can-i-automate-data-extraction-from-complex-documents), 2020 216 | 1. 
[LegalTech: Information Extraction in legal documents](https://naturaltech.medium.com/legaltech-information-extraction-in-legal-documents-e1843a60bc8d), 2020 217 | 218 | ## Solutions 219 | 220 | [Back to top](#table-of-contents) 221 | 222 | Big companies: 223 | 1. [ABBYY](https://www.abbyy.com/flexicapture/) 224 | 1. [Accenture](https://www.accenture.com/us-en/services/applied-intelligence/document-understanding-solutions) 225 | 1. [Amazon](https://aws.amazon.com/about-aws/whats-new/2020/11/introducing-document-understanding-solution/) 226 | 1. [Google](https://cloud.google.com/document-ai) 227 | 1. [Microsoft](https://azure.microsoft.com/en-us/services/cognitive-services/) 228 | 1. [UiPath](https://www.uipath.com/product/document-understanding) 229 | 230 | Smaller: 231 | 1. [Applica.ai](https://applica.ai/) 232 | 1. [Base64.ai](https://base64.ai) 233 | 1. [Docstack](https://www.docstack.com/ai-document-understanding) 234 | 1. [Element AI](https://www.elementai.com/products/document-intelligence) 235 | 1. [Indico](https://indico.io) 236 | 1. [Instabase](https://instabase.com/) 237 | 1. [Konfuzio](https://konfuzio.com/en/) 238 | 1. [Metamaze](https://metamaze.eu) 239 | 1. [Nanonets](https://nanonets.com) 240 | 1. [Rossum](https://rossum.ai/) 241 | 1. [Silo](https://silo.ai/how-document-understanding-improves-invoice-contract-and-resume-processing/) 242 | 243 | # Examples 244 | 245 | ## Visually Rich Documents 246 | 247 | [Back to top](#table-of-contents) 248 | 249 | In VRDs, layout information is crucial for understanding the whole document correctly (this is the case with almost all business documents). For humans, spatial information improves readability and speeds up document understanding. 250 | 251 | #### Invoice / Resume / Job Ad 252 | 253 |

254 | 255 | 256 | 257 |

258 |

259 | 260 | #### NDA / Annual reports 261 | 262 |

263 | 264 | 265 | 266 |

267 |

268 | 269 | 270 | ## Key Information Extraction 271 | 272 | [Back to top](#table-of-contents) 273 | 274 | The aim of this task is to extract texts of a number of key fields from a given collection of documents containing similar key entities. 275 | 276 |
277 | 278 | #### Scanned Receipts 279 | 280 |

281 | 282 | 283 | 284 |

285 |

286 | 287 | #### NDA / Annual reports 288 | 289 | Examples of a real business applications and data for Kleister datasets (The key entities are in blue) 290 | 291 |

292 | 293 | 294 | 295 |

296 |

297 | 298 | #### Multimedia Online Flyers 299 | 300 | An example of a commercial real estate flyer and manually entered listing information © ProMaker Commercial Real Estate LLC, © BrokerSavant Inc. 301 | 302 |

303 | 304 | 305 | 306 |

307 |

308 | 309 | #### Value-added tax invoice 310 | 311 |

312 | 313 | 314 | 315 |

316 |

317 | 318 | #### Webpages 319 | 320 |

321 | 322 | 323 | 324 |

325 |

326 | 327 | 328 | ## Document Layout Analysis 329 | 330 | [Back to top](#table-of-contents) 331 | 332 | In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis. (https://en.wikipedia.org/wiki/Document_layout_analysis) 333 | 334 | 335 | #### Scientific publication 336 | 337 |

338 | 339 | 340 | 341 |

342 |

343 | 344 | 345 |

346 | 347 | 348 | 349 |

350 |

351 | 352 | 353 | #### Historical newspapers 354 | 355 |

356 | 357 | 358 | 359 |

360 |

361 | 362 | 363 | #### Business documents 364 | 365 | Red: text block, Blue: figure. 366 | 367 |

368 | 369 | 370 | 371 |

372 |

373 | 374 | 375 | ## Document Question Answering 376 | 377 | [Back to top](#table-of-contents) 378 | 379 | 380 | #### DocVQA example 381 | 382 |

383 | 384 | 385 | 386 |

387 |

388 | 389 | 390 | #### [Tilt model](https://arxiv.org/pdf/2102.09550.pdf) demo 391 | 392 |

393 | 394 | 395 | 396 |

397 |

398 | 399 | # Inspirations 400 | 401 | [Back to top](#table-of-contents) 402 | 403 | **Domain** 404 | 1. https://github.com/kba/awesome-ocr ![](https://img.shields.io/github/stars/kba/awesome-ocr.svg?style=social) 405 | 1. https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics ![](https://img.shields.io/github/stars/Liquid-Legal-Institute/Legal-Text-Analytics.svg?style=social) 406 | 1. https://github.com/icoxfog417/awesome-financial-nlp ![](https://img.shields.io/github/stars/icoxfog417/awesome-financial-nlp.svg?style=social) 407 | 1. https://github.com/BobLd/DocumentLayoutAnalysis ![](https://img.shields.io/github/stars/BobLd/DocumentLayoutAnalysis.svg?style=social) 408 | 1. https://github.com/bikash/DocumentUnderstanding ![](https://img.shields.io/github/stars/bikash/DocumentUnderstanding.svg?style=social) 409 | 1. https://github.com/harpribot/awesome-information-retrieval ![](https://img.shields.io/github/stars/harpribot/awesome-information-retrieval.svg?style=social) 410 | 1. https://github.com/roomylee/awesome-relation-extraction ![](https://img.shields.io/github/stars/roomylee/awesome-relation-extraction.svg?style=social) 411 | 1. https://github.com/caufieldjh/awesome-bioie ![](https://img.shields.io/github/stars/caufieldjh/awesome-bioie.svg?style=social) 412 | 1. https://github.com/HelloRusk/entity-related-papers ![](https://img.shields.io/github/stars/HelloRusk/entity-related-papers.svg?style=social) 413 | 1. https://github.com/pliang279/awesome-multimodal-ml ![](https://img.shields.io/github/stars/pliang279/awesome-multimodal-ml.svg?style=social) 414 | 1. https://github.com/thunlp/LegalPapers ![](https://img.shields.io/github/stars/thunlp/LegalPapers.svg?style=social) 415 | 1. https://github.com/heartexlabs/awesome-data-labeling ![](https://img.shields.io/github/stars/heartexlabs/awesome-data-labeling.svg?style=social) 416 | 417 | **General AI/DL/ML** 418 | 1. https://github.com/jsbroks/awesome-dataset-tools ![](https://img.shields.io/github/stars/jsbroks/awesome-dataset-tools.svg?style=social) 419 | 1. https://github.com/EthicalML/awesome-production-machine-learning ![](https://img.shields.io/github/stars/EthicalML/awesome-production-machine-learning.svg?style=social) 420 | 1. https://github.com/eugeneyan/applied-ml ![](https://img.shields.io/github/stars/eugeneyan/applied-ml.svg?style=social) 421 | 1. https://github.com/awesomedata/awesome-public-datasets ![](https://img.shields.io/github/stars/awesomedata/awesome-public-datasets.svg?style=social) 422 | 1. https://github.com/keon/awesome-nlp ![](https://img.shields.io/github/stars/keon/awesome-nlp.svg?style=social) 423 | 1. https://github.com/thunlp/PLMpapers ![](https://img.shields.io/github/stars/thunlp/PLMpapers.svg?style=social) 424 | 1. https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists ![](https://img.shields.io/github/stars/jbhuang0604/awesome-computer-vision.svg?style=social) 425 | 1. https://github.com/papers-we-love/papers-we-love ![](https://img.shields.io/github/stars/papers-we-love/papers-we-love.svg?style=social) 426 | 1. https://github.com/BAILOOL/DoYouEvenLearn ![](https://img.shields.io/github/stars/BAILOOL/DoYouEvenLearn.svg?style=social) 427 | 1. 
https://github.com/hibayesian/awesome-automl-papers ![](https://img.shields.io/github/stars/hibayesian/awesome-automl-papers.svg?style=social) 428 | -------------------------------------------------------------------------------- /images/dla_examples_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/dla_examples_1.png -------------------------------------------------------------------------------- /images/dla_examples_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/dla_examples_2.png -------------------------------------------------------------------------------- /images/dla_examples_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/dla_examples_3.png -------------------------------------------------------------------------------- /images/dla_examples_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/dla_examples_4.png -------------------------------------------------------------------------------- /images/dqa_example_1.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/dqa_example_1.gif -------------------------------------------------------------------------------- /images/dqa_example_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/dqa_example_2.png -------------------------------------------------------------------------------- /images/du_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/du_example.png -------------------------------------------------------------------------------- /images/kie_examples_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/kie_examples_1.png -------------------------------------------------------------------------------- /images/kie_examples_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/kie_examples_2.png -------------------------------------------------------------------------------- /images/kie_examples_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/kie_examples_3.png 
-------------------------------------------------------------------------------- /images/kie_examples_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/kie_examples_4.png -------------------------------------------------------------------------------- /images/kie_examples_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/kie_examples_5.png -------------------------------------------------------------------------------- /images/vrd_examples_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/vrd_examples_1.png -------------------------------------------------------------------------------- /images/vrd_examples_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/vrd_examples_2.png -------------------------------------------------------------------------------- /images/vrd_examples_2v2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tstanislawek/awesome-document-understanding/2ad3a0f037b0cdcbba90843069836a80bf1dd8a9/images/vrd_examples_2v2.png -------------------------------------------------------------------------------- /topics/dla/README.md: -------------------------------------------------------------------------------- 1 | ## Table of contents 2 | 3 | 1. [Overview](#overview) 4 | 1. [Papers](#papers) 5 | 1. [Datasets](#datasets) 6 | 7 | 8 | ## Overview 9 | 10 | Document Layout Analysis is a Computer Vision approach to the problem of detection of specific objects in a document, such as: 11 | * tables 12 | * form fields 13 | * clusters of text 14 | * stamps 15 | * images (i.e. logos...), 16 | * barcodes, 17 | * hand written parts, 18 | * headers, 19 | * check boxes, 20 | * etc. 21 | 22 | 23 | ## Papers 24 | 25 | 26 | * [(Fintabnet) Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context](https://openaccess.thecvf.com/content/WACV2021/papers/Zheng_Global_Table_Extractor_GTE_A_Framework_for_Joint_Table_Identification_WACV_2021_paper.pdf) 27 |
28 | Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong, Nancy Xin Ru Wang WACV 2021 29 | Documents are often the format of choice for knowledge sharing and preservation in business and science, within which are tables that capture most of the critical data. Unfortunately, most documents are stored and distributed as PDF or scanned images, which fail to preserve table formatting. Recent vision-based deep learning approaches have been proposed to address this gap, but most still cannot achieve state-of-the-art results. We present Global Table Extractor (GTE), a vision-guided systematic framework for joint table detection and cell structured recognition, which could be built on top of any object detection model. With GTE-Table, we invent a new penalty based on the natural cell containment constraint of tables to train our table network aided by cell location predictions. GTE-Cell is a new hierarchical cell detection network that leverages table styles. Further, we design a method to automatically label table and cell structure in existing documents to cheaply create a large corpus of training and test data. We use this to enhance PubTabNet with cell labels and create FinTabNet, real-world and complex scientific and financial datasets with detailed table structure annotations to help train and test structure recognition. Our deep learning framework surpasses previous state-of-the-art results on the ICDAR 2013 and ICDAR 2019 table competition test dataset in both table detection and cell structure recognition. Further experiments demonstrate a greater than 45% improvement in cell structure recognition when compared to a vanilla RetinaNet object detection model in our new out-of-domain financial dataset (Fintabnet). 30 |
31 | 32 | * **[CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents](https://arxiv.org/pdf/2004.12629.pdf)**, \[[code](https://github.com/DevashishPrasad/CascadeTabNet) ![](https://img.shields.io/github/stars/DevashishPrasad/CascadeTabNet.svg?style=social)\] 33 |
34 | Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, Kavita Sultanpure CVPR Workshop 2020 35 | CascadeTabNet is an automatic table recognition method for interpretation of tabular data in document images. We present an improved deep learning-based end to end approach for solving both problems of table detection and structure recognition using a single Convolution Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank dataset. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset. 36 |
37 | 38 | * [Document Structure Extraction for Forms using Very High Resolution Semantic Segmentation](https://www.researchgate.net/profile/Mausoom-Sarkar/publication/337590348_Document_Structure_Extraction_for_Forms_using_Very_High_Resolution_Semantic_Segmentation/links/5e6c91bc299bf12e23c35820/Document-Structure-Extraction-for-Forms-using-Very-High-Resolution-Semantic-Segmentation.pdf) 39 |
40 | Mausoom Sarkar et al. ECCV 2020 41 | In this work, we look at the problem of structure extraction from document images with a specific focus on forms. Forms as a document class have not received much attention, even though they comprise a significant fraction of documents and enable several applications. Forms possess a rich, complex, hierarchical, and high-density semantic structure that poses several challenges to semantic segmentation methods. We propose a prior based deep CNN-RNN hierarchical network architecture that enables document structure extraction using very high resolution(1800 x 1000) images. We divide the document image into overlapping horizontal strips such that the network segments a strip and uses its prediction mask as prior while predicting the segmentation for the subsequent strip. We perform experiments establishing the effectiveness of our strip based network architecture through ablation methods and comparison with low-resolution variations. We introduce our new rich human-annotated forms dataset, and we show that our method significantly outperforms other segmentation baselines in extracting several hierarchical structures on this dataset. We also outperform other baselines in table detection task on the Marmot dataset. Our method is currently being used in a world-leading customer experience management software suite for automated conversion of paper and PDF forms to modern HTML based forms. 42 |
43 | 44 | * [Visual Segmentation for Information Extraction from Heterogeneous Visually Rich Documents](https://www.researchgate.net/publication/333859687_Visual_Segmentation_for_Information_Extraction_from_Heterogeneous_Visually_Rich_Documents) 45 |
46 | Ritesh Sarkhel, Arnab Nandi SIGMOD 2019 47 | Physical and digital documents often contain visually rich information. With such information, there is no strict ordering or positioning in the document where the data values must appear. Along with textual cues, these documents often also rely on salient visual features to define distinct semantic boundaries and augment the information they disseminate. When performing information extraction (IE), traditional techniques fall short, as they use a text-only representation and do not consider the visual cues inherent to the layout of these documents. We propose VS2, a generalized approach for information extraction from heterogeneous visually rich documents. There are two major contributions of this work. First, we propose a robust segmentation algorithm that decomposes a visually rich document into a bag of visually isolated but semantically coherent areas, called logical blocks. Document type agnostic low-level visual and semantic features are used in this process. Our second contribution is a distantly supervised search-and-select method for identifying the named entities within these documents by utilizing the context boundaries defined by these logical blocks. Experimental results on three heterogeneous datasets suggest that the proposed approach significantly outperforms its text-only counterparts on all datasets. Comparing it against the state-of-the-art methods also reveal that VS2 performs comparably or better on all datasets. 48 |
49 | 50 | * [One-shot field spotting on colored forms using subgraph isomorphism](https://hal.archives-ouvertes.fr/hal-01249470/file/bare_conf.pdf) 51 |
52 | Maroua Hammami et al. ICDAR 2015 53 | This paper presents an approach for spotting textual fields in commercial and administrative colored forms. We proceed by locating these fields thanks to their neighboring context which is modeled with a structural representation. First, informative zones are extracted. Second, forms are represented by graphs. In these graphs, nodes represent colored rectangular shapes while edges represent neighboring relations. Finally, the neighboring context of the queried region of interest is modeled as a graph. Subgraph isomorphism is applied in order to locate this ROI in the structural representation of a whole document. Evaluated on a 130-document image dataset, experimental results show up that our approach is efficient and that the requested information is found even if its position is changed. 54 |
55 | 56 | 57 | ## Datasets 58 | 59 | * [DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis](https://arxiv.org/pdf/2206.01062.pdf), \[[code/data](https://github.com/DS4SD/DocLayNet) ![](https://img.shields.io/github/stars/DS4SD/DocLayNet.svg?style=social)\] 60 |
61 | Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar and Peter Staar Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 62 | In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. 63 |
64 | 65 | * [DocBank: A Benchmark Dataset for Document Layout Analysis](https://arxiv.org/pdf/2006.01038.pdf), \[[code/data](https://github.com/doc-analysis/DocBank) ![](https://img.shields.io/github/stars/doc-analysis/DocBank.svg?style=social)\] 66 |
67 | Minghao Li et al. COLING 2020 68 | DocBank is a new large-scale dataset that is constructed using a weak supervision approach. It enables models to integrate both the textual and layout information for downstream tasks. The current DocBank dataset totally includes 500K document pages, where 400K for training, 50K for validation and 50K for testing. 69 |
70 | 71 | * [Tablebank: Table benchmark for image-based table detection and recognition](https://www.aclweb.org/anthology/2020.lrec-1.236/), \[[code/data](https://github.com/doc-analysis/TableBank) ![](https://img.shields.io/github/stars/doc-analysis/TableBank.svg?style=social)\] 72 |
73 | Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li LREC 2020 74 | We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet. Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize on real-world applications. With TableBank that contains 417K high quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches in the table detection and recognition task. The dataset and models can be downloaded from https://github.com/doc-analysis/TableBank. 75 |
76 | 77 | * [HJDataset: A Large Dataset of Historical Japanese Documents with Complex Layouts](https://arxiv.org/pdf/2004.08686.pdf), \[[code](https://github.com/dell-research-harvard/HJDataset) ![](https://img.shields.io/github/stars/dell-research-harvard/HJDataset.svg?style=social) \] 78 |
79 | Zejiang Shen, Kaixuan Zhang, Melissa Dell CVPR2020 Workshop 2020 80 | Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exist for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts. A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models. And we demonstrate the usefulness of the dataset on real-world document digitization tasks. 81 |
82 | 83 | 84 | * [PubLayNet: largest dataset ever for document layout analysis](https://arxiv.org/abs/1908.07836), \[[code](https://github.com/ibm-aur-nlp/PubLayNet) ![](https://img.shields.io/github/stars/ibm-aur-nlp/PubLayNet.svg?style=social)\] 85 |
86 | Xu Zhong, Jianbin Tang, Antonio Jimeno Yepes ICDAR 2019 87 | Recognizing the layout of unstructured digital documents is an important step when parsing the documents into structured machine-readable format for downstream applications. Deep neural networks that are developed for computer vision have been proven to be an effective method to analyze layout of document images. However, document layout datasets that are currently publicly available are several magnitudes smaller than established computing vision datasets. Models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base mode for transfer learning on a different document domain. We release the dataset to support development and evaluation of more advanced models for document layout analysis. 88 |
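A practical way to try a PubLayNet-trained detector is the Layout Parser library listed in this repository's PDF processing tools section. A minimal sketch is below; the model-zoo path, score threshold and label map follow Layout Parser's published examples, the image path is a placeholder, and the Detectron2 extra must be installed:

```python
import cv2
import layoutparser as lp

# Load a page image (placeholder path); OpenCV reads BGR, so flip to RGB.
image = cv2.imread("page.png")
image = image[..., ::-1]

# Detectron2 model from the Layout Parser model zoo, pre-trained on PubLayNet.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

layout = model.detect(image)  # a Layout of TextBlocks with coordinates and labels
tables = [block for block in layout if block.type == "Table"]
print(f"Detected {len(layout)} regions, {len(tables)} of them tables")
```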
89 | 90 | 91 | * [ICDAR2017 Competition on Recognition of Early Indian Printed Documents – REID2017](https://www.primaresearch.org/www/assets/papers/ICDAR2017_Clausner_REID2017.pdf), \[[Website](https://www.primaresearch.org/datasets/REID2017)\] 92 |
93 | Christian Clausner, Apostolos Antonacopoulos, Tom Derrick, Stefan Pletschacher ICDAR 2017 94 | This paper presents an objective comparative evaluation of page analysis and recognition methods for historical documents with text mainly in Bengali language and script. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context of ICDAR2017, presenting the results of the evaluation of seven methods – three submitted and four variations of open source state-of-the-art systems. The focus is on optical character recognition (OCR) performance. Different evaluation metrics were used to gain an insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that deep learning approaches are the most promising, but there is still a considerable need to develop robust methods that deal with challenges of historic material of this nature. 95 |
96 | 97 | 98 | * [A Realistic Dataset for Performance Evaluation of Document Layout Analysis](https://www.semanticscholar.org/paper/A-Realistic-Dataset-for-Performance-Evaluation-of-Antonacopoulos-Bridson/c4288ec46736acbe7ca1fc54d43f94b19b602450), \[[Website](http://www.primaresearch.org/datasets/Layout_Analysis)\] 99 |
100 | Apostolos Antonacopoulos, David Bridson, Christos Papadopoulos, Stefan Pletschacher ICDAR 2009 101 | There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded. 102 |
103 | -------------------------------------------------------------------------------- /topics/dqa/README.md: -------------------------------------------------------------------------------- 1 | ## Table of contents 2 | 3 | 1. [Papers](#papers) 4 | 1. [Datasets](#datasets) 5 | 6 | 7 | 8 | ## Papers 9 | 10 | #### 2022 11 | 12 | * *[ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding](https://arxiv.org/abs/2210.06155)*, 13 |
14 | Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang EMNLP (Findings) 2022 15 | Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. 16 |
17 | 18 | * **[Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)**, 19 |
20 | Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova arxiv 2022 21 | Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. 22 |
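A hedged inference sketch using the Hugging Face Transformers port of Pix2Struct; the checkpoint name, image path, and question are illustrative assumptions rather than part of the entry above.

```python
# Hedged sketch: document question answering with a Pix2Struct DocVQA checkpoint.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

name = "google/pix2struct-docvqa-base"                 # assumed checkpoint name
processor = Pix2StructProcessor.from_pretrained(name)
model = Pix2StructForConditionalGeneration.from_pretrained(name)

image = Image.open("invoice.png").convert("RGB")       # hypothetical local file
question = "What is the total amount due?"

# For VQA-style checkpoints the question is rendered onto the image by the processor.
inputs = processor(images=image, text=question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(outputs[0], skip_special_tokens=True))
```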
23 | 24 | * **[OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)**, 25 |
26 | Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park ECCV 2022 27 | Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains. 28 |
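A hedged inference sketch of Donut's DocVQA mode via Hugging Face Transformers; the checkpoint name, task prompt format, and file path are assumptions based on the public release, not taken from the entry above.

```python
# Hedged sketch: OCR-free document VQA with Donut.
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

name = "naver-clova-ix/donut-base-finetuned-docvqa"    # assumed checkpoint name
processor = DonutProcessor.from_pretrained(name)
model = VisionEncoderDecoderModel.from_pretrained(name)

image = Image.open("receipt.png").convert("RGB")       # hypothetical local file
prompt = "<s_docvqa><s_question>What is the date of purchase?</s_question><s_answer>"

pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_new_tokens=64)
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()   # drop the task start token
print(processor.token2json(sequence))                        # e.g. {"question": ..., "answer": ...}
```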
29 | 30 | * **[LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)**, 31 |
32 | Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei ACM Multimedia 2022 33 | Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. 34 |
35 | 36 | * [STable: Table Generation Framework for Encoder-Decoder Models](https://arxiv.org/abs/2206.04045) 37 |
38 | Michał Pietruszka, Michał Turski, Łukasz Borchmann, Tomasz Dwojak, Gabriela Pałka, Karolina Szyndler, Dawid Jurkiewicz, Łukasz Garncarek arxiv 2022 39 | The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this constatation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table's content across all random permutations of the factorization order. During the content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and avoid substantial error accumulation, which other sequential models are prone to. Experiments demonstrate a high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%. 40 |
41 | 42 | #### 2021 43 | 44 | * **[Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer](https://arxiv.org/abs/2102.09550)**, 45 |
46 | Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka ICDAR 2021 47 | We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, WikiOps, SROIE). At the same time, we simplify the process by employing an end-to-end model. 48 |
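A conceptual sketch only (not the authors' implementation): the abstract above describes layout entering the model as an attention bias; the distance-based form and the function name below are assumptions meant purely to illustrate that idea.

```python
# Conceptual sketch: bias self-attention towards tokens that are close on the page.
import torch

def layout_attention_bias(boxes: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """boxes: (seq_len, 4) tensor of [x0, y0, x1, y1] normalized to [0, 1]."""
    centers = torch.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                           (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)  # (seq_len, 2)
    dist = torch.cdist(centers, centers)                              # pairwise distances
    return -scale * dist                                              # nearby tokens get a larger bias

# Inside a self-attention layer (scores: (heads, seq_len, seq_len)):
#   scores = scores + layout_attention_bias(boxes)   # broadcasts over heads
#   attn = scores.softmax(dim=-1)
```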
49 | 50 | * **[Open Question Answering over Tables and Text](https://arxiv.org/abs/2010.10439)**, \[[code](https://github.com/wenhuchen/OTT-QA) ![](https://img.shields.io/github/stars/wenhuchen/OTT-QA.svg?style=social)\] 51 |
52 | Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Wang, William W. Cohen ICLR 2021 53 | In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question. Most open QA systems have considered only retrieving information from unstructured text. Here we consider for the first time open QA over both tabular and textual data and present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task. Most questions in OTT-QA require multi-hop inference across tabular data and unstructured text, and the evidence required to answer a question can be distributed in different ways over these two types of input, making evidence retrieval challenging -- our baseline model using an iterative retriever and BERT-based reader achieves an exact match score less than 10%. We then propose two novel techniques to address the challenge of retrieving and aggregating evidence for OTT-QA. The first technique is to use "early fusion" to group multiple highly relevant tabular and textual units into a fused block, which provides more context for the retriever to search for. The second technique is to use a cross-block reader to model the cross-dependency between multiple retrieved evidence with global-local sparse attention. Combining these two techniques improves the score significantly, to above 27%. 54 |
55 | 56 | #### 2020 57 | 58 | * [Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering](https://arxiv.org/abs/2010.02582) 59 |
60 | Wei Han, Hantao Huang, Tao Han COLING 2020 61 | Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin. 62 |
63 | 64 | ## Datasets 65 | 66 | #### 2022 67 | 68 | * [(TAT-DQA) Towards Complex Document Understanding By Discrete Reasoning](https://arxiv.org/abs/2207.11871), \[[website](https://nextplusplus.github.io/TAT-DQA/) ] 69 |
70 | Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, Tat-Seng Chua ACM Multimedia 2022 71 | Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language, which is an emerging research topic for both Natural Language Processing and Computer Vision. In this work, we introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages comprising semi-structured table(s) and unstructured text as well as 16,558 question-answer pairs by extending the TAT-QA dataset. These documents are sampled from real-world financial reports and contain lots of numbers, which means discrete reasoning capability is demanded to answer questions on this dataset. Based on TAT-DQA, we further develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, the performance still lags far behind that of expert humans. We expect that our new TAT-DQA dataset would facilitate the research on deep understanding of visually-rich documents combining vision and language, especially for scenarios that require discrete reasoning. Also, we hope the proposed model would inspire researchers to design more advanced Document VQA models in future. 72 |
73 | 74 | * [HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation](https://arxiv.org/abs/2108.06712) 75 |
76 | Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, Dongmei Zhang ACL 2022 77 | Tables are often created with hierarchies, but existing works on table reasoning mainly focus on flat tables and neglect hierarchical tables. Hierarchical tables challenge existing methods by hierarchical indexing, as well as implicit relationships of calculation and semantics. This work presents HiTab, a free and open dataset to study question answering (QA) and natural language generation (NLG) over hierarchical tables. HiTab is a cross-domain dataset constructed from a wealth of statistical reports (analyses) and Wikipedia pages, and has unique characteristics: (1) nearly all tables are hierarchical, and (2) both target sentences for NLG and questions for QA are revised from original, meaningful, and diverse descriptive sentences authored by analysts and professions of reports. (3) to reveal complex numerical reasoning in statistical analyses, we provide fine-grained annotations of entity and quantity alignment. HiTab provides 10,686 QA pairs and descriptive sentences with well-annotated quantity and entity alignment on 3,597 tables with broad coverage of table hierarchies and numerical reasoning types. Targeting hierarchical structure, we devise a novel hierarchy-aware logical form for symbolic reasoning over tables, which shows high effectiveness. Targeting complex numerical reasoning, we propose partially supervised training given annotations of entity and quantity alignment, which helps models to largely reduce spurious predictions in the QA task. In the NLG task, we find that entity and quantity alignment also helps NLG models to generate better results in a conditional generation setting. Experiment results of state-of-the-art baselines suggest that this dataset presents a strong challenge and a valuable benchmark for future research. 78 |
79 | 80 | * [MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data](https://aclanthology.org/2022.acl-long.454/) 81 |
82 | Yilun Zhao, Yunxiang Li, Chenying Li, Rui Zhang ACL 2022 83 | Numerical reasoning over hybrid data containing both textual and tabular content (e.g., financial reports) has recently attracted much attention in the NLP community. However, existing question answering (QA) benchmarks over hybrid data only include a single flat table in each document and thus lack examples of multi-step numerical reasoning across multiple hierarchical tables. To facilitate data analytical progress, we construct a new large-scale benchmark, MultiHiertt, with QA pairs over Multi Hierarchical Tabular and Textual data. MultiHiertt is built from a wealth of financial reports and has the following unique characteristics: 1) each document contain multiple tables and longer unstructured texts; 2) most of tables contained are hierarchical; 3) the reasoning process required for each question is more complex and challenging than existing benchmarks; and 4) fine-grained annotations of reasoning processes and supporting facts are provided to reveal complex numerical reasoning. We further introduce a novel QA model termed MT2Net, which first applies facts retrieving to extract relevant supporting facts from both tables and text and then uses a reasoning module to perform symbolic reasoning over retrieved facts. We conduct comprehensive experiments on various baselines. The experimental results show that MultiHiertt presents a strong challenge for existing baselines whose results lag far behind the performance of human experts. The dataset and code are publicly available at https://github.com/psunlpgroup/MultiHiertt. 84 |
85 | 86 | #### Older 87 | 88 | * **[DUE: End-to-End Document Understanding Benchmark ](https://openreview.net/forum?id=rNs2FvJGDK)**, [Website](https://duebenchmark.com/) 89 |
90 | Łukasz Borchmann, Michał Pietruszka, Tomasz Stanisławek, Dawid Jurkiewicz, Michał Turski, Karolina Szyndler, Filip Graliński NeurIPS 2021 91 | Understanding documents with rich layouts plays a vital role in digitization and hyper-automation but remains a challenging topic in the NLP research community. Additionally, the lack of a commonly accepted benchmark made it difficult to quantify progress in the domain. To empower research in this field, we introduce the Document Understanding Evaluation (DUE) benchmark consisting of both available and reformulated datasets to measure the end-to-end capabilities of systems in real-world scenarios. The benchmark includes Visual Question Answering, Key Information Extraction, and Machine Reading Comprehension tasks over various document domains and layouts featuring tables, graphs, lists, and infographics. In addition, the current study reports systematic baselines and analyzes challenges in currently available datasets using recent advances in layout-aware language modeling. 92 |
93 | 94 | * **[InfographicVQA](https://arxiv.org/abs/2104.12756)**, [Website](https://www.docvqa.org/) 95 |
96 | Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, C.V. Jawahar WACV 2022 97 | Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images by using a Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics and question-answer annotations. The questions require methods that jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on questions that require elementary reasoning and basic arithmetic skills. For VQA on the dataset, we evaluate two Transformer-based strong baselines. Both the baselines yield unsatisfactory results compared to near perfect human performance on the dataset. The results suggest that VQA on infographics--images that are designed to communicate information quickly and clearly to the human brain--is ideal for benchmarking machine understanding of complex document images. 98 |
99 | 100 | * **[DocVQA: A Dataset for VQA on Document Images](https://arxiv.org/pdf/2007.00398.pdf)**, [Website](http://docvqa.org/) 101 |
102 | Minesh Mathew, Dimosthenis Karatzas, C.V. Jawahar WACV 2021 103 | We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding structure of the document is crucial. The dataset, code and leaderboard are available at this http URL 104 |
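A hedged sketch of the ANLS (Average Normalized Levenshtein Similarity) score commonly reported on DocVQA-style leaderboards; the 0.5 threshold and max-length normalization follow the usual formulation.

```python
# Hedged sketch: ANLS over a list of predictions, each with several reference answers.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    scores = []
    for pred, refs in zip(predictions, gold_answers):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)   # normalized edit distance
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)

print(anls(["$12.50"], [["$12.50", "12.50"]]))   # 1.0
```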
105 | 106 | * [WebSRC: A Dataset for Web-Based Structural Reading Comprehension](https://arxiv.org/pdf/2101.09465.pdf), [Website](https://speechlab-sjtu.github.io/WebSRC/), [code/data](https://github.com/speechlab-sjtu/WebSRC) ![](https://img.shields.io/github/stars/speechlab-sjtu/WebSRC.svg?style=social) 107 |
108 | Lu Chen et al. arXiv 2021 109 | Web search is an essential way for human to obtain information, but it's still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of web-based structural reading comprehension. Given a web page and a question about it, the task is to find an answer from the web page. This task requires a system not only to understand the semantics of texts but also the structure of the web page. Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various strong baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and task are publicly available at this https URL. 110 |
111 | -------------------------------------------------------------------------------- /topics/kie/README.md: -------------------------------------------------------------------------------- 1 | ## Table of contents 2 | 3 | 1. [Papers](#papers) 4 | 1. [Datasets](#datasets) 5 | 1. [Useful links](#useful-links) 6 | 7 | 8 | 9 | ## Papers 10 | 11 | #### 2022 12 | 13 | * *[ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding](https://arxiv.org/abs/2210.06155)*, 14 |
15 | Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang EMNLP (Findings) 2022 16 | Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. 17 |
18 | 19 | * **[Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)**, 20 |
21 | Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova arxiv 2022 22 | Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. 23 |
24 | 25 | * **[XDoc: Unified Pre-training for Cross-Format Document Understanding](https://arxiv.org/abs/2210.02849)**, 26 |
27 | Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei EMNLP 2022 28 | The surge of pre-training has witnessed the rapid development of document understanding recently. Pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at one time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model which deals with different document formats in a single model. For parameter efficiency, we share backbone parameters for different formats such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results have demonstrated that with only 36.7% parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost effective for real-world deployment. 29 |
30 | 31 | * **[OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664)**, 32 |
33 | Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park ECCV 2022 34 | Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains. 35 |
36 | 37 | 38 | * **[LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387)**, 39 |
40 | Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei ACM Multimedia 2022 41 | Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. 42 |
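A hedged sketch of key information extraction framed as token classification with the Hugging Face Transformers port of LayoutLMv3; the checkpoint, label count, OCR words, and boxes below are illustrative assumptions.

```python
# Hedged sketch: label each OCR word on a page with a LayoutLMv3 token classifier.
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=5)

image = Image.open("form.png").convert("RGB")          # hypothetical input page
words = ["Invoice", "No.", "12345"]                    # pre-computed OCR words...
boxes = [[80, 40, 190, 60], [195, 40, 230, 60], [240, 40, 320, 60]]  # ...and boxes on a 0-1000 scale

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.argmax(-1))                       # one label id per (sub)token
```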
43 | 44 | * **[LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669)**, 45 |
46 | Jiapeng Wang, Lianwen Jin, Kai Ding ACL 2022 Main conference 2022 47 | Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages with the corresponding off-the-shelf monolingual/multilingual pre-trained textual models. Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks, which enables language-independent benefit from the pre-training of document layout structure. Code and model are publicly available at this https URL. 48 |
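A hedged sketch of LiLT through its Hugging Face Transformers port, reusing an off-the-shelf RoBERTa tokenizer as the paper describes; the checkpoint name, label count, and toy inputs are assumptions.

```python
# Hedged sketch: language-independent layout model with word boxes on a 0-1000 scale.
import torch
from transformers import AutoTokenizer, LiltForTokenClassification

name = "SCUT-DLVCLab/lilt-roberta-en-base"             # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = LiltForTokenClassification.from_pretrained(name, num_labels=5)

words = ["Total", "amount:", "34.90"]
boxes = [[100, 500, 160, 520], [165, 500, 260, 520], [270, 500, 330, 520]]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Repeat each word's box for every sub-token; special tokens get a zero box.
bbox = [[0, 0, 0, 0] if wid is None else boxes[wid] for wid in encoding.word_ids(0)]
encoding["bbox"] = torch.tensor([bbox])

outputs = model(**encoding)
print(outputs.logits.argmax(-1))
```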
49 | 50 | #### 2021 51 | 52 | * *[SelfDoc: Self-Supervised Document Representation Learning](https://openaccess.thecvf.com/content/CVPR2021/html/Li_SelfDoc_Self-Supervised_Document_Representation_Learning_CVPR_2021_paper.html)*, \[[code/data](https://github.com/microsoft/unilm/tree/master/layoutreader) \] 53 |
54 | Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, Hongfu Liu CVPR 2021 55 | We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document, and it models the contextualization between each block of content. Unlike existing document pre-training models, our model is coarse-grained instead of treating individual words as input, therefore avoiding an overly fine-grained with excessive contextualization. Beyond that, we introduce cross-modal learning in the model pre-training phase to fully leverage multimodal information from unlabeled documents. For downstream usage, we propose a novel modality-adaptive attention mechanism for multimodal feature fusion by adaptively emphasizing language and vision signals. Our framework benefits from self-supervised pre-training on documents without requiring annotations by a feature masking training strategy. It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works. 56 |
57 | 58 | * *[LayoutReader: Pre-training of Text and Layout for Reading Order Detection](https://arxiv.org/abs/2108.11591)*, \[[code/data](https://github.com/microsoft/unilm/tree/master/layoutreader) \] 59 |
60 | Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei EMNLP 2021 61 | Reading order detection is the cornerstone to understanding visually-rich documents (e.g., receipts and forms). Unfortunately, no existing work took advantage of advanced deep learning models because it is too laborious to annotate a large enough dataset. We observe that the reading order of WORD documents is embedded in their XML metadata; meanwhile, it is easy to convert WORD documents to PDFs or images. Therefore, in an automated manner, we construct ReadingBank, a benchmark dataset that contains reading order, text, and layout information for 500,000 document images covering a wide spectrum of document types. This first-ever large-scale dataset unleashes the power of deep neural networks for reading order detection. Specifically, our proposed LayoutReader captures the text and layout information for reading order prediction using the seq2seq model. It performs almost perfectly in reading order detection and significantly improves both open-source and commercial OCR engines in ordering text lines in their results in our experiments. 62 |
63 | 64 | * *[MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction](https://arxiv.org/abs/2106.12940)* 65 |
66 | Guozhi Tang, Lele Xie, Lianwen Jin, Jiapeng Wang, Jingdong Chen, Zhen Xu, Qianying Wang, Yaqiang Wu, Hui Li IJCAI 2021 67 | Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling problem or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features, such as font, color, layout. But simply introducing multimodal features couldn't work well when faced with numeric semantic categories or some ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass the recognitions to various semantics, and simply focuses on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE can significantly outperform previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values and it is a good complement to the existing methods. 68 |
69 | 70 | * *[StrucTexT: Structured Text Understanding with Multi-Modal Transformers](https://arxiv.org/abs/2108.02923)* 71 |
72 | Yulin Li, Yuxi Qian, Yuchen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding ACM Multimedia 2021 73 | Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. However, little work has been concerned with the solutions that efficiently extract the structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate the multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets. 74 |
75 | 76 | * **[DocFormer: End-to-End Transformer for Document Understanding](https://arxiv.org/abs/2106.11539)** 77 |
78 | Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha ICCV 2021 79 | We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters). 80 |
81 | 82 | * **[Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer](https://arxiv.org/abs/2102.09550)**, 83 |
84 | Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, Gabriela Pałka ICDAR 2021 85 | We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, WikiOps, SROIE). At the same time, we simplify the process by employing an end-to-end model. 86 |
87 | 88 | * **[LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740)**, \[[code](https://huggingface.co/transformers/model_doc/layoutlmv2.html) \] 89 |
90 | Yang Xu et al. ACL 2021 91 | Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). 92 |
93 | 94 | * **[LAMBERT: Layout-Aware (Language) Modeling using BERT for information extraction](https://arxiv.org/abs/2002.08087)**, \[[code](https://github.com/applicaai/lambert) \] 95 |
96 | Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Michał Turski, Filip Graliński ICDAR 2021 97 | In this paper we introduce a novel approach to the problem of understanding documents where the local semantics is influenced by non-trivial layout. Namely, we modify the Transformer architecture in a way that allows it to use the graphical features defined by the layout, without the need to re-learn the language semantics from scratch, thanks to starting the training process from a model pretrained on classical language modeling tasks. SOTA on [SROIE leaderboard](https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3) 98 |
99 | 100 | * **[ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents](https://arxiv.org/abs/2105.11672)** 101 |
102 | Weihong Lin, Qifang Gao, Lei Sun, Zhuoyao Zhong, Kai Hu, Qin Ren, Qiang Huo ICDAR 2021 103 | Recent grid-based document representations like BERTgrid allow the simultaneous encoding of the textual and layout information of a document in a 2D feature map so that state-of-the-art image segmentation and/or object detection models can be straightforwardly leveraged to extract key information from documents. However, such methods have not achieved comparable performance to state-of-the-art sequence- and graph-based methods such as LayoutLM and PICK yet. In this paper, we propose a new multi-modal backbone network by concatenating a BERTgrid to an intermediate layer of a CNN model, where the input of CNN is a document image and the BERTgrid is a grid of word embeddings, to generate a more powerful grid-based document representation, named ViBERTgrid. Unlike BERTgrid, the parameters of BERT and CNN in our multimodal backbone network are trained jointly. Our experimental results demonstrate that this joint training strategy improves significantly the representation ability of ViBERTgrid. Consequently, our ViBERTgrid-based key information extraction approach has achieved state-of-the-art performance on real-world datasets. 104 |
105 | 106 | * [LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding](https://arxiv.org/abs/2104.08405) 107 |
108 | Te-Lin Wu, Cheng Li, Mingyang Zhang, Tao Chen, Spurthi Amba Hombaiah, Michael Bendersky arxiv 2021 109 | Document layout comprises both structural and visual (eg. font-sizes) information that is vital but often ignored by machine learning models. The few existing models which do use layout information only consider textual contents, and overlook the existence of contents in other modalities such as images. Additionally, spatial interactions of presented contents in a layout were never really fully exploited. To bridge this gap, we parse a document into content blocks (eg. text, table, image) and propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document. Our LAMPreT encodes each block with a multimodal transformer in the lower-level and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pretraining objectives where the lower-level model is trained similarly to multimodal grounding models, and the higher-level model is trained with our proposed novel layout-aware objectives. We evaluate the proposed model on two layout-aware tasks -- text block filling and image suggestion and show the effectiveness of our proposed hierarchical architecture as well as pretraining techniques. 110 |
111 | 112 | * **[LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf)**, \[[code/data](https://github.com/microsoft/unilm/tree/master/layoutxlm) \] 113 |
114 | Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei arxiv 2021 115 | Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUN dataset. 116 |
117 | 118 | * [Glean: Structured Extractions from Templatic Documents](https://research.google/pubs/pub50092/) 119 |
120 | Sandeep Tata et al. Proceedings of the VLDB Endowment 2021 121 | Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges : 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture. 122 |
123 | 124 | * [Improving Information Extraction from Visually Rich Documents using Visual Span Representations](https://www.researchgate.net/publication/348559404_Improving_Information_Extraction_from_Visually_Rich_Documents_using_Visual_Span_Representations) 125 |
126 | Ritesh Sarkhel, Arnab Nandi ResearchGate 2021 127 | Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) tasks perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis-a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is twofold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human-labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account. It identifies the visual span(s) containing a named entity by leveraging this learned representation followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score. 128 |
129 | 130 | 131 | #### 2020 132 | 133 | * **[BROS: A Pre-trained Language Model for Understanding Texts in Document](https://openreview.net/pdf?id=punMXQEsPr0)** 134 |
135 | Teakgyu Hong, DongHyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park openreview.net 2020 136 | Understanding documents from their visual snapshots is an emerging and challenging problem that requires both advanced computer vision and NLP methods. Although the recent advance in OCR enables the accurate extraction of text segments, it is still challenging to extract key information from documents due to the diversity of layouts. To compensate for the difficulties, this paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), that represents and understands the semantics of spatially distributed texts. Different from previous pre-training methods on 1D text, BROS is pre-trained on large-scale semi-structured documents with a novel area-masking strategy while efficiently including the spatial layout information of input documents. Also, to generate structured outputs in various document understanding tasks, BROS utilizes a powerful graph-based decoder that can capture the relation between text segments. BROS achieves state-of-the-art results on four benchmark tasks: FUNSD, SROIE*, CORD, and SciTSR. Our experimental settings and implementation codes will be publicly available. 137 |
138 | 139 | * **[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://www.microsoft.com/en-us/research/publication/layoutlm-pre-training-of-text-and-layout-for-document-image-understanding/)**, \[[code](https://github.com/microsoft/unilm) ![](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social)\] 140 |
141 | Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou KDD 2020 142 | Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread of pre-training models for NLP applications, they almost focused on text-level manipulation, while neglecting the layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model the interaction between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage the image features to incorporate the visual information of words into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available on GitHub. 143 |
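A hedged sketch of the preprocessing step LayoutLM-style models rely on: OCR boxes in pixel coordinates are scaled to a 0-1000 grid before being fed to the model; the page size and box values below are made up for illustration.

```python
# Hedged sketch: scale a pixel-space OCR box to the 0-1000 grid used by LayoutLM inputs.
def normalize_box(box, page_width, page_height):
    """box: (x0, y0, x1, y1) in pixels -> [x0, y0, x1, y1] on a 0-1000 scale."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

print(normalize_box((150, 300, 450, 330), page_width=1224, page_height=1584))
```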
144 | 145 | * **[Representation Learning for Information Extraction from Form-like Documents](https://www.aclweb.org/anthology/2020.acl-main.580/)**, \[[code](https://github.com/Praneet9/Representation-Learning-for-Information-Extraction) ![](https://img.shields.io/github/stars/Praneet9/Representation-Learning-for-Information-Extraction.svg?style=social)\] 146 |
147 | Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork ACL 2020 148 | We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains but are also interpretable, as we show using loss cases. 149 |
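A conceptual sketch only (not the paper's code) of the candidate-based recipe described above: typed candidates (here, date-like spans) are generated first and then characterized by the words around them; the regex, distance threshold, and toy inputs are assumptions.

```python
# Conceptual sketch: generate date candidates and collect neighbouring words as context.
import re

DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")

def date_candidates(words, boxes):
    """words: list[str]; boxes: list[(x0, y0, x1, y1)] in pixels."""
    candidates = []
    for i, (word, box) in enumerate(zip(words, boxes)):
        if DATE_PATTERN.fullmatch(word):
            # Words on (roughly) the same line act as the candidate's textual context.
            neighbours = [w for j, w in enumerate(words)
                          if j != i and abs(boxes[j][1] - box[1]) < 15]
            candidates.append({"value": word, "box": box, "neighbour_text": " ".join(neighbours)})
    return candidates

words = ["Invoice", "Date:", "03/15/2020", "Due", "Date:", "04/14/2020"]
boxes = [(50, 100, 110, 115), (115, 100, 150, 115), (155, 100, 230, 115),
         (50, 130, 80, 145), (85, 130, 120, 145), (125, 130, 200, 145)]
for c in date_candidates(words, boxes):
    print(c["value"], "<-", c["neighbour_text"])
```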
150 | 151 | * **[PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks](https://arxiv.org/abs/2004.07464)**, \[[code](https://github.com/wenwenyu/PICK-pytorch) ![](https://img.shields.io/github/stars/wenwenyu/PICK-pytorch.svg?style=social) \] 152 |
153 | Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao ICPR 2020 154 | Computer vision with state-of-the-art deep learning models has achieved huge success in the field of Optical Character Recognition (OCR) including text detection and recognition tasks recently. However, Key Information Extraction (KIE) from documents as the downstream task of OCR, having a large number of use scenarios in real-world, remains a challenge because documents not only have textual features extracting from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently make full use of both textual and visual features of the documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baselines methods by significant margins. 155 |
156 | 157 | * [Attention-Based Graph Neural Network with Global Context Awareness for Document Understanding](https://www.aclweb.org/anthology/2020.ccl-1.79.pdf) 158 |
159 | Yuan Hua, Z. Huang, J. Guo, Weidong Qiu CCL 2020 160 | Information extraction from documents such as receipts or invoices is a fundamental and crucial step for office automation. Many approaches focus on extracting entities and relationships from plain texts, however, when it comes to document images, such demand becomes quite challenging since visual and layout information are also of great significance to help tackle this problem. In this work, we propose the attention-based graph neural network to combine textual and visual information from document images. Moreover, the global node is introduced in our graph construction algorithm which is used as a virtual hub to collect the information from all the nodes and edges to help improve the performance. Extensive experiments on real-world datasets show that our method outperforms baseline methods by significant margins. 161 |
162 | 163 | * [Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning](https://arxiv.org/pdf/2009.14457.pdf) 164 |
165 | Subhojeet Pramanik, Shashank Mujumdar, Hima Patel arxiv 2020 166 | In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation. We design the network architecture and the pre-training tasks to incorporate the multi-modal document information across text, layout, and image dimensions and allow the network to work with multi-page documents. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks such as document classification, document information extraction, and document retrieval. We conduct exhaustive experiments to compare performance against different ablations of our framework and state-of-the-art baselines. We discuss the current limitations and next steps for our work. 167 |
168 | 169 | * [Merge and Recognize: A Geometry and 2D Context Aware Graph Model for Named Entity Recognition from Visual Documents](https://www.aclweb.org/anthology/2020.textgraphs-1.3/) 170 |
171 | Chuwei Luo, Yongpan Wang, Qi Zheng, Liangchen Li, Feiyu Gao, Shiyu Zhang COLING 2020 172 | Named entity recognition (NER) from visual documents, such as invoices, receipts or business cards, is a critical task for visual document understanding. Most classical approaches use a sequence-based model (typically a BiLSTM-CRF framework) without considering document structure. Recent work on graph-based models using graph convolutional networks to encode visual and textual features has achieved promising performance on the task. However, few attempts take the geometry information of text segments (text in bounding boxes) in visual documents into account. Meanwhile, existing methods do not consider that related text segments often need to be merged to form a complete entity in many real-world situations. In this paper, we present GraphNEMR, a graph-based model that uses graph convolutional networks to jointly merge text segments and recognize named entities. By incorporating geometry information from visual documents into our model, richer 2D context information is generated to improve document representations. To merge text segments, we introduce a novel mechanism that captures both geometry information as well as semantic information based on a pre-trained language model. Experimental results show that the proposed GraphNEMR model outperforms both sequence-based and graph-based SOTA methods significantly. 173 |
174 | 175 | * [TRIE: End-to-End Text Reading and Information Extraction for Document Understanding](https://dl.acm.org/doi/10.1145/3394171.3413900) 176 |
177 | Peng Zhang et al. Proceedings of the 28th ACM International Conference on Multimedia 2020 178 | Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks, (1) text reading for detecting and recognizing texts in images and (2) information extraction for analyzing and extracting key elements from previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and, in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy. 179 |
180 | 181 | * [Robust Layout-aware IE for Visually Rich Documents with Pretrained Language Models](https://arxiv.org/pdf/2005.11017.pdf) 182 |
183 | Mengxi Wei, Yifan He, Qiong Zhang ACM SIGIR 2020 184 | Many business documents processed in modern NLP and IR pipelines are visually rich: in addition to text, their semantics can also be captured by visual traits such as layout, format, and fonts. We study the problem of information extraction from visually rich documents (VRDs) and present a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents. We further introduce new fine-tuning objectives to improve in-domain unsupervised fine-tuning to better utilize large amounts of unlabeled in-domain data. We experiment on real-world invoice and resume data sets and show that the proposed method outperforms strong text-based RoBERTa baselines by 6.3% absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a few-shot setting, our method requires up to 30x less annotation data than the baseline to achieve the same level of performance at ~90% F1. 185 |
186 | 187 | * [End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks](https://www.aclweb.org/anthology/2020.spnlp-1.6/) 188 |
189 | Clément Sage et al. EMNLP 2020 190 | The predominant approaches for extracting key information from documents resort to classifiers predicting the information type of each word. However, the word level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. The ability of our end-to-end method to retrieve structured information is assessed on a large set of business documents. We show that it performs competitively with a standard word classifier without requiring costly word level supervision. 191 |
192 | 193 | * [Information Extraction from Text Intensive and Visually Rich Banking Documents](https://www.sciencedirect.com/science/article/pii/S0306457320308566) 194 |
195 | Berke Oral et al. Information Processing & Management 2020 196 | Document types, where visual and textual information plays an important role in their analysis and understanding, pose a new and attractive area for information extraction research. Although cheques, invoices, and receipts have been studied in some previous multi-modal studies, banking documents present an unexplored area due to the naturalness of the text they possess in addition to their visual richness. This article presents the first study which uses visual and textual information for deep-learning based information extraction on text-intensive and visually rich scanned documents which are, in this instance, unstructured banking documents, or more precisely, money transfer orders. The impact of using different neural word representations (i.e., FastText, ELMo, and BERT) on IE subtasks (namely, named entity recognition and relation extraction stages), positional features of words on document images and auxiliary learning with some other tasks are investigated. The article proposes a new relation extraction algorithm based on graph factorization to solve the complex relation extraction problem where the relations within documents are n-ary, nested, document-level, and previously indeterminate in quantity. Our experiments revealed that the use of deep learning algorithms yielded around 10 percentage points improvement on the IE sub-tasks. The inclusion of word positional features yielded around 3 percentage points of improvement in some specific information fields. Similarly, our auxiliary learning experiments yielded around 2 percentage points of improvement on some information fields associated with the specific transaction type detected by our auxiliary task. The integration of the information extraction system into a real banking environment reduced cycle times substantially. When compared to the manual workflow, document processing pipeline shortened book-to-book money transfers to 10 minutes (from 29 min.) and electronic fund transfers (EFT) to 17 minutes (from 41 min.) respectively. 197 |
198 | 199 | #### 2019 200 | 201 | * **[Graph Convolution for Multimodal Information Extraction from Visually Rich Documents](https://arxiv.org/abs/1903.11279)** 202 |
203 | Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao NAACL 2019 204 | Visually rich documents (VRDs) are ubiquitous in daily business and life. Examples are purchase receipts, insurance policy documents, custom declaration forms and so on. In VRDs, visual and layout information is critical for document understanding, and texts in such documents cannot be serialized into the one-dimensional sequence without losing information. Classic information extraction models such as BiLSTM-CRF typically operate on text sequences and do not incorporate visual features. In this paper, we introduce a graph convolution based model to combine textual and visual information presented in VRDs. Graph embeddings are trained to summarize the context of a text segment in the document, and further combined with text embeddings for entity extraction. Extensive experiments have been conducted to show that our method outperforms BiLSTM-CRF baselines by significant margins, on two real-world datasets. Additionally, ablation studies are also performed to evaluate the effectiveness of each component of our model. 205 |
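The graph formulation described above is straightforward to prototype: treat every OCR text segment as a node, connect spatially nearby segments, and propagate features along the edges. Below is a minimal sketch of that idea in plain PyTorch (not the authors' implementation); the neighbourhood size and feature dimensions are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): build a k-NN graph over OCR text
# segments and run one graph-convolution step with plain PyTorch.
import torch

def knn_adjacency(centers: torch.Tensor, k: int = 4) -> torch.Tensor:
    """centers: (N, 2) box centers -> (N, N) row-normalized adjacency."""
    dist = torch.cdist(centers, centers)            # pairwise distances
    dist.fill_diagonal_(float("inf"))               # ignore self-distances
    knn = dist.topk(k, largest=False).indices       # (N, k) nearest segments
    adj = torch.zeros(len(centers), len(centers))
    adj.scatter_(1, knn, 1.0)
    adj = adj + torch.eye(len(centers))             # add self connections
    return adj / adj.sum(dim=1, keepdim=True)       # row-normalize

class GraphConv(torch.nn.Module):
    """One propagation step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        return torch.relu(adj @ self.linear(feats))

# Toy usage: 5 text segments with 2D centers and 32-dim text embeddings.
centers = torch.rand(5, 2)
text_feats = torch.rand(5, 32)
adj = knn_adjacency(centers, k=2)
context_feats = GraphConv(32, 32)(text_feats, adj)  # layout-aware features
```

The resulting context features would then be concatenated with the per-segment text embeddings before the entity tagger, mirroring the setup the abstract describes.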
206 | 207 | * **[GraphIE: A Graph-Based Framework for Information Extraction](https://www.aclweb.org/anthology/N19-1082/)**, \[[code](https://github.com/thomas0809/GraphIE) ![](https://img.shields.io/github/stars/thomas0809/GraphIE.svg?style=social)\] 208 |
209 | Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo, Regina Barzilay NAACL 2019 210 | Most modern Information Extraction (IE) systems are implemented as sequential taggers and only model local dependencies. Non-local and non-sequential context is, however, a valuable source of information to improve predictions. In this paper, we introduce GraphIE, a framework that operates over a graph representing a broad set of dependencies between textual units (i.e. words or sentences). The algorithm propagates information between connected nodes through graph convolutions, generating a richer representation that can be exploited to improve word-level predictions. Evaluation on three different tasks — namely textual, social media and visual information extraction — shows that GraphIE consistently outperforms the state-of-the-art sequence tagging model by a significant margin. 211 |
212 | 213 | * **[Attend, Copy, Parse: End-to-end information extraction from documents](https://arxiv.org/pdf/1812.07248.pdf)** \[[code - unofficial](https://github.com/naiveHobo/InvoiceNet) ![](https://img.shields.io/github/stars/naiveHobo/InvoiceNet.svg?style=social)\] 214 |
215 | Rasmus Berg Palm, Florian Laws, Ole Winther ICDAR 2019 216 | Document information extraction tasks performed by humans create data consisting of a PDF or document image input, and extracted string outputs. This end-to-end data is naturally consumed and produced when performing the task because it is valuable in and of itself. It is naturally available, at no additional cost. Unfortunately, state-of-the-art word classification methods for information extraction cannot use this data, instead requiring word-level labels which are expensive to create and consequently not available for many real life tasks. In this paper we propose the Attend, Copy, Parse architecture, a deep neural network model that can be trained directly on end-to-end data, bypassing the need for word-level labels. We evaluate the proposed architecture on a large diverse set of invoices, and outperform a state-of-the-art production system based on word classification. We believe our proposed architecture can be used on many real life information extraction tasks where word classification cannot be used due to a lack of the required word-level labels. 217 |
218 | 219 | * [One-shot Information Extraction from Document Images using Neuro-Deductive Program Synthesis](https://arxiv.org/abs/1906.02427) 220 |
221 | Vishal Sunder, Ashwin Srinivasan, Lovekesh Vig, Gautam Shroff, Rohit Rahul arxiv 2019 222 | Our interest in this paper is in meeting a rapidly growing industrial demand for information extraction from images of documents such as invoices, bills, receipts, etc. In practice, users are able to provide a very small number of example images labeled with the information that needs to be extracted. We adopt a novel two-level neuro-deductive approach where (a) we use pre-trained deep neural networks to populate a relational database with facts about each document-image; and (b) we use a form of deductive reasoning, related to meta-interpretive learning of transition systems, to learn extraction programs: Given task-specific transitions defined using the entities and relations identified by the neural detectors and a small number of instances (usually 1, sometimes 2) of images and the desired outputs, a resource-bounded meta-interpreter constructs proofs for the instance(s) via logical deduction; a set of logic programs that extract each desired entity is easily synthesized from such proofs. In most cases a single training example together with a noisy clone of itself suffices to learn a program-set that generalizes well on test documents, at which time the value of each entity is determined by a majority vote across its program-set. We demonstrate our two-level neuro-deductive approach on publicly available datasets ("Patent" and "Doctor's Bills") and also describe its use in a real-life industrial problem. 223 |
224 | 225 | * [EATEN: Entity-aware Attention for Single Shot Visual Text Extraction](https://arxiv.org/pdf/1909.09380.pdf), \[[code](https://github.com/beacandler/EATEN) ![](https://img.shields.io/github/stars/beacandler/EATEN.svg?style=social)\] 226 |
227 | He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding ICDAR 2019 228 | Extracting entity from images is a crucial part of many OCR applications, such as entity recognition of cards, invoices, and receipts. Most of the existing works employ classical detection and recognition paradigm. This paper proposes an Entity-aware Attention Text Extraction Network called EATEN, which is an end-to-end trainable system to extract the entities without any post-processing. In the proposed framework, each entity is parsed by its corresponding entity-aware decoder, respectively. Moreover, we innovatively introduce a state transition mechanism which further improves the robustness of entity extraction. In consideration of the absence of public benchmarks, we construct a dataset of almost 0.6 million images in three real-world scenarios (train ticket, passport and business card), which is publicly available at https://github.com/beacandler/EATEN. To the best of our knowledge, EATEN is the first single shot method to extract entities from images. Extensive experiments on these benchmarks demonstrate the state-of-the-art performance of EATEN. 229 |
230 | 231 | 232 | * [End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net](https://bmvc2019.org/wp-content/uploads/papers/0870-paper.pdf) 233 |
234 | Tuan Nguyen Dang, Dat Nguyen Thanh BMVC 2019 235 | Information extraction from document images has received a lot of attention recently, due to the need for digitizing a large volume of unstructured documents such as invoices, receipts, bank transfers, etc. In this paper, we propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the Multi-Stage Attentional U-Net. To effectively capture the textual and spatial relations between 2D elements, our model leverages a specialized multi-stage encoder-decoders design, in conjunction with efficient uses of the self-attention mechanism and the box convolution. Experimental results on different datasets show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters. Moreover, it also significantly improved the baseline in erroneous OCR and limited training data scenario, thus becomes practical for real-world applications. 236 |
237 | 238 | 239 | #### Older 240 | 241 | * **[Chargrid: Towards Understanding 2D Documents](https://arxiv.org/pdf/1809.08799v1.pdf)**, \[[code - unofficial](https://github.com/sciencefictionlab/chargrid-pytorch) ![](https://img.shields.io/github/stars/sciencefictionlab/chargrid-pytorch.svg?style=social)\] 242 |
243 | Anoop R Katti et al. EMNLP 2018 244 | We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images. 245 |
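A minimal sketch of the chargrid encoding described above (not the authors' code): each OCR character is rasterized into a down-sampled 2D grid of vocabulary indices, which can then be one-hot encoded or embedded and fed to a convolutional encoder-decoder. The vocabulary and down-scaling factor are illustrative assumptions.

```python
# Minimal sketch of the chargrid idea: rasterize OCR characters into a 2D grid
# of vocabulary indices, 0 = background.
import numpy as np

def build_chargrid(ocr_chars, page_w, page_h, vocab, downscale=4):
    """ocr_chars: iterable of (char, x0, y0, x1, y1) in page pixel coords."""
    grid = np.zeros((page_h // downscale, page_w // downscale), dtype=np.int64)
    for ch, x0, y0, x1, y1 in ocr_chars:
        idx = vocab.get(ch.lower(), vocab["<unk>"])
        grid[y0 // downscale:max(y1 // downscale, y0 // downscale + 1),
             x0 // downscale:max(x1 // downscale, x0 // downscale + 1)] = idx
    return grid  # embed / one-hot this grid and feed it to a conv encoder-decoder

# Toy usage with an illustrative vocabulary.
vocab = {"<unk>": 1, "a": 2, "b": 3, "1": 4}
chars = [("a", 40, 40, 52, 60), ("1", 56, 40, 68, 60)]
chargrid = build_chargrid(chars, page_w=800, page_h=1000, vocab=vocab)
print(chargrid.shape)  # (250, 200)
```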
246 | 247 | * **[CloudScan - A Configuration-Free Invoice Analysis System Using Recurrent Neural Networks](https://arxiv.org/abs/1708.07403)**, \[[code - unofficial](https://github.com/naiveHobo/InvoiceNet/tree/cloudscan) ![](https://img.shields.io/github/stars/naiveHobo/InvoiceNet.svg?style=social)\] 248 |
249 | Rasmus Berg Palm, Ole Winther, Florian Laws ICDAR 2017 250 | We present CloudScan; an invoice analysis system that requires zero configuration or upfront annotation. In contrast to previous work, CloudScan does not rely on templates of invoice layout, instead it learns a single global model of invoices that naturally generalizes to unseen invoice layouts. The model is trained using data automatically extracted from end-user provided feedback. This automatic training data extraction removes the requirement for users to annotate the data precisely. We describe a recurrent neural network model that can capture long range context and compare it to a baseline logistic regression model corresponding to the current CloudScan production system. We train and evaluate the system on 8 important fields using a dataset of 326,471 invoices. The recurrent neural network and baseline model achieve 0.891 and 0.887 average F1 scores respectively on seen invoice layouts. For the harder task of unseen invoice layouts, the recurrent neural network model outperforms the baseline with 0.840 average F1 compared to 0.788. 251 |
252 | 253 | * [Field Extraction by Hybrid Incremental and a-priori Structural Templates](http://www.cvc.uab.es/~marcal/pdfs/DAS18e.pdf) 254 |
255 | Vincent Poulain d'Andecy, Emmanuel Hartmann, Marçal Rusiñol DAS 2018 256 | In this paper, we present an incremental framework for extracting information fields from administrative documents. First, we demonstrate some limits of the existing state-of-the-art methods, such as the delay in reaching system efficiency. This is a concern in an industrial context when we have only a few samples of each document class. Based on this analysis, we propose a hybrid system combining incremental learning by means of itf-df statistics and a-priori generic models. We report in the experimental section our results obtained with a dataset of real invoices. 257 |
258 | 259 | * [Multidomain Document Layout Understanding using Few Shot Object Detection](https://www.researchgate.net/publication/327173179_Multidomain_Document_Layout_Understanding_using_Few_Shot_Object_Detection) 260 |
261 | Pranaydeep Singh, Srikrishna Varadarajan, Ankit Narayan Singh, Muktabh Mayank Srivastava arxiv 2018 262 | We try to address the problem of document layout understanding using a simple algorithm which generalizes across multiple domains while training on just few examples per domain. We approach this problem via supervised object detection method and propose a methodology to overcome the requirement of large datasets. We use the concept of transfer learning by pre-training our object detector on a simple artificial (source) dataset and fine-tuning it on a tiny domain specific (target) dataset. We show that this methodology works for multiple domains with training samples as less as 10 documents. We demonstrate the effect of each component of the methodology in the end result and show the superiority of this methodology over simple object detectors. 263 |
264 | 265 | * [Extracting structured data from invoices](https://www.aclweb.org/anthology/U18-1006/) 266 |
267 | Xavier Holt, Andrew Chisholm ALTA 2018 268 | Business documents encode a wealth of information in a format tailored to human consumption – i.e. aesthetically disbursed natural language text, graphics and tables. We address the task of extracting key fields (e.g. the amount due on an invoice) from a wide-variety of potentially unseen document formats. In contrast to traditional template driven extraction systems, we introduce a content-driven machine-learning approach which is both robust to noise and generalises to unseen document formats. In a comparison of our approach with alternative invoice extraction systems, we observe an absolute accuracy gain of 20\% across compared fields, and a 25\%–94\% reduction in extraction latency. 269 |
270 | 271 | * [Automatic and interactive rule inference without ground truth](https://hal.inria.fr/hal-01197470/document) 272 |
273 | Cérès Carton, Aurélie Lemaitre, Bertrand Coüasnon ICDAR 2015 274 | Dealing with non annotated documents for the design of a document recognition system is not an easy task. In general, statistical methods cannot learn without an annotated ground truth, unlike syntactical methods. However their ability to deal with non annotated data comes from the fact that the description is manually made by a user. The adaptation to a new kind of document is then tedious as the whole manual process of extraction of knowledge has to be redone. In this paper, we propose a method to extract knowledge and generate rules without any ground truth. Using large volume of non annotated documents, it is possible to study redundancies of some extracted elements in the document images. The redundancy is exploited through an automatic clustering algorithm. An interaction with the user brings semantic to the detected clusters. In this work, the extracted elements are some keywords extracted with word spotting. This approach has been applied to old marriage record field detection on the FamilySearch HIP2013 competition database. The results demonstrate that we successfully automatically infer rules from non annotated documents using the redundancy of extracted elements of the documents. 275 |
276 | 277 | * [Semantic Label and Structure Model based Approach for Entity Recognition in Database Context](https://www.researchgate.net/publication/308809442_Semantic_Label_and_Structure_Model_based_Approach_for_Entity_Recognition_in_Database_Context) 278 |
279 | Nihel Kooli, Abdel Belaïd ICDAR 2015 280 | This paper proposes an entity recognition approach in scanned documents referring to their description in database records. First, using the database record values, the corresponding document fields are labeled. Second, entities are identified by their labels and ranked using a TF/IDF based score. For each entity, local labels are grouped into a graph. This graph is matched with a graph model (structure model) which represents geometric structures of local entity labels using a specific cost function. This model is trained on a set of well chosen entities semi-automatically annotated. At the end, a correction step allows us to complete the eventual entity mislabeling using geometrical relationships between labels. The evaluation on 200 business documents containing 500 entities reaches about 93% for recall and 97% for precision. 281 |
282 | 283 | * [Combining Visual and Textual Features for Information Extraction from Online Flyers](https://www.aclweb.org/anthology/N15-1032/) 284 |
285 | Emilia Apostolova, Noriko Tomuro NAACL 2015 286 | Information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. In particular, genres such as marketing flyers and info-graphics often augment textual information by its color, size, positioning, etc. As a result, traditional text-based approaches to information extraction (IE) could underperform. In this study, we present a supervised machine learning approach to IE from on-line commercial real estate flyers. We evaluated the performance of SVM classifiers on the task of identifying 12 types of named entities using a combination of textual and visual features. Results show that the addition of visual features such as color, size, and positioning significantly increased classifier performance. 287 |
288 | 289 | * **[From one tree to a forest: a unified solution for structured web data extraction](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/StructedDataExtraction_SIGIR2011.pdf)**, \[[Website](https://archive.codeplex.com/?p=swde)\] 290 |
291 | Qiang Hao, Ruiqiong Chu Cai, Yanwei Pang, Lei Y Zhang ACM SIGIR 2011 292 | Structured data, in the form of entities and associated attributes, has been a rich web resource for search engines and knowledge databases. To efficiently extract structured data from enormous websites in various verticals (e.g., books, restaurants), much research effort has been attracted, but most existing approaches either require considerable human effort or rely on strong features that lack of flexibility. We consider an ambitious scenario -- can we build a system that (1) is general enough to handle any vertical without re-implementation and (2) requires only one labeled example site from each vertical for training to automatically deal with other sites in the same vertical? In this paper, we propose a unified solution to demonstrate the feasibility of this scenario. Specifically, we design a set of weak but general features to characterize vertical knowledge (including attribute-specific semantics and inter-attribute layout relationships). Such features can be adopted in various verticals without redesign; meanwhile, they are weak enough to avoid overfitting of the learnt knowledge to seed sites. Given a new unseen site, the learnt knowledge is first applied to identify page-level candidate attribute values, while inevitably involve false positives. To remove noise, site-level information of the new site is then exploited to boost up the true values. The site-level information is derived in an unsupervised manner, without harm to the applicability of the solution. Promising experimental performance on 80 websites in 8 distinct verticals demonstrated the feasibility and flexibility of the proposed solution. 293 |
294 | 295 | * [A probabilistic approach to printed document understanding](https://link.springer.com/article/10.1007/s10032-010-0137-1) 296 |
297 | Eric Medvet, Alberto Bartoli, Giorgio Davanzo IJDAR 2010 298 | We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists in finding the sequence of blocks, which maximizes the corresponding probability for that class. We evaluated experimentally our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results—e.g., a success rate often greater than 90% even for classes with just two samples. 299 |
300 | 301 | 302 | ## Datasets 303 | 304 | 305 | #### English 306 | 307 | * **[DocILE Benchmark for Document Information Localization and Extraction](https://arxiv.org/abs/2302.05658)**, \[[Website](https://docile.rossum.ai)\] \[[benchmark](https://rrc.cvc.uab.es/?ch=26)\] \[[code](https://github.com/rossumai/docile) ![](https://img.shields.io/github/stars/rossumai/docile.svg?style=social)\] 308 |
309 | Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas arxiv pre-print 2023 310 | This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available at this https URL. 311 |
312 | 313 | * **[DUE: End-to-End Document Understanding Benchmark](https://openreview.net/forum?id=rNs2FvJGDK)**, \[[Website](https://duebenchmark.com/leaderboard) \] 314 |
315 | Łukasz Borchmann, Michał Pietruszka, Tomasz Stanislawek, Dawid Jurkiewicz, Michał Turski, Karolina Szyndler, Filip Graliński NIPS 2021 316 | Understanding documents with rich layouts plays a vital role in digitization and hyper-automation but remains a challenging topic in the NLP research community. Additionally, the lack of a commonly accepted benchmark made it difficult to quantify progress in the domain. To empower research in this field, we introduce the Document Understanding Evaluation (DUE) benchmark consisting of both available and reformulated datasets to measure the end-to-end capabilities of systems in real-world scenarios. The benchmark includes Visual Question Answering, Key Information Extraction, and Machine Reading Comprehension tasks over various document domains and layouts featuring tables, graphs, lists, and infographics. In addition, the current study reports systematic baselines and analyses challenges in currently available datasets using recent advances in layout-aware language modeling. 317 |
318 | 319 | * [Spatial Dual-Modality Graph Reasoning for Key Information Extraction](https://arxiv.org/pdf/2103.14470.pdf), \[[Website](https://mmocr.readthedocs.io/en/latest/datasets.html#key-information-extraction) \] 320 |
321 | Hongbin Sun, Zhanghui Kuang, Xiaoyu Yue, Chenhao Lin, Wayne Zhang arxiv 2021 322 | Key information extraction from document images is of paramount importance in office automation. Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors. In this paper, we propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We model document images as dual-modality graphs, nodes of which encode both the visual and textual features of detected text regions, and edges of which represent the spatial relations between neighboring text regions. The key information extraction is solved by iteratively propagating messages along graph edges and reasoning the categories of graph nodes. In order to roundly evaluate our proposed method as well as boost the future research, we release a new dataset named WildReceipt, which is collected and annotated tailored for the evaluation of key information extraction from document images of unseen templates in the wild. It contains 25 key information categories, a total of about 69000 text boxes, and is about 2 times larger than the existing public datasets. Extensive experiments validate that all information including visual features, textual features and spatial relations can benefit key information extraction. It has been shown that SDMG-R can effectively extract key information from document images of unseen templates, and obtain new state-of-the-art results on the recent popular benchmark SROIE and our WildReceipt. Our code and dataset will be publicly released. 323 |
324 | 325 | * **[Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts](https://arxiv.org/abs/2003.02356)**, \[[data nda](https://github.com/applicaai/kleister-nda) ![](https://img.shields.io/github/stars/applicaai/kleister-nda.svg?style=social) \], \[[data charity](https://github.com/applicaai/kleister-charity) ![](https://img.shields.io/github/stars/applicaai/kleister-charity.svg?style=social) \] 326 |
327 | Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek ICDAR 2021 328 | Charity dataset size: Train(1 729), Dev(440), Test(609). NDA dataset size: Train(254), Dev(83), Test(203). Description: The relevance of Key Information Extraction (KIE) task is increasing in the natural language processing problems. But there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of born-digital and scanned long formal documents in English. In these datasets, an NLP system is expected to find or infer various types of entities by utilizing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, i.e. 61,643 unique pages with 21,612 entities to extract. The Kleister NDA dataset contains 540 Non-disclosure Agreements, i.e. 3,229 unique pages with 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved 81.77 % and 83.57 % F1-score on Kleister NDA and Kleister Charity datasets respectively. With this paper, we release our datasets to encourage progress on more in-depth and complex information extraction tasks. 329 |
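Kleister-style benchmarks are typically scored with entity-level F1 over predicted (key, value) pairs. The sketch below only illustrates the shape of that metric; it is not the official scorer, which additionally normalizes the extracted values before comparison.

```python
# Minimal sketch of entity-level F1 over (key, value) pairs. NOT the official
# Kleister evaluation script; it only illustrates how the metric is built.
def kie_f1(predicted: set, gold: set) -> dict:
    """predicted / gold: sets of (entity_key, entity_value) tuples."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy usage with illustrative entity keys.
gold = {("charity_name", "ACME Trust"), ("report_date", "2019-03-31")}
pred = {("charity_name", "ACME Trust"), ("report_date", "2019-04-01")}
print(kie_f1(pred, gold))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```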
330 | 331 | * **[SROIE](https://ieeexplore.ieee.org/document/8977955)**, \[[Website](https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3)\] 332 |
333 | Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar ICDAR 2019 334 | Dataset size: Train(600), Test(400). Abstract: The dataset contains 1000 whole scanned receipt images. Each receipt image contains about four key text fields, such as goods name, unit price and total cost, etc. The text annotated in the dataset mainly consists of digits and English characters. The dataset is split into a training/validation set (“trainval”) and a test set (“test”). The “trainval” set consists of 600 receipt images which are made available to the participants along with their annotations. The “test” set consists of 400 images. 335 |
336 | 337 | * **[CORD](https://openreview.net/forum?id=SJl3z659UH)**, \[[code/data](https://github.com/clovaai/cord) ![](https://img.shields.io/github/stars/clovaai/cord.svg?style=social) \] 338 |
339 | Park, Seunghyun and Shin, Seung and Lee, Bado and Lee, Junyeop and Surh, Jaeheung and Seo, Minjoon and Lee, Hwalsuk NeurIPS Workshop Document Intelligence 2019 340 | Dataset size: Train(800), Dev(100), Test(100). Abstract: OCR is inevitably linked to NLP since its final output is in text. Advances in document intelligence are driving the need for a unified technology that integrates OCR with various NLP tasks, especially semantic parsing. Since OCR and semantic parsing have been studied as separate tasks so far, the datasets for each task on their own are rich, while those for the integrated post-OCR parsing tasks are relatively insufficient. In this study, we publish a consolidated dataset for receipt parsing as the first step towards post-OCR parsing tasks. The dataset consists of thousands of Indonesian receipts, which contains images and box/text annotations for OCR, and multi-level semantic labels for parsing. The proposed dataset can be used to address various OCR and parsing tasks. 341 |
342 | 343 | * [FUNSD](https://arxiv.org/pdf/1905.13538.pdf), \[[Website](https://guillaumejaume.github.io/FUNSD/)\] 344 |
345 | Guillaume Jaume, Hazım Kemal Ekenel, Jean-Philippe Thiran ICDAR-OST 2019 346 | Dataset size: Train(149), Test(50). Abstract: We present a new dataset for form understanding in noisy scanned documents (FUNSD) that aims at extracting and structuring the textual content of forms. The dataset comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking. To the best of our knowledge, this is the first publicly available dataset with comprehensive annotations to address the FoUn task. We also present a set of baselines and introduce metrics to evaluate performance on the FUNSD dataset. 347 |
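FUNSD ships one JSON annotation file per form. The sketch below shows one way to read the entities; the field names follow the commonly documented format (a top-level `form` list with `text`, `box`, `label`, `words` and `linking`), so double-check them against the files you download, as this is not an official loader.

```python
# Minimal sketch for reading a FUNSD annotation file. Field names follow the
# commonly documented format; verify against the downloaded data.
import json
from pathlib import Path

def load_funsd_entities(json_path: str):
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    entities = []
    for block in data.get("form", []):
        entities.append({
            "text": block.get("text", ""),
            "label": block.get("label", "other"),  # question/answer/header/other
            "box": block.get("box"),               # [x0, y0, x1, y1]
            "links": block.get("linking", []),     # entity-linking pairs
        })
    return entities

# Usage (hypothetical path):
# for ent in load_funsd_entities("training_data/annotations/some_form.json"):
#     print(ent["label"], ent["text"])
```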
348 | 349 | * [NIST](https://s3.amazonaws.com/nist-srd/SD2/users_guide_sd2.pdf), \[[Website](https://www.nist.gov/srd/nist-special-database-2)\] 350 |
351 | Darren Dimmick, Michael Garris, Charles Wilson, Patricia Flanagan 352 | The documents in this database are 12 different tax forms from the IRS 1040 Package X for the year 1988. These include Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F, and SE. There are 900 simulated tax submissions (Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F, and SE). Suitable for both document processing and automated data capture research, development, and evaluation, the data set can be used for: a) forms identification, b) field isolation; locating the entry fields on the form, c) character segmentation: separating entry field values into characters, d) character recognition: identifying specific machine printed characters 353 |
354 | 355 | * [Deepform](https://github.com/jstray/deepform), \[[Website](https://wandb.ai/deepform/political-ad-extraction/benchmark)\] 356 |
357 | Jonathan Stray, Nicholas Bardy 358 | DeepForm aims to extract information from TV and cable political advertising disclosure forms using deep learning and provide a challenging journalism-relevant dataset for NLP/ML researchers. This public data is valuable to journalists but locked in PDFs. Through this benchmark, we hope to accelerate collaboration on the concrete task of making this data accessible and longer-term solutions for general information extraction from visually-structured documents in fields like medicine, climate science, social science, and beyond. 359 |
360 | 361 | #### Chinese 362 | 363 | * [(EPHOIE) Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution](https://arxiv.org/pdf/2102.06732.pdf), \[[code/data](https://github.com/HCIILAB/EPHOIE) ![](https://img.shields.io/github/stars/HCIILAB/EPHOIE.svg?style=social) \] 364 |
365 | Jiapeng Wang, Chongyu Liu, Lianwen Jin, Guozhi Tang, Jiaxin Zhang, Shuaitao Zhang, Qianying Wang, Yaqiang Wu, Mingxiang Cai AAAI 2021 366 | Visual information extraction (VIE) has attracted considerable attention recently owing to its various advanced applications such as document understanding, automatic marking and intelligent education. Most existing works decoupled this problem into several independent sub-tasks of text spotting (text detection and recognition) and information extraction, which completely ignored the high correlation among them during optimization. In this paper, we propose a robust visual information extraction system (VIES) towards real-world scenarios, which is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction by taking a single document image as input and outputting the structured information. Specifically, the information extraction branch collects abundant visual and semantic representations from text spotting for multimodal feature fusion and conversely, provides higher-level semantic clues to contribute to the optimization of text spotting. Moreover, regarding the shortage of public benchmarks, we construct a fully-annotated dataset called EPHOIE (this https URL), which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper head with complex layouts and background, including a total of 15,771 Chinese handwritten or printed text instances. Compared with the state-of-the-art methods, our VIES shows significant superior performance on the EPHOIE dataset and achieves a 9.01% F-score gain on the widely used SROIE dataset under the end-to-end scenario. 367 |
368 | 369 | * [Metaknowledge Extraction Based on Multi-Modal Documents](https://arxiv.org/pdf/2102.02971.pdf), \[[code/data](https://github.com/RuilinXu/GovDoc-CN) ![](https://img.shields.io/github/stars/RuilinXu/GovDoc-CN.svg?style=social) \] 370 |
371 | Shukan Liu, Ruilin Xu, Boying Geng, Qiao Sun, Li Duan, Yiming Liu IEEE Access 2021 372 | The triplet-based knowledge in large-scale knowledge bases is most likely lacking in structural logic and problematic of conducting knowledge hierarchy. In this paper, we introduce the concept of metaknowledge to knowledge engineering research for the purpose of structural knowledge construction. Therefore, the Metaknowledge Extraction Framework and Document Structure Tree model are presented to extract and organize metaknowledge elements (titles, authors, abstracts, sections, paragraphs, etc.), so that it is feasible to extract the structural knowledge from multi-modal documents. Experiment results have proved the effectiveness of metaknowledge elements extraction by our framework. Meanwhile, detailed examples are given to demonstrate what exactly metaknowledge is and how to generate it. At the end of this paper, we propose and analyze the task flow of metaknowledge applications and the associations between knowledge and metaknowledge. 373 |
374 | 375 | * [EATEN: Entity-aware Attention for Single Shot Visual Text Extraction](https://arxiv.org/abs/1909.09380), \[[data](https://drive.google.com/u/0/uc?id=1o8JktPD7bS74tfjz-8dVcZq_uFS6YEGh&export=download)\], \[[code](https://github.com/beacandler/EATEN) ![](https://img.shields.io/github/stars/beacandler/EATEN.svg?style=social)\] 376 |
377 | He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu and Errui Ding ICDAR 2019 378 | Abstract: Extracting entity from images is a crucial part of many OCR applications, such as entity recognition of cards, invoices, and receipts. Most of the existing works employ classical detection and recognition paradigm. This paper proposes an Entity-aware Attention Text Extraction Network called EATEN, which is an end-to-end trainable system to extract the entities without any post-processing. In the proposed framework, each entity is parsed by its corresponding entity-aware decoder, respectively. Moreover, we innovatively introduce a state transition mechanism which further improves the robustness of entity extraction. In consideration of the absence of public benchmarks, we construct a dataset of almost 0.6 million images in three real-world scenarios (train ticket, passport and business card), which is publicly available at this https URL. To the best of our knowledge, EATEN is the first single shot method to extract entities from images. Extensive experiments on these benchmarks demonstrate the state-of-the-art performance of EATEN. 379 |
380 | 381 | #### Polish 382 | 383 | * [Results of the PolEval 2020 Shared Task 4: Information Extraction from Long Documents with Complex Layouts](http://2020.poleval.pl/files/poleval2020.pdf), \[[Website](http://2020.poleval.pl/tasks/task4/)\] 384 |
385 | Filip Graliński, Anna Wróblewska Proceedings of the PolEval 2020 Workshop 2020 386 | Dataset size: Train(1 628), Dev(548), Test(555). Description: The challenge is about information acquisition and inference in the field of natural language processing. Collecting information from real, long documents must deal with complex page layouts by integrating found entities along multiple pages and text sections, tables, plots, forms, etc. To encourage progress in deeper and more complex information extraction, we present a dataset in which systems have to find the most important information about different types of entities from formal documents. These units are not only classes from the systems for recognising units with a standard name (NER) (e.g. person, location or organisation), but also the roles of units in whole documents (e.g. chairman of the board, date of issue). 387 |
388 | 389 | 390 | #### Multilanguage 391 | 392 | * **[XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding](https://aclanthology.org/2022.findings-acl.253/)** 393 |
394 | Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei Findings of ACL 2022 395 | Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually rich document understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. However, the existing research work has focused only on the English domain while neglecting the importance of multilingual generalization. In this paper, we introduce a human-annotated multilingual form understanding benchmark dataset named XFUND, which includes form understanding samples in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Meanwhile, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually rich document understanding. Experimental results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUND dataset. 396 |
397 | 398 | 399 | * [Ghega dataset](https://arxiv.org/abs/1906.02427), \[[Website](https://machinelearning.inginf.units.it/data-and-tools/ghega-dataset)\] 400 |
401 | Vishal Sunder, Ashwin Srinivasan, Lovekesh Vig, Gautam Shroff, Rohit Rahul arxiv 2019 402 | The dataset is composed as follows. It contains two groups of documents: 110 data-sheets of electronic components and 136 patents. Each group is further divided in classes: data-sheets classes share the component type and producer; patents classes share the patent source. 403 |
404 | 405 | 406 | ## Useful links 407 | 408 | Data Augmentation: 409 | 1. https://github.com/makcedward/nlpaug ![](https://img.shields.io/github/stars/makcedward/nlpaug.svg?style=social) 410 | 1. https://github.com/dsfsi/textaugment ![](https://img.shields.io/github/stars/dsfsi/textaugment.svg?style=social) 411 | 1. https://github.com/QData/TextAttack ![](https://img.shields.io/github/stars/QData/TextAttack.svg?style=social) 412 | 1. https://github.com/joke2k/faker ![](https://img.shields.io/github/stars/joke2k/faker.svg?style=social) 413 | 1. https://github.com/benkeen/generatedata ![](https://img.shields.io/github/stars/benkeen/generatedata.svg?style=social) 414 | 1. https://github.com/Belval/TextRecognitionDataGenerator ![](https://img.shields.io/github/stars/Belval/TextRecognitionDataGenerator.svg?style=social) 415 | 1. https://github.com/snorkel-team/snorkel ![](https://img.shields.io/github/stars/snorkel-team/snorkel.svg?style=social) 416 | 417 | Related NLP topics: 418 | 1. [Named Entity Recognition (NER)](https://github.com/sebastianruder/NLP-progress/blob/master/english/named_entity_recognition.md) 419 | 1. [Entity Linking (EL)](https://github.com/sebastianruder/NLP-progress/blob/master/english/entity_linking.md) 420 | 1. [Template extraction](https://github.com/prit2596/NLP-Template-Extraction) 421 | 1. [Noun Phrase Canonicalization](https://github.com/sebastianruder/NLP-progress/blob/master/english/information_extraction.md#noun-phrase-canonicalization) 422 | 423 | Others: 424 | 1. [cleanlab](https://github.com/cleanlab/cleanlab) ![](https://img.shields.io/github/stars/cleanlab/cleanlab.svg?style=social) - The standard package for machine learning with noisy labels and finding mislabeled data. Works with most datasets and models. 425 | 1. [CommonRegex](https://github.com/madisonmay/CommonRegex) ![](https://img.shields.io/github/stars/madisonmay/CommonRegex.svg?style=social) - find all times, dates, links, phone numbers, emails, ip addresses, prices, hex colors, and credit card numbers in a string 426 | 1. [Name Parser](https://github.com/derek73/python-nameparser) ![](https://img.shields.io/github/stars/derek73/python-nameparser.svg?style=social) - parsing human names into their individual components 427 | 1. [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) ![](https://img.shields.io/github/stars/WojciechMula/pyahocorasick.svg?style=social) - is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find multiple key strings occurrences at once in some input text 428 | 1. [deepmatcher](https://github.com/anhaidgroup/deepmatcher) ![](https://img.shields.io/github/stars/anhaidgroup/deepmatcher.svg?style=social) - performing entity and text matching using deep learning 429 | 1. [simstring](https://github.com/nullnull/simstring) ![](https://img.shields.io/github/stars/nullnull/simstring.svg?style=social) - A Python implementation of the [SimString](http://www.chokkan.org/software/simstring/index.html.en), a simple and efficient algorithm for approximate string matching. 430 | 1. [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) ![](https://img.shields.io/github/stars/maxbachmann/RapidFuzz.svg?style=social) - RapidFuzz is a fast string matching library for Python and C++, which is using the string similarity calculations from [FuzzyWuzzy](https://github.com/seatgeek/fuzzywuzzy) 431 | 1. 
[bootleg](https://github.com/HazyResearch/bootleg) ![](https://img.shields.io/github/stars/HazyResearch/bootleg.svg?style=social) - self-supervised named entity disambiguation (NED) system that links mentions in text to entities in a knowledge base 432 | 433 | -------------------------------------------------------------------------------- /topics/ocr/README.md: -------------------------------------------------------------------------------- 1 | ## Table of contents 2 | 3 | 1. [Benchmarks](#benchmarks) 4 | 1. [Papers](#papers) 5 | 1. [Datasets](#datasets) 6 | 1. [Useful links](#useful-links) 7 | 8 | 9 | ## Benchmarks 10 | 11 | 1. Best OCR by Text Extraction Accuracy in 2021, https://research.aimultiple.com/ocr-accuracy/ 12 | 1. Best OCR Software of 2021, https://nanonets.com/blog/ocr-software-best-ocr-software/ 13 | 1. Comparison of OCR tools: how to choose the best tool for your project, https://medium.com/dida-machine-learning/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project-bd21fb9dce6b 14 | 1. Our Search for the Best OCR Tool, and What We Found, 2019, https://source.opennews.org/articles/so-many-ocr-options/ (https://github.com/factful/ocr_testing) 15 | 16 | 17 | ## Papers 18 | 19 | * [DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding](https://arxiv.org/abs/2207.06695), \[[code/](https://github.com/hikopensource/Davar-Lab-OCR) \] 20 |
21 | Liang Qiao, Hui Jiang, Ying Chen, Can Li, Pengfei Li, Zaisheng Li, Baorui Zou, Dashan Guo, Yingda Xu, Yunlu Xu, Zhanzhan Cheng, Yi Niu ACM MM 2022 22 | This paper presents DavarOCR, an open-source toolbox for OCR and document understanding tasks. DavarOCR currently implements 19 advanced algorithms, covering 9 different task forms. DavarOCR provides detailed usage instructions and the trained models for each algorithm. Compared with the previous opensource OCR toolbox, DavarOCR has relatively more complete support for the sub-tasks of the cutting-edge technology of document understanding. In order to promote the development and application of OCR technology in academia and industry, we pay more attention to the use of modules that different sub-domains of technology can share. 23 |
24 | 25 | * [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282), \[[code/data](https://github.com/microsoft/unilm/tree/master/trocr) \] 26 |
27 | Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei arxiv 2021 28 | Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. 29 |
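TrOCR checkpoints are published on the Hugging Face Hub and can be run through the `transformers` VisionEncoderDecoder API. A minimal inference sketch is shown below; the checkpoint name and image path are placeholders to adapt (printed vs. handwritten, base vs. large).

```python
# Minimal TrOCR inference sketch using Hugging Face transformers. The
# checkpoint name and image path are placeholders - pick the model that
# matches your data (printed vs. handwritten).
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("text_line.png").convert("RGB")   # a single cropped text line
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```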
30 | 31 | * [Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents](https://arxiv.org/abs/2108.02899) 32 |
33 | Amit Gupte, Alexey Romanov, Sahitya Mantravadi, Dalitso Banda, Jianjie Liu, Raza Khan, Lakshmanan Ramu Meenal, Benjamin Han, Soundar Srinivasan Document Intelligence Workshop at KDD 2021 34 | Document digitization is essential for the digital transformation of our societies, yet a crucial step in the process, Optical Character Recognition (OCR), is still not perfect. Even commercial OCR systems can produce questionable output depending on the fidelity of the scanned documents. In this paper, we demonstrate an effective framework for mitigating OCR errors for any downstream NLP task, using Named Entity Recognition (NER) as an example. We first address the data scarcity problem for model training by constructing a document synthesis pipeline, generating realistic but degraded data with NER labels. We measure the NER accuracy drop at various degradation levels and show that a text restoration model, trained on the degraded data, significantly closes the NER accuracy gaps caused by OCR errors, including on an out-of-domain dataset. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project. 35 |
36 | 37 | * [Text Recognition in the Wild: A Survey](https://arxiv.org/pdf/2005.03492.pdf) 38 |
39 | Xiaoxue Chen, Lianwen Jin, Yuanzhi Zhu, Canjie Luo, T. Wang arxiv 2020 40 | The history of text can be traced back over thousands of years. Rich and precise semantic information carried by text is important in a wide range of vision-based application scenarios. Therefore, text recognition in natural scenes has been an active research field in computer vision and pattern recognition. In recent years, with the rise and development of deep learning, numerous methods have shown promise in terms of innovation, practicality, and efficiency. This paper aims to (1) summarize the fundamental problems and the state-of-the-art associated with scene text recognition; (2) introduce new insights and ideas; (3) provide a comprehensive review of publicly available resources; (4) point out directions for future work. In summary, this literature review attempts to present the entire picture of the field of scene text recognition. It provides a comprehensive reference for people entering this field, and could be helpful to inspire future research. Related resources are available at our Github repository: this https URL. 41 |
42 | 43 | 44 | ## Datasets 45 | 46 | 1. Total-Text [paper](http://cs-chan.com/doc/IJDAR2019.pdf) [repo](https://github.com/cs-chan/Total-Text-Dataset) - scene text detection dataset 47 | 1. [Synth90k](https://www.robots.ox.ac.uk/~vgg/data/text/#sec-synth) - popular dataset of single-word synthetic images (90k words, 9M images) 48 | 1. [SROIE](https://rrc.cvc.uab.es/?ch=13) - scanned receipts OCR and information extraction 49 | 1. [FUNSD](https://guillaumejaume.github.io/FUNSD/) - A dataset for Text Detection, Optical Character Recognition, Spatial Layout Analysis and Form Understanding 50 | 1. [RDCL2019](https://www.primaresearch.org/RDCL2019/) - ICDAR Competition on Recognition of Documents with Complex Layouts 51 | 1. [REID2019](https://www.primaresearch.org/REID2019/) - ICDAR Competition on Recognition of Early Indian printed Documents 52 | 1. [RETAS OCR EVALUATION DATASET](https://ciir.cs.umass.edu/downloads/ocr-evaluation/) - scanned books from Gutenberg project 53 | 54 | ## Useful links 55 | 56 | 1. **https://github.com/mindee/doctr - alternative for Tesseract project!** 57 | 2. https://mindee.com/ 58 | 3. https://github.com/open-mmlab/mmocr 59 | 4. https://github.com/Belval/TextRecognitionDataGenerator 60 | 5. http://tc11.cvc.uab.es/datasets/type/ 61 | 6. https://www.primaresearch.org/ 62 | 7. http://iapr-tc11.org/mediawiki/index.php?title=IAPR-TC11:Reading_Systems 63 | -------------------------------------------------------------------------------- /topics/related/README.md: -------------------------------------------------------------------------------- 1 | # Table of contents 2 | 3 | 1. [General](#general) 4 | 1. [Tabular Data Comprehension (TDC)](#tabular-data-comprehension) 5 | 1. [Robotic Process Automation (RPA)](#robotic-process-automation) 6 | 7 | 8 | ## General 9 | 10 | 11 | #### 2021 12 | 13 | * [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749) 14 |
15 | Curtis G. Northcutt, Anish Athalye, Jonas Mueller ICLR 2021 16 | We algorithmically identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.4% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set. Putative label errors are found using confident learning and then human-validated via crowdsourcing (54% of the algorithmically-flagged candidates are indeed erroneously labeled). Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by 5%. Traditionally, ML practitioners choose which model to deploy based on test accuracy -- our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. 17 |
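
Confident learning, the technique used above to flag candidate label errors, is implemented in the authors' cleanlab library; a minimal sketch (cleanlab 2.x API), assuming you already have out-of-sample predicted probabilities, e.g. from cross-validation, for the labeled set:

```python
# Flag likely label errors with confident learning via cleanlab
# (https://github.com/cleanlab/cleanlab). pred_probs must be out-of-sample
# predicted class probabilities, e.g. obtained with cross-validation.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 1, 2, 0])                 # given (possibly noisy) labels
pred_probs = np.array([[0.90, 0.05, 0.05],
                       [0.10, 0.80, 0.10],
                       [0.20, 0.10, 0.70],         # model disagrees -> candidate error
                       [0.10, 0.10, 0.80],
                       [0.85, 0.10, 0.05]])

issue_idx = find_label_issues(labels=labels, pred_probs=pred_probs,
                              return_indices_ranked_by="self_confidence")
print(issue_idx)  # indices of the examples most likely to be mislabeled
```
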
18 | 19 | * [Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers](https://arxiv.org/abs/2109.04448) 20 |
21 | Stella Frank, Emanuele Bugliarello, Desmond Elliott EMNLP 2021 22 | Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal. 23 |
24 | 25 | * [NT5?! Training T5 to Perform Numerical Reasoning](https://arxiv.org/pdf/2104.07307.pdf), \[[code](https://github.com/lesterpjy/numeric-t5) ![](https://img.shields.io/github/stars/lesterpjy/numeric-t5.svg?style=social)\] 26 |
27 | Peng-Jian Yang, Ying Ting Chen, Yuechan Chen, Daniel Cer arxiv 2021 28 | Numerical reasoning over text (NRoT) presents unique challenges that are not well addressed by existing pre-training objectives. We explore five sequential training schedules that adapt a pre-trained T5 model for NRoT. Our final model is adapted from T5, but further pre-trained on three datasets designed to strengthen skills necessary for NRoT and general reading comprehension before being fine-tuned on the Discrete Reasoning over Text (DROP) dataset. The training improves DROP's adjusted F1 performance (a numeracy-focused score) from 45.90 to 70.83. Our model closes in on GenBERT (72.4), a custom BERT-Base model using the same datasets with significantly more parameters. We show that by training the T5 multitasking framework with multiple numerical reasoning datasets of increasing difficulty, good performance on DROP can be achieved without manually engineering partitioned functionality between distributed and symbol modules. 29 |
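
As a rough sketch of the setup (not the NT5 checkpoint itself, which would need the paper's additional numeracy pre-training and DROP fine-tuning), a vanilla T5 model can be queried on a DROP-style prompt via Hugging Face Transformers; the checkpoint name and prompt format here are illustrative only:

```python
# Generic T5 inference sketch in the DROP style the paper targets.
# "t5-base" is a placeholder; the paper's gains come from its extra
# numeracy pre-training and DROP fine-tuning, which are not shown here.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

prompt = ("question: How many more yards was the longest field goal than the "
          "shortest? context: The kicker made field goals of 23, 37 and 51 yards.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
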
30 | 31 | * [Data Augmentations for Document Images](http://ceur-ws.org/Vol-2831/paper20.pdf) 32 |
33 | Yunsung Lee, Teakgyu Hong, Seungryong Kim SDU 2021 34 | Data augmentation has the potential to significantly improve the generalization capability of deep neural networks. Especially in image recognition, recent augmentation techniques such as Mixup, CutOut, CutMix, and RandAugment have shown great performance improvement. These augmentation techniques have also shown effectiveness in semi-supervised learning or self-supervised learning. Despite these effects and their usefulness, these techniques cannot be applied directly to document image analysis, which requires preserving text semantic features. To tackle this problem, we propose novel augmentation methods, DocCutout and DocCutMix, that are more suitable for document images, by applying the transform to each word unit and thus preserving text semantic features during augmentation. We conduct intensive experiments to find the most effective data augmentation techniques among various approaches for document object detection and show our proposed augmentation methods outperform the state of the art by +1.77 AP on the PubMed dataset. 35 |
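
The word-level idea behind DocCutout can be pictured with a few lines of NumPy (an illustration only, not the paper's implementation): erase a random patch inside each word's bounding box rather than an arbitrary image region, so neighbouring words stay readable:

```python
# Word-level cutout sketch inspired by DocCutout (illustrative only).
# `words` are integer pixel boxes (x0, y0, x1, y1), e.g. from an OCR pass.
import random
import numpy as np

def doc_cutout(image: np.ndarray, words, frac: float = 0.3) -> np.ndarray:
    out = image.copy()
    for (x0, y0, x1, y1) in words:
        w, h = x1 - x0, y1 - y0
        cw, ch = max(1, int(w * frac)), max(1, int(h * frac))
        cx = random.randint(x0, max(x0, x1 - cw))
        cy = random.randint(y0, max(y0, y1 - ch))
        out[cy:cy + ch, cx:cx + cw] = 255   # erase a patch inside the word only
    return out
```
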
36 | 37 | * [Variational Transformer Networks for Layout Generation](https://arxiv.org/pdf/2104.02416.pdf) 38 |
39 | Diego Martin Arroyo, Janis Postels, Federico Tombari CVPR 2021 40 | Generative models able to synthesize layouts of different kinds (e.g. documents, user interfaces or furniture arrangements) are a useful tool to aid design processes and as a first step in the generation of synthetic data, among other tasks. We exploit the properties of self-attention layers to capture high level relationships between elements in a layout, and use these as the building blocks of the well-known Variational Autoencoder (VAE) formulation. Our proposed Variational Transformer Network (VTN) is capable of learning margins, alignments and other global design rules without explicit supervision. Layouts sampled from our model have a high degree of resemblance to the training data, while demonstrating appealing diversity. In an extensive evaluation on publicly available benchmarks for different layout types VTNs achieve state-of-the-art diversity and perceptual quality. Additionally, we show the capabilities of this method as part of a document layout detection pipeline. 41 |
42 | 43 | * [GRIT: Generative Role-filler Transformers for Document-level Event Entity Extraction](https://arxiv.org/pdf/2008.09249.pdf) 44 |
45 | Xinya Du, Alexander M. Rush, Claire Cardie EACL 2021 46 | We revisit the classic problem of document-level role-filler entity extraction (REE) for template filling. We argue that sentence-level approaches are ill-suited to the task and introduce a generative transformer-based encoder-decoder framework (GRIT) that is designed to model context at the document level: it can make extraction decisions across sentence boundaries; is implicitly aware of noun phrase coreference structure, and has the capacity to respect cross-role dependencies in the template structure. We evaluate our approach on the MUC-4 dataset, and show that our model performs substantially better than prior work. We also show that our modeling choices contribute to model performance, e.g., by implicitly capturing linguistic knowledge such as recognizing coreferent entity mentions. 47 |
48 | 49 | 50 | #### 2020 51 | 52 | * [Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web](https://www.aclweb.org/anthology/2020.acl-tutorials.6.pdf) 53 |
54 | Xin Luna Dong, Hannaneh Hajishirzi, Colin Lockard, Prashant Shiralkar ACL Tutorials 2020 55 | The World Wide Web contains vast quantities of textual information in several forms: unstructured text, template-based semi-structured webpages (which present data in key-value pairs and lists), and tables. Methods for extracting information from these sources and converting it to a structured form have been a target of research from the natural language processing (NLP), data mining, and database communities. While these researchers have largely separated extraction from web data into different problems based on the modality of the data, they have faced similar problems such as learning with limited labeled data, defining (or avoiding defining) ontologies, making use of prior knowledge, and scaling solutions to deal with the size of the Web. In this tutorial we take a holistic view toward information extraction, exploring the commonalities in the challenges and solutions developed to address these different forms of text. We will explore the approaches targeted at unstructured text that largely rely on learning syntactic or semantic textual patterns, approaches targeted at semi-structured documents that learn to identify structural patterns in the template, and approaches targeting web tables which rely heavily on entity linking and type information. While these different data modalities have largely been considered separately in the past, recent research has started taking a more inclusive approach toward textual extraction, in which the multiple signals offered by textual, layout, and visual clues are combined into a single extraction model made possible by new deep learning approaches. At the same time, trends within purely textual extraction have shifted toward full-document understanding rather than considering sentences as independent units. With this in mind, it is worth considering the information extraction problem as a whole to motivate solutions that harness textual semantics along with visual and semi-structured layout information. We will discuss these approaches and suggest avenues for future work. 56 |
57 | 58 | * [Layout-Aware Text Representations Harm Clustering Documents by Type](https://pdfs.semanticscholar.org/6e3f/adce5f4bea362cf0ca0165c300cec3afe042.pdf) 59 |
60 | Catherine Finegan-Dollak, Ashish Verma Insights 2020 61 | Clustering documents by type—grouping invoices with invoices and articles with articles—is a desirable first step for organizing large collections of document scans. Humans approaching this task use both the semantics of the text and the document layout to assist in grouping like documents. LayoutLM (Xu et al., 2019), a layout-aware transformer built on top of BERT with state-of-the-art performance on document-type classification, could reasonably be expected to outperform regular BERT (Devlin et al., 2018) for document-type clustering. However, we find experimentally that BERT significantly outperforms LayoutLM on this task (p <0.001). We analyze clusters to show where layout awareness is an asset and where it is a liability. 62 |
63 | 64 | * [Self-Supervised Representation Learning on Document Images](https://arxiv.org/pdf/2004.10605.pdf) 65 |
66 | Adrian Cosma, Mihai Ghidoveanu, Michael Panaitescu-Liess, Marius Popescu DAS 2020 67 | This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information. We propose two context-aware alternatives to improve performance on the Tobacco-3482 image classification task. We also propose a novel method for self-supervision, which makes use of the inherent multi-modality of documents (image and text), which performs better than other popular self-supervised methods, including supervised ImageNet pre-training, on document image classification scenarios with a limited amount of data. 68 |
69 | 70 | #### Older 71 | 72 | * **[Fonduer: Knowledge Base Construction from Richly Formatted Data](https://arxiv.org/pdf/1703.05028.pdf)**, \[[code](https://github.com/HazyResearch/fonduer) ![](https://img.shields.io/github/stars/HazyResearch/fonduer.svg?style=social)\] 73 |
74 | Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, Christopher Ré International Conference on Management of Data 2018 75 | We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base---and in some cases produces up to 1.87x the number of correct entries---compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches. 76 |
77 | 78 | * [A Benchmark and Evaluation for Text Extraction from PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7991564) 79 |
80 | Hannah Bast, Claudius Korzen JCDL 2017 81 | Extracting the body text from a PDF document is an important but surprisingly difficult task. The reason is that PDF is a layout-based format which specifies the fonts and positions of the individual characters rather than the semantic units of the text (e.g., words or paragraphs) and their role in the document (e.g., body text or caption). There is an abundance of extraction tools, but their quality and the range of their functionality are hard to determine. In this paper, we show how to construct a high-quality benchmark of principally arbitrary size from parallel TeX and PDF data. We construct such a benchmark of 12,098 scientific articles from arXiv.org and make it publicly available. We establish a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool. We provide an extensive evaluation of 14 state-of-the-art tools for text extraction from PDF on our benchmark according to our criteria. We include our own method, Icecite, which significantly outperforms all other tools, but is still not perfect. We outline the remaining steps necessary to finally make text extraction from PDF a "solved problem". 82 |
83 | 84 | * [Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval](https://arxiv.org/pdf/1502.07058.pdf) 85 |
86 | Adam W. Harley, Alex Ufkes, Konstantinos G. Derpanis ICDAR 2015 87 | This paper presents a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analysis, and confirms that this representation strategy is superior to a variety of popular hand-crafted alternatives. Experiments also show that (i) features extracted from CNNs are robust to compression, (ii) CNNs trained on non-document images transfer well to document analysis tasks, and (iii) enforcing region-specific feature-learning is unnecessary given sufficient training data. This work also makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories, useful for training new CNNs for document analysis. 88 |
89 | 90 | 91 | ## Tabular Data Comprehension 92 | 93 | [Back to top](#table-of-contents) 94 | 95 | ### Papers 96 | 97 | #### 2021 98 | 99 | * [Open Domain Question Answering over Tables via Dense Retrieval](https://arxiv.org/pdf/2103.12011.pdf), \[[code](https://github.com/google-research/tapas) ![](https://img.shields.io/github/stars/google-research/tapas.svg?style=social)\] 100 |
101 | Jonathan Herzig, Thomas Müller, Syrine Krichene, Julian Martin Eisenschlos NAACL 2021 102 | Recent advances in open-domain QA have led to strong models based on dense retrieval, but only focused on retrieving textual passages. In this work, we tackle open-domain QA over tables for the first time, and show that retrieval can be improved by a retriever designed to handle tabular context. We present an effective pre-training procedure for our retriever and improve retrieval quality with mined hard negatives. As relevant datasets are missing, we extract a subset of NATURAL QUESTIONS (Kwiatkowski et al., 2019) into a Table QA dataset. We find that our retriever improves retrieval results from 72.0 to 81.1 recall@10 and end-to-end QA results from 33.8 to 37.7 exact match, over a BERT-based retriever. 103 |
104 | #### 2020 105 | 106 | * **[TURL: Table Understanding through Representation Learning](https://arxiv.org/pdf/2006.14806.pdf)**, \[[code](https://github.com/sunlab-osu/TURL) ![](https://img.shields.io/github/stars/sunlab-osu/TURL.svg?style=social)\] 107 |
108 | Xiang Deng, Huan Sun, Alyssa Lees, You Wu, Cong Yu VLDB 2021 109 | Relational tables on the Web store a vast amount of knowledge. Owing to the wealth of such tables, there has been tremendous progress on a variety of tasks in the area of table understanding. However, existing work generally relies on heavily-engineered task-specific features and model architectures. In this paper, we present TURL, a novel framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. Its universal model design with pre-trained representations can be applied to a wide range of tasks with minimal task-specific fine-tuning. Specifically, we propose a structure-aware Transformer encoder to model the row-column structure of relational tables, and present a new Masked Entity Recovery (MER) objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data. We systematically evaluate TURL with a benchmark consisting of 6 different tasks for table understanding (e.g., relation extraction, cell filling). We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances. 110 |
111 | 112 | * **[TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/pdf/2004.02349.pdf)**, \[[code](https://github.com/google-research/tapas) ![](https://img.shields.io/github/stars/google-research/tapas.svg?style=social)\] 113 |
114 | Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Martin Eisenschlos ACL 2020 115 | Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art. 116 |
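
TAPAS is available in Hugging Face Transformers; a minimal table-QA sketch using the WTQ-fine-tuned checkpoint published on the Hub (requires `pandas` and `torch`, and on older transformers versions also `torch-scatter`; the toy table is made up):

```python
# Table question answering with TAPAS via Hugging Face Transformers.
import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

model_name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name)

# TAPAS expects all table cells as strings.
table = pd.DataFrame({"City": ["Paris", "Warsaw"], "Population": ["2.1M", "1.8M"]})
inputs = tokenizer(table=table, queries=["Which city is larger?"],
                   padding="max_length", return_tensors="pt")
outputs = model(**inputs)
coords, agg = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach())
print(coords, agg)  # predicted cell coordinates and aggregation operator per query
```
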
117 | 118 | * **[TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data](https://arxiv.org/pdf/2005.08314.pdf)**, \[[code](https://github.com/facebookresearch/TaBERT) ![](https://img.shields.io/github/stars/facebookresearch/TaBERT.svg?style=social)\] 119 |
120 | Pengcheng Yin, Graham Neubig, Wen-tau Yih, Sebastian Riedel ACL 2020 121 | Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TaBERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider. 122 |
123 | 124 | * [Web Table Extraction, Retrieval and Augmentation: A Survey](https://arxiv.org/pdf/2002.00207.pdf) 125 |
126 | Shuo Zhang, Krisztian Balog ACM Transactions on Intelligent Systems and Technology 2020 127 | Tables are a powerful and popular tool for organizing and manipulating data. A vast number of tables can be found on the Web, which represents a valuable knowledge resource. The objective of this survey is to synthesize and present two decades of research on web tables. In particular, we organize existing literature into six main categories of information access tasks: table extraction, table interpretation, table search, question answering, knowledge base augmentation, and table augmentation. For each of these tasks, we identify and describe seminal approaches, present relevant resources, and point out interdependencies among the different tasks. 128 |
129 | 130 | * [Structure-aware Pre-training for Table Understanding with Tree-based Transformers](https://arxiv.org/pdf/2010.12537.pdf) 131 |
132 | Zhiruo Wang et al. arXiv 2020 133 | Tables are widely used with various structures to organize and present data. Recent attempts at table understanding mainly focus on relational tables, yet overlook other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Since understanding a table needs to leverage spatial, hierarchical, and semantic information, we adapt the self-attention strategy with several key structure-aware mechanisms. First, we propose a novel tree-based structure called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information in tables. Upon this, we extend the pre-training architecture with two core mechanisms, namely the tree-based attention and tree-based position embedding. Moreover, to capture table information in a progressive manner, we devise three pre-training objectives to enable representations at the token, cell, and table levels. TUTA pre-trains on a wide range of unlabeled tables and fine-tunes on a critical task in the field of table structure understanding, i.e. cell type classification. Experiment results show that TUTA is highly effective, achieving state-of-the-art on four well-annotated cell type classification datasets. 134 |
135 | 136 | #### 2019 137 | 138 | * [Auto-completion for Data Cells in Relational Tables](https://arxiv.org/pdf/1909.03443.pdf) 139 |
140 | Shuo Zhang, Krisztian Balog CIKM 2019 141 | We address the task of auto-completing data cells in relational tables. Such tables describe entities (in rows) with their attributes (in columns). We present the CellAutoComplete framework to tackle several novel aspects of this problem, including: (i) enabling a cell to have multiple, possibly conflicting values, (ii) supplementing the predicted values with supporting evidence, (iii) combining evidence from multiple sources, and (iv) handling the case where a cell should be left empty. Our framework makes use of a large table corpus and a knowledge base as data sources, and consists of preprocessing, candidate value finding, and value ranking components. Using a purpose-built test collection, we show that our approach is 40% more effective than the best baseline. 142 |
143 | 144 | 145 | * [Learning Semantic Annotations for Tabular Data](https://arxiv.org/pdf/1906.00781.pdf), \[[code](https://github.com/alan-turing-institute/SemAIDA) ![](https://img.shields.io/github/stars/alan-turing-institute/SemAIDA.svg?style=social)\] 146 |
147 | Jiaoyan Chen et al. IJCAI 2019 148 | The usefulness of tabular data such as web tables critically depends on understanding their semantics. This study focuses on column type prediction for tables without any metadata. Unlike traditional lexical matching-based methods, we propose a deep prediction model that can fully exploit a table's contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN), and inter-column semantics features learned by a knowledge base (KB) lookup and query answering. It exhibits good performance not only on individual table sets, but also when transferring from one table set to another. 149 |
150 | 151 | * [ColNet: Embedding the Semantics of Web Tables for Column Type Prediction](https://arxiv.org/pdf/1811.01304.pdf), \[[code](https://github.com/alan-turing-institute/SemAIDA) ![](https://img.shields.io/github/stars/alan-turing-institute/SemAIDA.svg?style=social)\] 152 |
153 | Jiaoyan Chen et al. AAAI 2019 154 | Automatically annotating column types with knowledge base (KB) concepts is a critical task to gain a basic understanding of web tables. Current methods rely on either table metadata like column name or entity correspondences of cells in the KB, and may fail to deal with growing web tables with incomplete meta information. In this paper we propose a neural network based column type annotation framework named ColNet which is able to integrate KB reasoning and lookup with machine learning and can automatically train Convolutional Neural Networks for prediction. The prediction model not only considers the contextual semantics within a cell using word representation, but also embeds the semantics of a column by learning locality features from multiple cells. The method is evaluated with DBPedia and two different web table datasets, T2Dv2 from the general Web and Limaye from Wikipedia pages, and achieves higher performance than the state-of-the-art approaches. 155 |
156 | 157 | #### Older 158 | 159 | * [EntiTables: Smart Assistance for Entity-Focused Tables](https://arxiv.org/pdf/1708.08721.pdf), \[[code](https://github.com/iai-group/sigir2017-table) ![](https://img.shields.io/github/stars/iai-group/sigir2017-table.svg?style=social)\] 160 |
161 | Shuo Zhang, Krisztian Balog SIGIR 2017 162 | Tables are among the most powerful and practical tools for organizing and working with data. Our motivation is to equip spreadsheet programs with smart assistance capabilities. We concentrate on one particular family of tables, namely, tables with an entity focus. We introduce and focus on two specific tasks: populating rows with additional instances (entities) and populating columns with new headings. We develop generative probabilistic models for both tasks. For estimating the components of these models, we consider a knowledge base as well as a large table corpus. Our experimental evaluation simulates the various stages of the user entering content into an actual table. A detailed analysis of the results shows that the models' components are complementary and that our methods outperform existing approaches from the literature. 163 |
164 | 165 | ### Datasets 166 | 167 | #### Information retrieval from tables 168 | 169 | * [TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance](https://arxiv.org/abs/2105.07624), \[[code/data](https://github.com/NExTplusplus/TAT-QA) ![](https://img.shields.io/github/stars/NExTplusplus/TAT-QA.svg?style=social)\] 170 |
171 | Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, Tat-Seng Chua ACL 2021 172 | Hybrid data combining both tabular and textual content (e.g., financial reports) are quite pervasive in the real world. However, Question Answering (QA) over such hybrid data is largely neglected in existing research. In this work, we extract samples from real financial reports to build a new large-scale QA dataset containing both Tabular And Textual data, named TAT-QA, where numerical reasoning is usually required to infer the answer, such as addition, subtraction, multiplication, division, counting, comparison/sorting, and the compositions. We further propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text. It adopts sequence tagging to extract relevant cells from the table along with relevant spans from the text to infer their semantics, and then applies symbolic reasoning over them with a set of aggregation operators to arrive at the final answer. TAGOP achieves 58.0% in F1, which is an 11.1% absolute increase over the previous best baseline model, according to our experiments on TAT-QA. But this result still lags far behind expert human performance, i.e., 90.8% in F1. It is demonstrated that our TAT-QA is very challenging and can serve as a benchmark for training and testing powerful QA models that address hybrid form data. 173 |
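
The symbolic step described above, applying a predicted aggregation operator to the extracted numbers, can be pictured with a toy dispatcher (illustration only, not the TAGOP code; the operator names are simplified):

```python
# Toy TAGOP-style aggregation over numbers extracted from table cells / text spans
# (illustration only; the real model predicts both the spans and the operator).
OPS = {
    "sum":        lambda xs: sum(xs),
    "difference": lambda xs: xs[0] - xs[1],
    "average":    lambda xs: sum(xs) / len(xs),
    "count":      lambda xs: len(xs),
}

def aggregate(op, extracted_numbers):
    return OPS[op](extracted_numbers)

print(aggregate("difference", [120.5, 95.0]))  # e.g. revenue change between two years
```
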
174 | 175 | * [Open Question Answering over Tables and Text](https://arxiv.org/pdf/2010.10439.pdf), \[[code](https://github.com/wenhuchen/OTT-QA) ![](https://img.shields.io/github/stars/wenhuchen/OTT-QA.svg?style=social)\] 176 |
177 | Wenhu Chen et al. ICLR 2021 178 | In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question. Most open QA systems have considered only retrieving information from unstructured text. Here we consider for the first time open QA over both tabular and textual data and present a new large-scale dataset Open Table-Text Question Answering (OTT-QA) to evaluate performance on this task. Most questions in OTT-QA require multi-hop inference across tabular data and unstructured text, and the evidence required to answer a question can be distributed in different ways over these two types of input, making evidence retrieval challenging---our baseline model using an iterative retriever and BERT-based reader achieves an exact match score less than 10%. We then propose two novel techniques to address the challenge of retrieving and aggregating evidence for OTT-QA. The first technique is to use "early fusion" to group multiple highly relevant tabular and textual units into a fused block, which provides more context for the retriever to search for. The second technique is to use a cross-block reader to model the cross-dependency between multiple retrieved evidences with global-local sparse attention. Combining these two techniques improves the score significantly, to above 27%. 179 |
180 | 181 | * [FeTaQA: Free-form Table Question Answering](https://arxiv.org/pdf/2104.00369.pdf), \[[code](https://github.com/Yale-LILY/FeTaQA) ![](https://img.shields.io/github/stars/Yale-LILY/FeTaQA.svg?style=social)\] 182 |
183 | Linyong Nan et al. arXiv 2021 184 | Existing table question answering datasets contain abundant factual questions that primarily evaluate the query and schema comprehension capability of a system, but they fail to include questions that require complex reasoning and integration of information due to the constraint of the associated short-form answers. To address these issues and to demonstrate the full challenge of table question answering, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source. Unlike datasets of generative QA over text in which answers are prevalent with copies of short text spans from the source, answers in our dataset are human-generated explanations involving entities and their high-level relations. We provide two benchmark methods for the proposed task: a pipeline method based on semantic-parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods. 185 |
186 | 187 | * [INFOTABS: Inference on Tables as Semi-structured Data](https://arxiv.org/abs/2005.06117), \[[webpage/code/data](https://infotabs.github.io/)\] 188 |
189 | Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, Vivek Srikumar ACL 2020 190 | In this paper, we observe that semi-structured tabulated text is ubiquitous; understanding them requires not only comprehending the meaning of text fragments, but also implicit relationships between them. We argue that such data can prove as a testing ground for understanding how we reason about information. To study this, we introduce a new dataset called INFOTABS, comprising of human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes. Our analysis shows that the semi-structured, multi-domain and heterogeneous nature of the premises admits complex, multi-faceted reasoning. Experiments reveal that, while human annotators agree on the relationships between a table-hypothesis pair, several standard modeling strategies are unsuccessful at the task, suggesting that reasoning about tables can pose a difficult modeling challenge. 191 |
192 | 193 | * [TabFact: A Large-scale Dataset for Table-based Fact Verification](https://openreview.net/pdf?id=rkeJRhNYDH), \[[code](https://github.com/wenhuchen/Table-Fact-Checking) ![](https://img.shields.io/github/stars/wenhuchen/Table-Fact-Checking.svg?style=social)\] 194 |
195 | Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang,Shiyang Li, Xiyou Zhou, William Yang Wang ICLR 2020 196 | The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc), while verification under structured evidence, such as tables, graphs, and databases, remains unexplored. This paper specifically aims to study the fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into LISP-like programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. 197 |
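
Table-BERT's first step, linearizing the table so a pre-trained language model can read it together with the statement, can be sketched in a few lines (a simplified horizontal linearization; the exact template used in the paper may differ):

```python
# Simplified horizontal table linearization in the spirit of Table-BERT
# (illustration only; the paper's exact template may differ).
import pandas as pd

def linearize(table: pd.DataFrame) -> str:
    rows = []
    for i, row in table.iterrows():
        cells = "; ".join(f"{col} is {val}" for col, val in row.items())
        rows.append(f"row {i + 1}: {cells}.")
    return " ".join(rows)

table = pd.DataFrame({"team": ["A", "B"], "wins": [10, 7]})
statement = "team A has more wins than team B."
# The linearized table and the statement are then paired as input to a
# BERT-style ENTAILED / REFUTED classifier.
print(linearize(table) + " [SEP] " + statement)
```
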
198 | 199 | * [Search-based Neural Structured Learning for Sequential Question Answering](https://www.aclweb.org/anthology/P17-1167.pdf), \[[Github](https://github.com/microsoft/DynSP) ![](https://img.shields.io/github/stars/microsoft/DynSP.svg?style=social)\], \[[page](https://www.microsoft.com/en-us/download/details.aspx?id=54253)\] 200 |
201 | Mohit Iyyer, Wen-tau Yih, Ming-Wei Chang ACL 2017 202 | Recent work in semantic parsing for question answering has focused on long and complicated questions, many of which would seem unnatural if asked in a normal conversation between two humans. In an effort to explore a conversational QA setting, we present a more realistic task: answering sequences of simple but inter-related questions. We collect a dataset of 6,066 question sequences that inquire about semi-structured tables from Wikipedia, with 17,553 question-answer pairs in total. To solve this sequential question answering task, we propose a novel dynamic neural semantic parsing framework trained using a weakly supervised reward-guided search. Our model effectively leverages the sequential context to outperform state-of-the-art QA systems that are designed to answer highly complex questions. 203 |
204 | 205 | * **[Compositional Semantic Parsing on Semi-Structured Tables](https://www.aclweb.org/anthology/P15-1142.pdf)**, \[[page](https://ppasupat.github.io/WikiTableQuestions/)\] 206 |
207 | Panupong Pasupat, Percy Liang ACL 2015 208 | Two important aspects of semantic parsing for question answering are the breadth of the knowledge source and the depth of logical compositionality. While existing work trades off one aspect for another, this paper simultaneously makes progress on both fronts through a new task: answering complex questions on semi-structured tables using question-answer pairs as supervision. The central challenge arises from two compounding factors: the broader domain results in an open-ended set of relations, and the deeper compositionality results in a combinatorial explosion in the space of logical forms. We propose a logical-form driven parsing algorithm guided by strong typing constraints and show that it obtains significant improvements over natural baselines. For evaluation, we created a new dataset of 22,033 complex questions on Wikipedia tables, which is made publicly available. 209 |
210 | 211 | #### Collections of unannotated tables 212 | 213 | * **[A Large Public Corpus of Web Tables containing Time and Context Metadata](http://gdac.uqam.ca/WWW2016-Proceedings/companion/p75.pdf)**, \[[page](http://webdatacommons.org/webtables/)\] 214 |
215 | Oliver Lehmberg et al. WWW 2016 216 | The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a small subset of the tables is relational, meaning that they contain structured data describing a set of entities [2]. As these relational Web tables cover a very wide range of different topics, there is a growing body of research investigating the utility of Web table data for completing cross-domain knowledge bases [6], for extending arbitrary tables with additional attributes [7, 4], as well as for translating data values [5]. The existing research shows the potentials of Web tables. However, comparing the performance of the different systems is difficult as up till now each system is evaluated using a different corpus of Web tables and as most of the corpora are owned by large search engine companies and are thus not accessible to the public. In this poster, we present a large public corpus of Web tables which contains over 233 million tables and has been extracted from the July 2015 version of the CommonCrawl. By publishing the corpus as well as all tools that we used to extract it from the crawled data, we intend to provide a common ground for evaluating Web table systems. The main difference of the corpus compared to an earlier corpus that we extracted from the 2012 version of the CommonCrawl as well as the corpus extracted by Eberius et al. [3] from the 2014 version of the CommonCrawl is that the current corpus contains a richer set of metadata for each table. This metadata includes table-specific information such as table orientation, table caption, header row, and key column, but also context information such as the text before and after the table, the title of the HTML page, as well as timestamp information that was found before and after the table. The context information can be useful for recovering the semantics of a table [7]. The timestamp information is crucial for fusing time-depended data, such as alternative population numbers for a city [8]. 217 |
218 | 219 | * [Top-k entity augmentation using consistent set covering](https://wwwdb.inf.tu-dresden.de/misc/publications/rea.pdf), \[[page](https://wwwdb.inf.tu-dresden.de/misc/dwtc/)\] 220 |
221 | Julian Eberius et al. SSDBM 2015 222 | Entity augmentation is a query type in which, given a set of entities and a large corpus of possible data sources, the values of a missing attribute are to be retrieved. State of the art methods return a single result that, to cover all queried entities, is fused from a potentially large set of data sources. We argue that queries on large corpora of heterogeneous sources using information retrieval and automatic schema matching methods can not easily return a single result that the user can trust, especially if the result is composed from a large number of sources that user has to verify manually. We therefore propose to process these queries in a Top-k fashion, in which the system produces multiple minimal consistent solutions from which the user can choose to resolve the uncertainty of the data sources and methods used. In this paper, we introduce and formalize the problem of consistent, multi-solution set covering, and present algorithms based on a greedy and a genetic optimization approach. We then apply these algorithms to Web table-based entity augmentation. The publication further includes a Web table corpus with 100M tables, and a Web table retrieval and matching system in which these algorithms are implemented. Our experiments show that the consistency and minimality of the augmentation results can be improved using our set covering approach, without loss of precision or coverage and while producing multiple alternative query results. 223 |
224 | 225 | * [Methods for exploring and mining tables on Wikipedia](https://www.researchgate.net/publication/261849268_Methods_for_exploring_and_mining_tables_on_Wikipedia), \[[page](https://downey-n1.cs.northwestern.edu/public/)\] 226 |
227 | Chandra Bhagavatula, Thanapon Noraset, Doug Downey ACM SIGKDD 2013 228 | Knowledge bases extracted automatically from the Web present new opportunities for data mining and exploration. Given a large, heterogeneous set of extracted relations, new tools are needed for searching the knowledge and uncovering relationships of interest. We present WikiTables, a Web application that enables users to interactively explore tabular knowledge extracted from Wikipedia. In experiments, we show that WikiTables substantially outperforms baselines on the novel task of automatically joining together disparate tables to uncover "interesting" relationships between table columns. We find that a "Semantic Relatedness" measure that leverages the Wikipedia link structure accounts for a majority of this improvement. Further, on the task of keyword search for tables, we show that WikiTables performs comparably to Google Fusion Tables despite using an order of magnitude fewer tables. Our work also includes the release of a number of public resources, including over 15 million tuples of extracted tabular data, manually annotated evaluation sets, and public APIs. 229 |
230 | 231 | 232 | ## Robotic Process Automation 233 | 234 | [Back to top](#table-of-contents) 235 | 236 | * [Towards Quantifying the Effects of Robotic Process Automation](http://dbis.eprints.uni-ulm.de/1959/1/fopas_wew_2020a.pdf) 237 |
238 | Judith Wewerka, Manfred Reichert EDOCW 2020 239 | Robotic Process Automation (RPA) is the automation of rule-based routine processes to increase process efficiency and to reduce process costs. In practice, however, RPA is often applied without knowledge of the concrete effects its introduction will have on the automated process and the involved stakeholders. Accordingly, literature on the quantitative effects of RPA is scarce. The objective of this paper is to provide empirical insights into improvements and deteriorations of business processes achieved in twelve RPA projects in the automotive industry. The results indicate that the positive benefits promised in literature are not always achieved in practice. In particular, shorter case duration and better quality are not confirmed by the empirical data gathered in the considered RPA projects. These quantitative insights constitute a valuable contribution to the currently rather qualitative literature on RPA. 240 |
241 | 242 | * [Automated Discovery of Data Transformations for Robotic Process Automation](https://arxiv.org/pdf/2001.01007.pdf) 243 |
244 | Volodymyr Leno, Marlon Dumas, Marcello La Rosa, Fabrizio Maria Maggi, Artem Polyvyanyy AAAI-20 workshop on IPA 2020 245 | Robotic Process Automation (RPA) is a technology for automating repetitive routines consisting of sequences of user interactions with one or more applications. In order to fully exploit the opportunities opened by RPA, companies need to discover which specific routines may be automated, and how. In this setting, this paper addresses the problem of analyzing User Interaction (UI) logs in order to discover routines where a user transfers data from one spreadsheet or (Web) form to another. The paper maps this problem to that of discovering data transformations by example - a problem for which several techniques are available. The paper shows that a naive application of a state-of-the-art technique for data transformation discovery is computationally inefficient. Accordingly, the paper proposes two optimizations that take advantage of the information in the UI log and the fact that data transfers across applications typically involve copying alphabetic and numeric tokens separately. The proposed approach and its optimizations are evaluated using UI logs that replicate a real-life repetitive data transfer routine. 246 |
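
The optimization mentioned above, copying alphabetic and numeric tokens separately, rests on a simple tokenization of logged field values; a toy sketch (not the authors' implementation, and the example log values are made up):

```python
# Split a UI-log field value into alphabetic and numeric tokens, the granularity
# at which candidate copy transformations are discovered (toy sketch only).
import re

def split_tokens(value: str):
    alpha = re.findall(r"[A-Za-z]+", value)
    numeric = re.findall(r"\d+", value)
    return alpha, numeric

src = "Invoice INV-2024/0042 from ACME"   # value typed in the source form
dst = "0042"                              # value pasted into the target form
alpha, numeric = split_tokens(src)
print(alpha)            # ['Invoice', 'INV', 'from', 'ACME']
print(numeric)          # ['2024', '0042']
print(dst in numeric)   # True -> candidate copy transformation for this field
```
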
247 | 248 | * [A Unified Conversational Assistant Framework for Business Process Automation](https://arxiv.org/pdf/2001.03543.pdf) 249 |
250 | Yara Rizk, Abhishek Bhandwalder, S. Boag, T. Chakraborti, Vatche Isahagian, Y. Khazaeni, Falk Pollock, M. Unuvar - 2020 251 | Business process automation is a booming multi-billion-dollar industry that promises to remove menial tasks from workers' plates -- through the introduction of autonomous agents -- and free up their time and brain power for more creative and engaging tasks. However, an essential component to the successful deployment of such autonomous agents is the ability of business users to monitor their performance and customize their execution. A simple and user-friendly interface with a low learning curve is necessary to increase the adoption of such agents in banking, insurance, retail and other domains. As a result, proactive chatbots will play a crucial role in the business automation space. Not only can they respond to users' queries and perform actions on their behalf but also initiate communication with the users to inform them of the system's behavior. This will provide business users a natural language interface to interact with, monitor and control autonomous agents. In this work, we present a multi-agent orchestration framework to develop such proactive chatbots by discussing the types of skills that can be composed into agents and how to orchestrate these agents. Two use cases on a travel preapproval business process and a loan application business process are adopted to qualitatively analyze the proposed framework based on four criteria: performance, coding overhead, scalability, and agent overlap. 252 |
253 | 254 | * [Robotic Process Automation - A Systematic Literature Review and Assessment Framework](https://arxiv.org/pdf/2012.11951.pdf) 255 |
256 | Judith Wewerka, Manfred Reichert - 2020 257 | Robotic Process Automation (RPA) is the automation of rule-based routine processes to increase efficiency and to reduce costs. Due to the utmost importance of process automation in industry, RPA attracts increasing attention in the scientific field as well. This paper presents the state-of-the-art in the RPA field by means of a Systematic Literature Review (SLR). In this SLR, 63 publications are identified, categorised, and analysed along well-defined research questions. From the SLR findings, moreover, a framework for systematically analysing, assessing, and comparing existing as well as upcoming RPA works is derived. The discovered thematic clusters advise further investigations in order to develop an even more detailed structural research approach for RPA. 258 |
259 | 260 | * [Robotic Process Automation](https://link.springer.com/article/10.1007/s12599-018-0542-4) 261 |
262 | Wil M. P. van der Aalst, Martin Bichler, Armin Heinzl Business & Information Systems Engineering 2018 263 | A foundational question for many BISE (Business and Information Systems Engineering) authors and readers is “What should be automated and what should be done by humans?” This question is not new. However, developments in data science, machine learning, and artificial intelligence force us to revisit this question continuously. Robotic Process Automation (RPA) is one of these developments. RPA is an umbrella term for tools that operate on the user interface of other computer systems in the way a human would do. RPA aims to replace people by automation done in an “outside-in’’ manner. This differs from the classical “inside-out” approach to improve information systems. Unlike traditional workflow technology, the information system remains unchanged. Gartner defines Robotic Process Automation (RPA) as follows: “RPA tools perform [if, then, else] statements on structured data, typically using a combination of user interface interactions, or by connecting to APIs to drive client servers, mainframes or HTML code. An RPA tool operates by mapping a process in the RPA tool language for the software robot to follow, with runtime allocated to execute the script by a control dashboard.” (Tornbohm 2017). Hence, RPA tools aim to reduce the burden of repetitive, simple tasks on employees (Aguirre and Rodriguez 2017). Commercial vendors of RPA tools have witnessed a surge in demand. Moreover, many new vendors entered the market in the last 2 years. This is no surprise as most organizations are still looking for ways to cut costs and quickly link legacy applications together. RPA is currently seen as a way to quickly achieve a high Return on Investment (RoI). There are dedicated RPA vendors like AutomationEdge, Automation Anywhere, Blue Prism, Kryon Systems, Softomotive, and UiPath that only offer RPA software (Le Clair 2017; Tornbohm 2017). There are also many other vendors that have embedded RPA functionality in their software or that are offering several tools (not just RPA). For example, Pegasystems and Cognizant provide RPA next to traditional BPM, CRM, and BI functionality. The goal of this editorial is to reflect on these developments and to discuss RPA research challenges for the BISE community. 264 |
265 | 266 | * [Robotic Process Automation of Unstructured Data with Machine Learning](https://pdfs.semanticscholar.org/bb4c/ec661f4d5d0b83c49353b896f16ed7bdd55e.pdf) 267 |
268 | Anna Wróblewska, Tomasz Stanisławek, Bartłomiej Prus-Zajączkowski, Łukasz Garncarek FedCSIS 2018 269 | In this paper we present our work in progress on building an artificial intelligence system dedicated to tasks regarding the processing of formal documents used in various kinds of business procedures. The main challenge is to build machine learning (ML) models to improve the quality and efficiency of business processes involving image processing, optical character recognition (OCR), text mining and information extraction. In the paper we introduce the research and application field, some common techniques used in this area and our preliminary results and conclusions. 270 |
271 | 272 | -------------------------------------------------------------------------------- /topics/sdu/README.md: -------------------------------------------------------------------------------- 1 | ## Table of contents 2 | 3 | 1. [Papers](#papers) 4 | 1. [Datasets](#datasets) 5 | 6 | 7 | ## Papers 8 | 9 | #### 2021 10 | 11 | * [On Generating Extended Summaries of Long Documents](https://arxiv.org/abs/2012.14136v1), \[[code](https://github.com/Georgetown-IR-Lab/ExtendedSumm) ![](https://img.shields.io/github/stars/Georgetown-IR-Lab/ExtendedSumm.svg?style=social)\] 12 |
13 | Sajad Sotudeh, Arman Cohan, Nazli Goharian SDU 2021 14 | Prior work in document summarization has mainly focused on generating short summaries of a document. While this type of summary helps get a high-level view of a given document, it is desirable in some cases to know more detailed information about its salient points that can't fit in a short summary. This is typically the case for longer documents such as a research paper, legal document, or a book. In this paper, we present a new method for generating extended summaries of long papers. Our method exploits hierarchical structure of the documents and incorporates it into an extractive summarization model through a multi-task learning approach. We then present our results on three long summarization datasets, arXiv-Long, PubMed-Long, and Longsumm. Our method outperforms or matches the performance of strong baselines. Furthermore, we perform a comprehensive analysis over the generated results, shedding insights on future research for long-form summary generation task. Our analysis shows that our multi-tasking approach can adjust extraction probability distribution to the favor of summary-worthy sentences across diverse sections. 15 |
16 | 17 | #### 2020 18 | 19 | * [Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding](https://arxiv.org/pdf/2012.11760.pdf), \[[code/data](https://github.com/amirveyseh/AAAI-21-SDU-shared-task-2-AD) ![](https://img.shields.io/github/stars/amirveyseh/AAAI-21-SDU-shared-task-2-AD.svg?style=social)\] 20 |
21 | Amir Pouran Ben Veyseh, Franck Dernoncourt, Quan Hung Tran, Thien Huu Nguyen COLING 2020 22 | Acronyms are the short forms of longer phrases and they are frequently used in writing, especially scholarly writing, to save space and facilitate the communication of information. As such, every text understanding tool should be capable of recognizing acronyms in text (i.e., acronym identification) and also finding their correct meaning (i.e., acronym disambiguation). As most of the prior works on these tasks are restricted to the biomedical domain and use unsupervised methods or models trained on limited datasets, they fail to perform well for scientific document understanding. To push forward research in this direction, we have organized two shared tasks for acronym identification and acronym disambiguation in scientific documents, named AI@SDU and AD@SDU, respectively. The two shared tasks have attracted 52 and 43 participants, respectively. While the submitted systems make substantial improvements compared to the existing baselines, they are still far from human-level performance. This paper reviews the two shared tasks and the prominent participating systems for each of them. 23 |
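
For orientation, a naive pattern-based baseline for the acronym identification task (far weaker than the shared-task systems; the matching heuristic is an assumption, not taken from the paper):

```python
# Naive acronym identification baseline (orientation only): treat
# "some long form (SLF)" patterns as acronym definitions when the
# initials of the preceding words match the parenthesized acronym.
import re

def find_acronym_pairs(text: str):
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        preceding = text[:match.start()].split()
        candidate = preceding[-len(acronym):]                 # last N words
        initials = "".join(w[0].upper() for w in candidate if w)
        if initials == acronym:
            pairs.append((" ".join(candidate), acronym))
    return pairs

print(find_acronym_pairs("We study Named Entity Recognition (NER) on noisy OCR output."))
# [('Named Entity Recognition', 'NER')]
```
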
24 | 25 | * [AxCell: Automatic Extraction of Results from Machine Learning Papers](https://arxiv.org/abs/2004.14356), \[[code](https://github.com/paperswithcode/axcell) ![](https://img.shields.io/github/stars/paperswithcode/axcell.svg?style=social)\] 26 |
27 | Marcin Kardas, Piotr Czapla, Pontus Stenetorp, Sebastian Ruder, Sebastian Riedel, Ross Taylor, Robert Stojnic EMNLP 2020 28 | Tracking progress in machine learning has become increasingly difficult with the recent explosion in the number of papers. In this paper, we present AxCell, an automatic machine learning pipeline for extracting results from papers. AxCell uses several novel components, including a table segmentation subtask, to learn relevant structural knowledge that aids extraction. When compared with existing methods, our approach significantly improves the state of the art for results extraction. We also release a structured, annotated dataset for training models for results extraction, and a dataset for evaluating the performance of models on this task. Lastly, we show the viability of our approach enables it to be used for semi-automated results extraction in production, suggesting our improvements make this task practically viable for the first time. Code is available on GitHub. 29 |
30 | 31 | 32 | * [A New Neural Search and Insights Platform for Navigating and Organizing AI Research](https://www.aclweb.org/anthology/2020.sdp-1.23.pdf), \[[Website](https://search.zeta-alpha.com/)\] 33 |
34 | Marzieh Fadaee, Olga Gureenkova, Fernando Rejon Barrera, Carsten Schnober, Wouter Weerkamp, Jakub Zavrel SDP Workshop (EMNLP) 2020 35 | To provide AI researchers with modern tools for dealing with the explosive growth of the research literature in their field, we introduce a new platform, AI Research Navigator, that combines classical keyword search with neural retrieval to discover and organize relevant literature. The system provides search at multiple levels of textual granularity, from sentences to aggregations across documents, both in natural language and through navigation in a domain-specific Knowledge Graph. We give an overview of the overall architecture of the system and of the components for document analysis, question answering, search, analytics, expert search, and recommendations. 36 |
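The core idea of combining classical keyword search with neural retrieval can be sketched as a weighted sum of two similarity scores. The snippet below is a self-contained approximation, not the AI Research Navigator architecture: it uses word-level TF-IDF for the keyword component and character n-gram TF-IDF as a stand-in for a learned dense encoder, and the mixing weight `alpha` is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Neural retrieval models for scientific literature search.",
    "A knowledge graph of AI research topics and authors.",
    "Keyword-based indexing of arXiv papers.",
]

def hybrid_search(query, docs, alpha=0.5):
    """Combine a keyword score (word-level TF-IDF) with a stand-in 'semantic'
    score (character n-gram TF-IDF); a real system would use a learned dense
    encoder for the second component."""
    word_vec = TfidfVectorizer(analyzer="word")
    char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    keyword = cosine_similarity(word_vec.fit(docs).transform([query]),
                                word_vec.transform(docs))[0]
    semantic = cosine_similarity(char_vec.fit(docs).transform([query]),
                                 char_vec.transform(docs))[0]
    scores = alpha * keyword + (1 - alpha) * semantic
    return sorted(zip(scores, docs), reverse=True)

for score, doc in hybrid_search("neural search for AI papers", documents):
    print(round(float(score), 3), doc)
```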
37 | 38 | ### Datasets 39 | 40 | * [SciREX: A Challenge Dataset for Document-Level Information Extraction](https://www.aclweb.org/anthology/2020.acl-main.670.pdf) \[[code/data](https://github.com/allenai/SciREX) ![](https://img.shields.io/github/stars/allenai/SciREX.svg?style=social)\] 41 |
42 | Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy ACL 2020 43 | Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level since it requires an understanding of the whole document to annotate entities and their document-level relationships that usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. 44 |
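Document-level predictions such as salient entities are typically scored with set-based precision, recall, and F1 against the gold annotations. The sketch below shows that computation over (entity, type) pairs; the data structures are illustrative and this is not the official SciREX evaluation script.

```python
def salient_entity_f1(predicted, gold):
    """Exact-match precision/recall/F1 over (entity, type) pairs for one document.
    Illustrative only -- not the official SciREX evaluation."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("SciREX", "Dataset"), ("BERT", "Method"), ("F1", "Metric")}
pred = {("SciREX", "Dataset"), ("BERT", "Method"), ("ImageNet", "Dataset")}
print(salient_entity_f1(pred, gold))  # ≈ (0.67, 0.67, 0.67)
```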
45 | 46 | * [TableArXiv: Scientific Table Search Using Keyword Queries](https://arxiv.org/abs/1707.03423) \[[Website](http://boston.lti.cs.cmu.edu/eager/table-arxiv/)\] 47 |
48 | Kyle Yingkai Gao, Jamie Callan arxiv 2017 49 | Tables are common and important in scientific documents, yet most text-based document search systems do not capture structures and semantics specific to tables. How to bridge different types of mismatch between keyword queries and scientific tables, and what influences ranking quality, needs to be carefully investigated. This paper considers the structure of tables and gives different emphasis to table components. On the query side, thanks to external knowledge such as knowledge bases and ontologies, key concepts are extracted and used to build structured queries, and target quantity types are identified and used to expand original queries. A probabilistic framework is proposed to incorporate structural and semantic information from both query and table sides. We also construct and release TableArXiv, a high-quality dataset with 105 queries and corresponding relevance judgements for scientific table search. Experiments demonstrate significantly higher accuracy overall and at the top of the rankings than several baseline methods. 50 |
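Giving different emphasis to table components can be illustrated with a simple weighted term-matching score over the caption, headers, and data cells. The weights, tokenization, and example table below are arbitrary assumptions for illustration and do not reproduce the paper's probabilistic framework.

```python
# Default emphasis: caption > headers > data cells (assumed weights).
DEFAULT_WEIGHTS = {"caption": 3.0, "headers": 2.0, "cells": 1.0}

def table_score(query, table, weights=DEFAULT_WEIGHTS):
    """Score a table by counting query-term matches in each component,
    weighting the components differently. Toy sketch only."""
    terms = set(query.lower().split())
    score = 0.0
    for field, weight in weights.items():
        text = " ".join(table.get(field, [])).lower()
        score += weight * sum(text.count(term) for term in terms)
    return score

table = {
    "caption": ["Top-1 accuracy on ImageNet for different architectures"],
    "headers": ["Model", "Accuracy"],
    "cells": ["ResNet-50", "76.1", "ViT-B/16", "77.9"],
}
print(table_score("imagenet accuracy", table))  # 8.0 -- caption and header matches dominate
```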
51 | --------------------------------------------------------------------------------