DocParser: Hierarchical Document Structure Parsing from Renderings

Authors

  • Johannes Rausch, Department of Computer Science, ETH Zurich
  • Octavio Martinez, Department of Computer Science, ETH Zurich
  • Fabian Bissig, Department of Computer Science, ETH Zurich
  • Ce Zhang, Department of Computer Science, ETH Zurich
  • Stefan Feuerriegel, Department of Management, Technology, and Economics, ETH Zurich

DOI:

https://doi.org/10.1609/aaai.v35i5.16558

Keywords:

Applications, Information Extraction

Abstract

Translating renderings (e.g., PDFs, scans) into hierarchical document structures is in high demand across many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed “DocParser”: an end-to-end system for parsing complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data are scarce, which we address with a novel approach to weak supervision that significantly improves document structure parsing performance. Our experiments confirm the effectiveness of the proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and the F1 score for classifying hierarchical relations by 35.8%.

Published

2021-05-18

How to Cite

Rausch, J., Martinez, O., Bissig, F., Zhang, C., & Feuerriegel, S. (2021). DocParser: Hierarchical Document Structure Parsing from Renderings. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5), 4328-4338. https://doi.org/10.1609/aaai.v35i5.16558

Section

AAAI Technical Track on Data Mining and Knowledge Management