DocFormerv2: Local Features for Document Understanding

Authors

  • Srikar Appalaraju, AWS AI Labs
  • Peng Tang, AWS AI Labs
  • Qi Dong, AWS AI Labs
  • Nishant Sankaran, AWS AI Labs
  • Yichu Zhou, School of Computing, University of Utah
  • R. Manmatha, AWS AI Labs

DOI:

https://doi.org/10.1609/aaai.v38i2.27828

Keywords:

CV: Language and Vision, CV: Multi-modal Vision

Abstract

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). VDU entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form or answering visual questions about a document. VDU is challenging because a model must make sense of multiple modalities (visual, language, and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks applied asymmetrically: two novel document tasks on the encoder and one on the auto-regressive decoder. These unsupervised tasks have been carefully designed so that pre-training encourages local-feature alignment between the modalities. Evaluated on nine challenging datasets, DocFormerv2 achieves state-of-the-art performance over strong baselines on all of them, e.g., TabFact (+4.3%), InfoVQA (+1.4%), and FUNSD (+1.0%). Furthermore, to demonstrate generalization, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI, and Flamingo). Extensive ablations show that, due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU.
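To make the encoder-decoder setup concrete, here is a minimal sketch of how vision, language, and spatial features could be fused into joint token embeddings before an encoder-decoder transformer. This is an illustrative assumption, not the paper's implementation: all class names, dimensions, and the summation-based fusion are hypothetical choices for exposition.

```python
import torch
import torch.nn as nn

class MultiModalDocModel(nn.Module):
    """Hypothetical sketch of a vision + language + spatial encoder-decoder.

    Not DocFormerv2's actual architecture; dimensions are illustrative.
    """

    def __init__(self, vocab_size=1000, d_model=64, nhead=4, visual_dim=32):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # language (OCR tokens)
        self.bbox_proj = nn.Linear(4, d_model)               # spatial: (x0, y0, x1, y1)
        self.visual_proj = nn.Linear(visual_dim, d_model)    # vision: per-token features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)        # auto-regressive output

    def forward(self, token_ids, bboxes, visual_feats, decoder_ids):
        # One common fusion choice: sum the three modality embeddings per token.
        enc_in = (self.token_emb(token_ids)
                  + self.bbox_proj(bboxes)
                  + self.visual_proj(visual_feats))
        dec_in = self.token_emb(decoder_ids)
        out = self.transformer(enc_in, dec_in)
        return self.lm_head(out)

model = MultiModalDocModel()
logits = model(
    torch.randint(0, 1000, (2, 8)),   # 8 OCR tokens per document, batch of 2
    torch.rand(2, 8, 4),              # normalized bounding box per token
    torch.rand(2, 8, 32),             # visual feature per token
    torch.randint(0, 1000, (2, 5)),   # decoder input (e.g., answer tokens)
)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

The decoder output has one logit vector per target position, which is the shape an auto-regressive pre-training objective (as used on the decoder side) would consume.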

Published

2024-03-24

How to Cite

Appalaraju, S., Tang, P., Dong, Q., Sankaran, N., Zhou, Y., & Manmatha, R. (2024). DocFormerv2: Local Features for Document Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 709–718. https://doi.org/10.1609/aaai.v38i2.27828

Section

AAAI Technical Track on Computer Vision I