Extracting Zero-shot Structured Information from Form-like Documents: Pretraining with Keys and Triggers

Rongyu Cao; Ping Luo

doi:10.1609/aaai.v35i14.17494

Authors

Rongyu Cao: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences; University of Chinese Academy of Sciences
Ping Luo: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory

Keywords:

Information Extraction

Abstract

In this paper, we revisit the problem of extracting the values of a given set of key fields from form-like documents. This is a vital step in supporting many downstream applications, such as knowledge base construction, question answering, and document comprehension. Previous studies ignore the semantics of the given keys by treating them only as class labels, and thus may be unable to handle zero-shot keys. Meanwhile, although these models often leverage the attention mechanism, the learned features might not reflect why humans would recognize a value for a given key, and thus may not generalize well to new documents. To address these issues, we propose a Key-Aware and Trigger-Aware (KATA) extraction model. Given an input key, it explicitly learns two mappings: from key representations to trigger representations, and then from trigger representations to values. These two mappings might be intrinsic and invariant across different keys and documents. We pre-train these two mappings on a large training set automatically constructed from Wikipedia data. Experiments that fine-tune the model on two applications show that it achieves more than 70% accuracy on the extraction of zero-shot keys, whereas previous methods all fail.
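
The two-step idea in the abstract (key to trigger, then trigger to value) can be illustrated with a toy sketch. This is not the KATA model: the abstract describes learned representations, whereas the sketch below swaps in a hand-written synonym table and a simple "text after the colon" rule. The names `SYNONYMS`, `key_to_trigger`, `trigger_to_value`, and `extract`, as well as the sample form, are all hypothetical and exist only for illustration.

```python
import re

# ASSUMPTION: a hand-written synonym table stands in for the learned
# key-to-trigger representations described in the abstract.
SYNONYMS = {
    "birth": {"born", "birth", "dob"},
    "date": {"date", "born", "dob"},
    "employer": {"employer", "company", "organization"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def key_to_trigger(key, lines):
    """Step 1 (toy): pick the line whose label best overlaps the expanded key."""
    expanded = set()
    for tok in tokenize(key):
        expanded |= SYNONYMS.get(tok, {tok})
    best_line, best_score = None, 0
    for line in lines:
        if ":" not in line:
            continue
        label = line.split(":", 1)[0]
        score = len(expanded & set(tokenize(label)))
        if score > best_score:
            best_line, best_score = line, score
    return best_line


def trigger_to_value(trigger_line):
    """Step 2 (toy): the value is whatever follows the trigger label."""
    return trigger_line.split(":", 1)[1].strip()


def extract(key, document):
    """Chain the two steps; return None when no trigger matches the key."""
    trigger_line = key_to_trigger(key, document.splitlines())
    return None if trigger_line is None else trigger_to_value(trigger_line)


form = "Name: Ada Lovelace\nBorn: 10 December 1815\nCompany: Analytical Engines Ltd"
print(extract("birth date", form))
print(extract("employer", form))
```

In this sketch the key "birth date" never appears in the form, yet it is resolved through the trigger "Born", which loosely mirrors how a key's semantics, rather than a fixed class label, could drive extraction for unseen keys.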
