Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

Authors

  • Lei Wang, University of Electronic Science and Technology of China; Singapore Management University
  • Jiabang He, University of Electronic Science and Technology of China
  • Xing Xu, University of Electronic Science and Technology of China
  • Ning Liu, Beijing Forestry University
  • Hui Liu, Beijing Rongda Technology Co., Ltd.

DOI:

https://doi.org/10.1609/aaai.v37i2.25357

Keywords:

CV: Multi-modal Vision, CV: Visual Reasoning & Symbolic Representations, ML: Multimodal Learning, SNLP: Applications

Abstract

Alignment between image and text has yielded promising improvements in patch-level pre-trained document image models. However, investigating more effective or finer-grained alignment techniques during pre-training requires substantial computation and time. A question thus naturally arises: could we fine-tune pre-trained models with alignment objectives while adapting them to downstream tasks, and achieve comparable or better performance? In this paper, we propose a new model architecture with alignment-enriched tuning (dubbed AETNet) built upon pre-trained document image models, which adapts to downstream tasks with a joint objective combining task-specific supervision and alignment-aware contrastive learning. Specifically, we introduce an extra visual transformer as the alignment-aware image encoder and an extra text transformer as the alignment-aware text encoder before multimodal fusion. We consider alignment in three aspects: 1) document-level alignment, leveraging cross-modal and intra-modal contrastive losses; 2) global-local alignment, modeling localized and structural information in document images; and 3) local-level alignment, for more accurate patch-level information. Experiments show that AETNet achieves state-of-the-art performance on various downstream tasks. Notably, AETNet consistently outperforms state-of-the-art pre-trained models, such as LayoutLMv3 with fine-tuning techniques, on three different downstream tasks. Code is available at https://github.com/MAEHCM/AET.
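To make the document-level cross-modal alignment objective concrete, below is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired image/text embeddings, as produced by the alignment-aware image and text encoders. This is an illustrative reconstruction, not the authors' exact implementation; the function name, temperature value, and NumPy formulation are assumptions.

```python
import numpy as np

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; image_emb and text_emb are (batch, dim)
    arrays where the pair at the same index is the matched image/text pair.
    Illustrative sketch only; the temperature value is an assumption."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    # (batch, batch) similarity matrix; matched pairs lie on the diagonal
    logits = image_emb @ text_emb.T / temperature

    def info_nce(l):
        # row-wise log-softmax with max-subtraction for numerical stability
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # negative log-likelihood of the matched (diagonal) pair, averaged
        return -np.mean(np.diag(log_prob))

    # average the image-to-text and text-to-image directions
    return 0.5 * (info_nce(logits) + info_nce(logits.T))
```

The intra-modal variant mentioned in the abstract would apply the same form within one modality (e.g. two augmented views of the same document image) rather than across modalities.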

Published

2023-06-26

How to Cite

Wang, L., He, J., Xu, X., Liu, N., & Liu, H. (2023). Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2590-2598. https://doi.org/10.1609/aaai.v37i2.25357

Section

AAAI Technical Track on Computer Vision II