HybriDLA: Hybrid Generation for Document Layout Analysis

Yufan Chen; Omar Moured; Ruiping Liu; Junwei Zheng; Kunyu Peng; Jiaming Zhang; Rainer Stiefelhagen

doi:10.1609/aaai.v40i4.37308

Authors

Yufan Chen, Karlsruhe Institute of Technology
Omar Moured, Karlsruhe Institute of Technology
Ruiping Liu, Karlsruhe Institute of Technology
Junwei Zheng, Karlsruhe Institute of Technology
Kunyu Peng, Karlsruhe Institute of Technology
Jiaming Zhang, Hunan University
Rainer Stiefelhagen, Karlsruhe Institute of Technology

DOI: https://doi.org/10.1609/aaai.v40i4.37308

Abstract

Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address these challenges, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M6Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches.
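
The mAP figure above is a detection metric. Under COCO-style evaluation (assumed here; the abstract does not state the exact protocol), a predicted region counts as correct when its intersection over union (IoU) with a ground-truth box exceeds a threshold. The following is a minimal sketch of that IoU computation only, not of HybriDLA itself; the function name and the box coordinates are hypothetical and chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap rectangle; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth regions (pixel coordinates).
predicted = (10.0, 10.0, 110.0, 60.0)
ground_truth = (20.0, 10.0, 120.0, 60.0)
score = iou(predicted, ground_truth)
```

Average Precision is then computed per class from the precision-recall curve at a given IoU threshold, and mAP averages those values across classes (and, in the COCO convention, across several thresholds).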
