HybriDLA: Hybrid Generation for Document Layout Analysis

Authors

  • Yufan Chen, Karlsruhe Institute of Technology
  • Omar Moured, Karlsruhe Institute of Technology
  • Ruiping Liu, Karlsruhe Institute of Technology
  • Junwei Zheng, Karlsruhe Institute of Technology
  • Kunyu Peng, Karlsruhe Institute of Technology
  • Jiaming Zhang, Hunan University
  • Rainer Stiefelhagen, Karlsruhe Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i4.37308

Abstract

Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address the challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M6Doc benchmarks demonstrate that HybriDLA achieves state-of-the-art performance, outperforming previous approaches.
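
For readers unfamiliar with the metric, mAP in detection-style tasks is built on intersection-over-union (IoU) matching between predicted and ground-truth regions. The following is a minimal sketch of that IoU step only, not the authors' evaluation code; the function name `iou` and the `(x1, y1, x2, y2)` box format are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2;
    this format is an illustrative assumption, not taken from the paper.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero
    # so that disjoint boxes contribute no intersection area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A predicted region is typically counted as correct when its IoU with a ground-truth region of the same class exceeds a chosen threshold; average precision is then computed per class and averaged to give mAP.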

Published

2026-03-14

How to Cite

Chen, Y., Moured, O., Liu, R., Zheng, J., Peng, K., Zhang, J., & Stiefelhagen, R. (2026). HybriDLA: Hybrid Generation for Document Layout Analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 3147–3155. https://doi.org/10.1609/aaai.v40i4.37308

Issue

Vol. 40 No. 4 (2026)

Section

AAAI Technical Track on Computer Vision I