LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection

Feijiang Han; Zelong Wang; Bowen Wang; Xinxin Liu; Skyler Cheung; Delip Rao; Chris Callison-Burch; Lyle Ungar

doi:10.1609/aaai.v40i37.40349

Authors

Feijiang Han, University of Pennsylvania
Zelong Wang, University of Pennsylvania
Bowen Wang, University of Pennsylvania
Xinxin Liu, University of Pennsylvania
Skyler Cheung, University of Pennsylvania
Delip Rao, University of Pennsylvania
Chris Callison-Burch, University of Pennsylvania
Lyle Ungar, University of Pennsylvania

Abstract

General-purpose Vision-Language Models (VLMs) are increasingly integral to modern AI systems for document understanding, yet their ability to perform fine-grained layout analysis remains severely underdeveloped. Overcoming this limitation requires large-scale, high-fidelity training datasets. However, current annotation methods that rely on parsing rendered PDFs are costly, error-prone, and difficult to scale. We propose a different paradigm: extracting ground-truth layout directly from the LaTeX compilation process rather than the final PDF. We present LaTeX2Layout, a generalizable procedural pipeline that recovers pixel-accurate bounding boxes and reading order from compiler traces. This enables the generation of a 140K-page dataset, including 120K programmatically generated synthetic variants that more than double the layout diversity of real-world data. Using this dataset, we fine-tune an efficient 3B-parameter VLM with an easy-to-hard curriculum that accelerates convergence. Our model achieves Kendall's τ = 0.95 for reading order and mAP@50 = 0.91 for element grounding, delivering nearly 200% relative improvement over strong zero-shot baselines such as GPT-4o and Claude-3.7.
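
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of the quantities they build on. It is not the authors' evaluation code: the function names `kendall_tau` and `iou` are hypothetical, the reading orders are assumed to be tie-free permutations of the same element ids, and boxes are assumed to be `(x0, y0, x1, y1)` tuples. Kendall's tau compares a predicted reading order with a reference order by counting concordant and discordant pairs. The "50" in mAP@50 refers to counting a predicted box as a match when its intersection-over-union (IoU) with a ground-truth box is at least 0.5; the averaging of precision over classes and confidence thresholds is omitted here.

```python
from itertools import combinations


def kendall_tau(predicted, reference):
    """Kendall's tau-a between two orderings of the same element ids (no ties)."""
    n = len(reference)
    if (len(predicted) != n or len(set(reference)) != n
            or set(predicted) != set(reference)):
        raise ValueError("orderings must be permutations of the same unique ids")
    if n < 2:
        return 1.0  # a single element is trivially in order
    rank = {elem: i for i, elem in enumerate(predicted)}
    concordant = discordant = 0
    # In each pair (a, b), a precedes b in the reference order.
    for a, b in combinations(reference, 2):
        if rank[a] < rank[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Under these assumptions, `kendall_tau` returns 1.0 for identical orders and -1.0 for fully reversed ones, and a predicted box would count toward mAP@50 only when `iou(pred, truth) >= 0.5`.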
