Layout-Aware Document Parsing with Visual-Linguistic Fusion: The DATA-LUX with Academic Content Service Provider

Min Chan Kim; Yeonkyung Kim; Jae Won Lee; Ki Hwan Kim; Ji Woo Kwak; Jae Hong Park

doi:10.1609/aaai.v40i47.41438

Authors

Min Chan Kim, Kyung Hee University
Yeonkyung Kim, Kyung Hee University
Jae Won Lee, ALLBIGDAT
Ki Hwan Kim, ALLBIGDAT
Ji Woo Kwak, ALLBIGDAT
Jae Hong Park, Kyung Hee University

DOI:

https://doi.org/10.1609/aaai.v40i47.41438

Abstract

Organizations increasingly rely on unstructured documents such as PDFs and scanned forms to support downstream large language model (LLM) services, including search, summarization, and recommendation. However, traditional OCR systems struggle with diverse document layouts, leading to frequent errors and high labor costs. This study therefore developed DATALUX, a robust document layout system that transforms unstructured documents into structured, machine-readable data suitable for automation. Built on a transformer-based detector, DATALUX incorporates modules for layout refinement, text-visual fusion, and layer-wise optimization to improve coherence and generalization across diverse layouts. Around January 2025, we deployed DATALUX at Nurimedia, one of the largest academic content service firms in South Korea. The firm needed to extract metadata and references from thousands of academic papers submitted in various formats. Existing LLM-based tools produced unreliable results, so the papers had to be processed manually, creating bottlenecks in both labor and time. With DATALUX, the firm automatically structures over 100,000 research papers a year, with extraction accuracy improved to over 97%, costs reduced by more than USD 185K annually, and processing accelerated by a factor of 8.7. These deployment results suggest that DATALUX enables scalable and efficient document automation in complex, high-volume environments. We thus believe that DATALUX has a significant impact on both academic and industry practice.
