LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs

Authors

  • Junlong Jia, School of Artificial Intelligence, Beihang University; Zhongguancun Laboratory, Beijing; LMIB, NLSDE, Beihang University, Beijing
  • Xing Wu, Institute of Information Engineering, Chinese Academy of Sciences; Xiaohongshu Inc
  • Chaochen Gao, Institute of Information Engineering, Chinese Academy of Sciences
  • Ziyang Chen, Institute of Information Engineering, Chinese Academy of Sciences
  • Zijia Lin, Tsinghua University
  • Zhongzhi Li, Xiaohongshu Inc
  • Weinong Wang, Xiaohongshu Inc
  • Haotian Xu, Xiaohongshu Inc
  • Donghui Jin, School of Artificial Intelligence, Beihang University; LMIB, NLSDE, Beihang University, Beijing
  • Debing Zhang, Xiaohongshu Inc
  • Binghui Guo, School of Artificial Intelligence, Beihang University; Zhongguancun Laboratory, Beijing; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beijing; LMIB, NLSDE, Beihang University, Beijing

DOI:

https://doi.org/10.1609/aaai.v40i37.40390

Abstract

High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches that aggregate documents by relevance are computationally expensive. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, then employs a debate mechanism among multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on the HELMET and RULER benchmarks demonstrate that LiteLong achieves competitive long-context performance and integrates seamlessly with other long-dependency enhancement methods. By reducing both computational and data-engineering costs, LiteLong makes high-quality long-context data synthesis more accessible, facilitating further research in long-context language model training.
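
To make the per-topic retrieval-and-concatenation step concrete, here is a minimal sketch assuming the third-party rank_bm25 package. The document pool, topic label, and whitespace-based token count are illustrative stand-ins, not the authors' implementation, which would use the full BISAC-derived topic set and the target model's tokenizer for length accounting.

```python
# Sketch: per-topic BM25 retrieval, then greedy concatenation up to ~128K tokens.
from rank_bm25 import BM25Okapi

MAX_TOKENS = 128_000  # target training-sample length from the paper

# Hypothetical document pool and BISAC-style topic label.
corpus = [
    "A survey of antique book restoration techniques ...",
    "Collecting first editions: a practical guide ...",
    "Unrelated text about weather patterns ...",
]
topic = "ANTIQUES & COLLECTIBLES / Books"

# Index the pool; whitespace tokenization is a stand-in for a real tokenizer.
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Rank all documents against the topic query, most relevant first.
ranked = bm25.get_top_n(topic.split(), corpus, n=len(corpus))

# Greedily pack retrieved documents until the token budget is exhausted.
picked, n_tokens = [], 0
for doc in ranked:
    doc_len = len(doc.split())  # replace with the LLM tokenizer in practice
    if n_tokens + doc_len > MAX_TOKENS:
        break
    picked.append(doc)
    n_tokens += doc_len

sample = "\n\n".join(picked)  # one long-context training sample for this topic
```

Because BM25 is a sparse lexical scorer, this step needs no GPU and scales with a standard inverted index, which is the source of the resource savings the abstract claims over relevance-based aggregation with learned embeddings.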

Published

2026-03-14

How to Cite

Jia, J., Wu, X., Gao, C., Chen, Z., Lin, Z., Li, Z., Wang, W., Xu, H., Jin, D., Zhang, D., & Guo, B. (2026). LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 31274-31282. https://doi.org/10.1609/aaai.v40i37.40390

Section

AAAI Technical Track on Natural Language Processing II