Scaling Towards the Information Boundary of Instructions through Data Synthesizing
DOI: https://doi.org/10.1609/aaai.v40i36.40311

Abstract
High-quality instructions are crucial for aligning pretrained models to improve their performance on downstream tasks. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and with tasks in rare domains. This is primarily due to limited expansion in both the "coverage" (span of task types and knowledge areas) and the "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework that integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and model deficiency diagnosis with targeted data generation. These components form an iterative closed loop that continuously enhances the coverage and depth of instruction data. Based on this framework, we construct Infinity Instruct Subject, a high-quality dataset containing approximately 1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that Infinity Instruct Subject offers broader coverage and greater depth than comparable synthesized instruction datasets.
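The closed loop the abstract describes (label, select seeds, evolve, diagnose, generate) can be sketched at a toy level. All function names and the scoring heuristics below are our own illustrative assumptions, not the authors' implementation:

```python
def assign_label(instruction: str) -> str:
    """Toy stand-in for the hierarchical labeling system: bucket by keyword."""
    for domain in ("math", "code", "law"):
        if domain in instruction.lower():
            return domain
    return "general"

def select_seeds(pool: list[str], per_label: int = 1) -> list[str]:
    """Toy informative-seed selection: keep the longest instruction per label."""
    by_label: dict[str, list[str]] = {}
    for inst in pool:
        by_label.setdefault(assign_label(inst), []).append(inst)
    seeds: list[str] = []
    for insts in by_label.values():
        seeds.extend(sorted(insts, key=len, reverse=True)[:per_label])
    return seeds

def evolve(seed: str) -> str:
    """Toy evolutionary synthesis: deepen a seed by adding a constraint."""
    return seed + " Additionally, justify each step of your answer."

def diagnose(covered: set[str], all_labels: set[str]) -> set[str]:
    """Toy deficiency diagnosis: labels with no instructions yet."""
    return all_labels - covered

pool = ["Solve this math equation.", "Write code to sort a list."]
all_labels = {"math", "code", "law", "general"}
for _ in range(2):  # iterative closed loop
    pool.extend(evolve(s) for s in select_seeds(pool))
    gaps = diagnose({assign_label(i) for i in pool}, all_labels)
    pool.extend(f"Draft a {label} question." for label in gaps)  # targeted generation

print(len(pool), sorted({assign_label(i) for i in pool}))
# → 10 ['code', 'general', 'law', 'math']
```

Here an LLM-based labeler, seed scorer, and instruction evolver would replace each keyword heuristic; the point is only the control flow, in which the diagnosis step feeds targeted generation back into the pool each round.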
Published
2026-03-14
How to Cite
Du, L., Zhao, H., Ju, Y., & Pan, T. (2026). Scaling Towards the Information Boundary of Instructions through Data Synthesizing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30566-30574. https://doi.org/10.1609/aaai.v40i36.40311
Section
AAAI Technical Track on Natural Language Processing I