Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning

Authors

  • Yiming Huang, Microsoft
  • Xiao Liu, Microsoft
  • Yeyun Gong, Microsoft
  • Zhibin Gou, Microsoft
  • Yelong Shen, Microsoft
  • Nan Duan, Microsoft
  • Weizhu Chen, Microsoft

DOI:

https://doi.org/10.1609/aaai.v39i23.34593

Abstract

Large language models have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality and reasoning-focused training datasets. Addressing this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that synthesizes question-answer pairs by leveraging key points and exemplar practices from authentic data sources. KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability. As a result, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. Utilizing KPMath and augmenting it with additional reasoning-intensive corpora, we create the comprehensive KPMath-Plus dataset. Our experiments demonstrate that this dataset can enhance the mathematical reasoning performance of models across various architectures and sizes. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 72B range and best commercial models like GPT-4 across multiple math reasoning datasets.

Published

2025-04-11

How to Cite

Huang, Y., Liu, X., Gong, Y., Gou, Z., Shen, Y., Duan, N., & Chen, W. (2025). Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24176–24184. https://doi.org/10.1609/aaai.v39i23.34593

Section

AAAI Technical Track on Natural Language Processing II