A Compact Model for Mathematics Problem Representations Distilled from BERT

Authors

  • Hao Ming Faculty of Artificial Intelligence in Education, Central China Normal University
  • Xinguo Yu Faculty of Artificial Intelligence in Education, Central China Normal University; Central China Normal University Wollongong Joint Institute, Central China Normal University
  • Xiaotian Cheng Faculty of Artificial Intelligence in Education, Central China Normal University; Central China Normal University Wollongong Joint Institute, Central China Normal University
  • Zhenquan Shen Faculty of Artificial Intelligence in Education, Central China Normal University
  • Xiaopan Lyu Faculty of Artificial Intelligence in Education, Central China Normal University

DOI:

https://doi.org/10.1609/aaai.v39i23.34669

Abstract

Large language models (LLMs) have made significant advancements in math problem solving, but their large size and high latency render them impractical for real-world deployment in intelligent mathematics solvers. Recently, task-agnostic compact models have been developed to replace LLMs in general natural language processing tasks. However, these models often struggle to acquire sufficient math-related knowledge from LLMs, leading to unsatisfactory performance in solving math word problems (MWPs). To develop a specialized compact model for representing MWPs, we apply knowledge distillation (KD) to extract mathematical semantic knowledge from the large pre-trained model BERT. Effective knowledge types and distillation strategies are explored through extensive experiments. Our KD algorithm employs multi-knowledge distillation to extract fundamental knowledge from the hidden states of the lower and middle layers, while also incorporating knowledge of mathematical relations and symbol constraints from higher-layer outputs and math decoder outputs, by leveraging bottleneck networks. Pre-training tasks on MWP datasets, such as masked language modeling and part-of-speech tagging, are also utilized to enhance the generalization of the compact model for MWP understanding. Additionally, a simple parameter mixing strategy is employed to prevent catastrophic forgetting of acquired knowledge. Our findings indicate that our approach can reduce the size of a BERT model by 10% while retaining approximately 95% of its performance on MWP datasets, outperforming mainstream BERT-based task-agnostic compact models. The efficacy of each component has been validated through ablation studies.
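The multi-knowledge objective described above can be illustrated with a minimal sketch: a hidden-state matching loss over a student-to-teacher layer mapping, combined with a temperature-softened distillation loss on output logits, blended by illustrative weights. All function names, the layer mapping, and the weights `alpha`/`beta` are hypothetical placeholders, not the paper's actual implementation (which additionally uses bottleneck networks, math decoder outputs, and parameter mixing).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hidden_state_loss(student_hiddens, teacher_hiddens, layer_map):
    """MSE between student and teacher hidden states for mapped layers.

    layer_map: {student_layer_index: teacher_layer_index}, e.g. mapping a
    shallow student onto the teacher's lower/middle layers.
    """
    losses = [np.mean((student_hiddens[s] - teacher_hiddens[t]) ** 2)
              for s, t in layer_map.items()]
    return float(np.mean(losses))

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft cross-entropy between temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean()
                 * temperature ** 2)

def multi_knowledge_loss(student_hiddens, teacher_hiddens, layer_map,
                         student_logits, teacher_logits,
                         alpha=1.0, beta=1.0):
    """Weighted sum of the two knowledge terms (weights are illustrative)."""
    return (alpha * hidden_state_loss(student_hiddens, teacher_hiddens, layer_map)
            + beta * logit_kd_loss(student_logits, teacher_logits))

# Toy shapes: a 4-layer student distilled from a 12-layer teacher,
# batch of 2 sequences of length 5, hidden size 8, vocab size 10.
rng = np.random.default_rng(0)
student_h = [rng.standard_normal((2, 5, 8)) for _ in range(4)]
teacher_h = [rng.standard_normal((2, 5, 8)) for _ in range(12)]
layer_map = {0: 2, 1: 4, 2: 6, 3: 8}  # student layer -> teacher layer
student_out = rng.standard_normal((2, 5, 10))
teacher_out = rng.standard_normal((2, 5, 10))

loss = multi_knowledge_loss(student_h, teacher_h, layer_map,
                            student_out, teacher_out)
```

In practice each term would be backpropagated through the student only; the teacher's hidden states and logits are fixed targets computed in a forward pass.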

Published

2025-04-11

How to Cite

Ming, H., Yu, X., Cheng, X., Shen, Z., & Lyu, X. (2025). A Compact Model for Mathematics Problem Representations Distilled from BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24867–24875. https://doi.org/10.1609/aaai.v39i23.34669

Section

AAAI Technical Track on Natural Language Processing II