A Compact Model for Mathematics Problem Representations Distilled from BERT

Authors

  • Hao Ming Faculty of Artificial Intelligence in Education, Central China Normal University
  • Xinguo Yu Faculty of Artificial Intelligence in Education, Central China Normal University; Central China Normal University Wollongong Joint Institute, Central China Normal University
  • Xiaotian Cheng Faculty of Artificial Intelligence in Education, Central China Normal University; Central China Normal University Wollongong Joint Institute, Central China Normal University
  • Zhenquan Shen Faculty of Artificial Intelligence in Education, Central China Normal University
  • Xiaopan Lyu Faculty of Artificial Intelligence in Education, Central China Normal University

DOI:

https://doi.org/10.1609/aaai.v39i23.34669

Abstract

Large language models (LLMs) have made significant advancements in math problem solving, but their large size and high latency render them impractical for real-world deployment in intelligent mathematics solvers. Recently, task-agnostic compact models have been developed to replace LLMs in general natural language processing tasks. However, these models often struggle to acquire sufficient math-related knowledge from LLMs, leading to unsatisfactory performance in solving math word problems (MWPs). To develop a specialized compact model for representing MWPs, we apply knowledge distillation (KD) to extract mathematical semantic knowledge from the large pre-trained model BERT. Effective knowledge types and distillation strategies are explored through extensive experiments. Our KD algorithm employs multi-knowledge distillation to extract fundamental knowledge from the hidden states of the lower and middle layers, while also incorporating knowledge of mathematical relations and symbol constraints from higher-layer outputs and math decoder outputs, by leveraging bottleneck networks. Pre-training tasks on MWP datasets, such as masked language modeling and part-of-speech tagging, are also utilized to enhance the generalization of the compact model for MWP understanding. Additionally, a simple parameter mixing strategy is employed to prevent catastrophic forgetting of acquired knowledge. Our findings indicate that our approach can reduce the size of a BERT model by 10% while retaining approximately 95% of its performance on MWP datasets, outperforming mainstream BERT-based task-agnostic compact models. The efficacy of each component has been validated through ablation studies.
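The multi-knowledge objective described above can be illustrated with a minimal sketch: a hidden-state matching loss over a student-to-teacher layer mapping, combined with a temperature-softened distillation loss on output logits, blended by illustrative weights. All function names, the layer mapping, and the weights `alpha`/`beta` are hypothetical placeholders, not the paper's actual implementation (which additionally uses bottleneck networks, math decoder outputs, and parameter mixing).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hidden_state_loss(student_hiddens, teacher_hiddens, layer_map):
    """MSE between student and teacher hidden states for mapped layers.

    layer_map: {student_layer_index: teacher_layer_index}, e.g. mapping a
    shallow student onto the teacher's lower/middle layers.
    """
    losses = [np.mean((student_hiddens[s] - teacher_hiddens[t]) ** 2)
              for s, t in layer_map.items()]
    return float(np.mean(losses))

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft cross-entropy between temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean()
                 * temperature ** 2)

def multi_knowledge_loss(student_hiddens, teacher_hiddens, layer_map,
                         student_logits, teacher_logits,
                         alpha=1.0, beta=1.0):
    """Weighted sum of the two knowledge terms (weights are illustrative)."""
    return (alpha * hidden_state_loss(student_hiddens, teacher_hiddens, layer_map)
            + beta * logit_kd_loss(student_logits, teacher_logits))

# Toy shapes: a 4-layer student distilled from a 12-layer teacher,
# batch of 2 sequences of length 5, hidden size 8, vocab size 10.
rng = np.random.default_rng(0)
student_h = [rng.standard_normal((2, 5, 8)) for _ in range(4)]
teacher_h = [rng.standard_normal((2, 5, 8)) for _ in range(12)]
layer_map = {0: 2, 1: 4, 2: 6, 3: 8}  # student layer -> teacher layer
student_out = rng.standard_normal((2, 5, 10))
teacher_out = rng.standard_normal((2, 5, 10))

loss = multi_knowledge_loss(student_h, teacher_h, layer_map,
                            student_out, teacher_out)
```

In practice each term would be backpropagated through the student only; the teacher's hidden states and logits are fixed targets computed in a forward pass.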

Published

2025-04-11

How to Cite

Ming, H., Yu, X., Cheng, X., Shen, Z., & Lyu, X. (2025). A Compact Model for Mathematics Problem Representations Distilled from BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 24867–24875. https://doi.org/10.1609/aaai.v39i23.34669

Section

AAAI Technical Track on Natural Language Processing II