FUSION: Dataset Pruning via Fusing Uncertainty with Structural Information for Optimal Neural Training in Crystal Property Prediction

Authors

  • Xiean Wang, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
  • Pin Chen, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; National Supercomputer Center in Guangzhou, China
  • Liqin Tan, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
  • Yutong Lu, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; National Supercomputer Center in Guangzhou, China
  • Qingsong Zou, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Guangdong Province Key Laboratory of Computational Science, Sun Yat-sen University, Guangzhou, China

DOI:

https://doi.org/10.1609/aaai.v40i31.39863

Abstract

The rapid expansion of materials databases offers unprecedented opportunities for accelerating materials discovery via machine learning. However, the widespread assumption that larger datasets inherently produce better models does not hold in practice. We propose FUSION (Fusing Uncertainty with Structural Information for Optimal Neural training), an offline dataset pruning strategy that combines uncertainty quantification with crystallographic structure analysis via geometric fingerprinting and frames dataset pruning as a discrete optimization problem. Evaluated on three benchmark datasets, FUSION consistently outperforms baselines including random pruning, uncertainty sampling, weighting factor pruning, diversity sampling, and active learning. It transfers robustly across 11 diverse architectures, outperforming random pruning by 1.91–13.65% depending on the dataset, with an average improvement of 6.36%. Moreover, our analysis suggests that models differ in their robustness to pruned training data, highlighting the importance of model selection tailored to dataset composition. We identify optimal pruning points at which removing just 0–8% of the training data improves model performance, yielding gains of up to 12.67% in specific model–dataset combinations. These results establish a new paradigm for materials informatics that prioritizes data quality over quantity, offering a pathway toward more efficient and sustainable machine learning workflows in computational materials science.
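
The general idea of scoring samples by both model uncertainty and structural similarity can be illustrated with a minimal Python sketch. This is not the FUSION algorithm, whose objective and solver are not given in the abstract. It is a greedy heuristic built on assumptions: the function name prune_dataset, the weight alpha, the use of Euclidean distance between fingerprint vectors, and the choice to remove samples that are both structurally redundant and low in uncertainty are all hypothetical.

```python
import numpy as np

def prune_dataset(fingerprints, uncertainty, prune_fraction=0.05, alpha=0.5):
    """Greedy pruning sketch (hypothetical, not the published method).

    fingerprints:   (n, d) array of structural fingerprint vectors
    uncertainty:    (n,) array of per-sample model uncertainty
    prune_fraction: fraction of samples to remove
    alpha:          assumed weight on structural redundancy vs. uncertainty
    Returns the sorted indices of the samples that are kept.
    """
    X = np.asarray(fingerprints, dtype=float)
    u = np.asarray(uncertainty, dtype=float)
    n = len(u)
    n_remove = min(int(round(prune_fraction * n)), max(n - 1, 0))

    # Rescale uncertainty to [0, 1] so it is comparable to the distance term.
    span = u.max() - u.min()
    u_norm = (u - u.min()) / span if span > 0 else np.zeros(n)

    # Pairwise Euclidean distances between fingerprints; ignore self-distance.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    scale = dist[np.isfinite(dist)].max() if n > 1 else 1.0
    if scale == 0:
        scale = 1.0

    keep = np.ones(n, dtype=bool)
    for _ in range(n_remove):
        idx = np.flatnonzero(keep)
        # Distance from each kept sample to its nearest kept neighbour.
        nearest = dist[np.ix_(idx, idx)].min(axis=1) / scale
        redundancy = 1.0 - nearest
        # Assumption: a redundant, low-uncertainty sample is the best one to drop.
        score = alpha * redundancy + (1.0 - alpha) * (1.0 - u_norm[idx])
        keep[idx[np.argmax(score)]] = False
    return np.flatnonzero(keep)

# Toy example: samples 0 and 1 are near-duplicates in fingerprint space.
toy_fingerprints = [[0.0, 0.0], [0.0, 0.01], [5.0, 5.0], [10.0, 0.0]]
toy_uncertainty = [0.1, 0.9, 0.5, 0.5]
kept = prune_dataset(toy_fingerprints, toy_uncertainty, prune_fraction=0.25)
print(kept)
```

Because the selection is greedy, each removal updates the nearest-neighbour distances of the remaining samples, so two near-duplicates are not both dropped in the same pass. An exact treatment of the discrete optimization problem described in the abstract would need the paper's actual objective.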

Published

2026-03-14

How to Cite

Wang, X., Chen, P., Tan, L., Lu, Y., & Zou, Q. (2026). FUSION: Dataset Pruning via Fusing Uncertainty with Structural Information for Optimal Neural Training in Crystal Property Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26553–26561. https://doi.org/10.1609/aaai.v40i31.39863

Section

AAAI Technical Track on Machine Learning VIII