UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Authors

  • Furui Xu, EPIC Lab, Shanghai Jiao Tong University; East China University of Science and Technology
  • Shaobo Wang, EPIC Lab, Shanghai Jiao Tong University; Alibaba Group
  • Jiajun Zhang, Beijing Jiaotong University
  • Chenghao Sun, Central South University
  • Haixiang Tang, University of Illinois at Urbana-Champaign
  • Linfeng Zhang, EPIC Lab, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i32.39938

Abstract

The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact yet informative coreset that achieves performance comparable to the full dataset. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. Because scoring models achieve near-optimal performance on their training data, such fitting-centric approaches produce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples with models that were not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Furthermore, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing only a limited perspective on sample importance. To address this limitation, we scale UNSEEN to multi-step scenarios, proposing an incremental selection technique that scores samples with models trained on varying coresets and dynamically optimizes the quality of the coreset. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
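The core idea of the abstract, scoring each sample only with models that never saw it during training, can be illustrated with a small k-fold sketch. The nearest-centroid "model" and distance-based hardness score below are illustrative stand-ins, not the paper's actual scoring model or selection rule:

```python
import numpy as np

def unseen_scores(X, y, k=5, seed=0):
    """Score every sample using only models that never trained on it.

    Toy illustration of generalization-based scoring: split the data into
    k folds; for each fold, fit a nearest-centroid classifier on the OTHER
    folds, then score the held-out (unseen) samples by their distance to
    their true-class centroid. Harder unseen samples get larger scores.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % k          # balanced random fold labels
    scores = np.empty(n)
    for f in range(k):
        held = folds == f
        # "Train": class centroids computed from data outside this fold.
        classes = np.unique(y[~held])
        cents = np.stack(
            [X[~held][y[~held] == c].mean(axis=0) for c in classes]
        )
        # "Score": distance from each unseen sample to its own class
        # centroid, as judged by a model that never trained on it.
        d = np.linalg.norm(X[held][:, None, :] - cents[None], axis=-1)
        true_idx = np.searchsorted(classes, y[held])
        scores[held] = d[np.arange(held.sum()), true_idx]
    return scores

def prune(X, y, keep=0.7, seed=0):
    """Keep the hardest `keep` fraction of samples by unseen score."""
    s = unseen_scores(X, y, seed=seed)
    m = int(keep * len(y))
    return np.argsort(-s)[:m]               # indices of the coreset
```

With `keep=0.7` this mirrors the abstract's 30% reduction setting: the coreset retains the 70% of samples that look hardest to models that never saw them, rather than ranking by near-saturated fitting-phase scores.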

Published

2026-03-14

How to Cite

Xu, F., Wang, S., Zhang, J., Sun, C., Tang, H., & Zhang, L. (2026). UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27224–27232. https://doi.org/10.1609/aaai.v40i32.39938

Section

AAAI Technical Track on Machine Learning IX