UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective

Authors

  • Furui Xu, EPIC Lab, Shanghai Jiao Tong University; East China University of Science and Technology
  • Shaobo Wang, EPIC Lab, Shanghai Jiao Tong University; Alibaba Group
  • Jiajun Zhang, Beijing Jiaotong University
  • Chenghao Sun, Central South University
  • Haixiang Tang, University of Illinois at Urbana-Champaign
  • Linfeng Zhang, EPIC Lab, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i32.39938

Abstract

The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact yet informative coreset that achieves performance comparable to the full dataset. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. Because scoring models achieve near-optimal performance on their training data, such fitting-centric approaches produce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples with models that were not exposed to them during training. We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. Furthermore, conventional score-based methods are single-step and rely on models trained solely on the complete dataset, providing only a limited perspective on sample importance. To address this limitation, we scale UNSEEN to multi-step scenarios, proposing an incremental selection technique that scores samples with models trained on varying coresets and dynamically optimizes the quality of the coreset. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K. Notably, on ImageNet-1K, UNSEEN achieves lossless performance while reducing training data by 30%.
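The core idea of the abstract, scoring each sample only with models that never saw it during training, can be illustrated with a small k-fold sketch. The nearest-centroid "model" and distance-based hardness score below are illustrative stand-ins, not the paper's actual scoring model or selection rule:

```python
import numpy as np

def unseen_scores(X, y, k=5, seed=0):
    """Score every sample using only models that never trained on it.

    Toy illustration of generalization-based scoring: split the data into
    k folds; for each fold, fit a nearest-centroid classifier on the OTHER
    folds, then score the held-out (unseen) samples by their distance to
    their true-class centroid. Harder unseen samples get larger scores.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % k          # balanced random fold labels
    scores = np.empty(n)
    for f in range(k):
        held = folds == f
        # "Train": class centroids computed from data outside this fold.
        classes = np.unique(y[~held])
        cents = np.stack(
            [X[~held][y[~held] == c].mean(axis=0) for c in classes]
        )
        # "Score": distance from each unseen sample to its own class
        # centroid, as judged by a model that never trained on it.
        d = np.linalg.norm(X[held][:, None, :] - cents[None], axis=-1)
        true_idx = np.searchsorted(classes, y[held])
        scores[held] = d[np.arange(held.sum()), true_idx]
    return scores

def prune(X, y, keep=0.7, seed=0):
    """Keep the hardest `keep` fraction of samples by unseen score."""
    s = unseen_scores(X, y, seed=seed)
    m = int(keep * len(y))
    return np.argsort(-s)[:m]               # indices of the coreset
```

With `keep=0.7` this mirrors the abstract's 30% reduction setting: the coreset retains the 70% of samples that look hardest to models that never saw them, rather than ranking by near-saturated fitting-phase scores.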

Published

2026-03-14

How to Cite

Xu, F., Wang, S., Zhang, J., Sun, C., Tang, H., & Zhang, L. (2026). UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27224–27232. https://doi.org/10.1609/aaai.v40i32.39938

Section

AAAI Technical Track on Machine Learning IX