Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

Yifan Wu; Jiyue Jiang; Xichen Ye; Yiqi Wang; Chang Zhou; Yitao Xu; Jiayang Chen; He Hu; Weizhong Zhang; Cheng Jin; Jiao Yuan; Yu Li

doi:10.1609/aaai.v40i2.37102

Authors

Yifan Wu (The Chinese University of Hong Kong; Fudan University)
Jiyue Jiang (The Chinese University of Hong Kong)
Xichen Ye (Fudan University)
Yiqi Wang (Fudan University)
Chang Zhou (The Chinese University of Hong Kong)
Yitao Xu (The Chinese University of Hong Kong)
Jiayang Chen (The Chinese University of Hong Kong)
He Hu (Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ))
Weizhong Zhang (Fudan University)
Cheng Jin (Fudan University)
Jiao Yuan (Guangzhou National Laboratory; Guangzhou Medical University)
Yu Li (The Chinese University of Hong Kong)

DOI: https://doi.org/10.1609/aaai.v40i2.37102

Abstract

Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach first introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost. Building on this, we propose two simple yet effective selection strategies: Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We then empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99%, demonstrating its effectiveness. Furthermore, we demonstrate the generalizability of our framework on protein-related tasks using ESM-C. Notably, our coreset even outperforms random subsets 10x larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
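The abstract names two selection strategies but gives no formulas for them. The sketch below is only an illustration of the general difference between a top-k rule and a coverage-oriented rule, assuming that a per-sample influence score has already been computed by some means. The function names `top_k_selection` and `coverage_selection`, the `num_bins` parameter, and the stratified-sampling scheme are all hypothetical stand-ins and should not be read as the paper's definitions of Top I or CCI.

```python
import random


def top_k_selection(scores, k):
    """Return the indices of the k samples with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def coverage_selection(scores, k, num_bins=4, seed=0):
    """Pick about k indices spread across the whole score range.

    Assumption: coverage is approximated by splitting the score-sorted
    samples into near-equal-size strata and sampling uniformly inside
    each one. This is a generic stratified scheme, not the paper's CCI.
    """
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    # Contiguous strata over the sorted order (low scores to high scores).
    bins = [order[b * n // num_bins:(b + 1) * n // num_bins]
            for b in range(num_bins)]

    selected = []
    remaining = k
    # Visit the smallest strata first so unused budget rolls over.
    for pos, members in enumerate(sorted(bins, key=len)):
        quota = remaining // (num_bins - pos)
        take = min(quota, len(members))
        selected.extend(rng.sample(members, take))
        remaining -= take
    return selected


if __name__ == "__main__":
    # Toy scores only; real influence scores would come from a model.
    toy_scores = [float(i) for i in range(20)]
    print(top_k_selection(toy_scores, 3))
    print(sorted(coverage_selection(toy_scores, 8)))
```

On the toy input, the top-k rule keeps only the highest-scoring samples, while the stratified rule keeps an equal share from each score stratum. That contrast is the only point of the sketch; how the paper scores samples and balances coverage is not stated in the abstract.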
