Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation

Authors

  • Jinda Xu, Shanghai Jiao Tong University
  • Yuhao Song, HAOMO.AI Technology Co., Ltd.
  • Daming Wang, HAOMO.AI Technology Co., Ltd.
  • Weiwei Zhao, Shanghai Jiao Tong University
  • Minghua Chen, HAOMO.AI Technology Co., Ltd.
  • Kangliang Chen, HAOMO.AI Technology Co., Ltd.
  • Qinya Li, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v39i20.35481

Abstract

Effective curation of web-crawled datasets is essential for optimizing model performance. This paper tackles the challenges posed by the unstructured and heterogeneous nature of such datasets. Traditional heuristic curation methods often fail to capture complex features, which introduces biases and excludes relevant data. We introduce EcoDatum (Ensemble Curation Of DAta ThroUgh Multimodal Operators), a learning-driven approach that employs a novel quality-guided deduplication method to balance feature distribution. EcoDatum integrates various unimodal and multimodal data curation operators within a weak supervision ensemble framework and uses automated optimization to score each data point. EcoDatum improves data curation quality and efficiency and outperforms existing state-of-the-art (SOTA) techniques, ranking 1st on the DataComp leaderboard with an average performance score of 0.182 across 38 diverse evaluation datasets. This is a 28% improvement over the DataComp baseline method, demonstrating its effectiveness in improving dataset curation and model training efficiency.
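
The abstract describes two ideas: scoring each data point by ensembling several curation operators, and deduplicating guided by quality. The sketch below illustrates both under stated assumptions and is not the authors' implementation. The abstract does not specify the operators, how their outputs are combined, or how duplicates are detected, so everything here is hypothetical: each operator is assumed to emit a vote (+1 keep, -1 drop, 0 abstain), votes are combined by a simple weighted average standing in for the weak supervision ensemble, duplicate groups are assumed to be precomputed, and the names `ensemble_score`, `quality_guided_dedup`, and the weights are illustrative only.

```python
def ensemble_score(votes, weights):
    """Weighted average of operator votes, ignoring abstentions.

    Hypothetical stand-in for a weak supervision label model:
    votes are +1 (keep), -1 (drop), or 0 (abstain).
    """
    active = sum(w for v, w in zip(votes, weights) if v != 0)
    if active == 0:
        return 0.0  # every operator abstained
    return sum(v * w for v, w in zip(votes, weights)) / active


def quality_guided_dedup(samples):
    """Keep only the highest-scoring sample in each duplicate group.

    Assumes each sample already carries a 'group' id (how duplicates
    are found is outside this sketch) and a 'score'.
    """
    best = {}
    for s in samples:
        g = s["group"]
        if g not in best or s["score"] > best[g]["score"]:
            best[g] = s
    return list(best.values())


# Toy data: three samples, three hypothetical operators.
samples = [
    {"id": "a", "group": 0, "votes": [1, 1, -1]},
    {"id": "b", "group": 0, "votes": [1, 1, 1]},
    {"id": "c", "group": 1, "votes": [-1, 0, -1]},
]
weights = [0.5, 0.3, 0.2]  # illustrative operator weights, not learned

for s in samples:
    s["score"] = ensemble_score(s["votes"], weights)

kept = quality_guided_dedup(samples)
# Final filter: keep samples whose ensemble score is positive.
curated = [s["id"] for s in kept if s["score"] > 0]
print(curated)
```

In this toy run, samples "a" and "b" are duplicates and only the higher-scoring "b" survives deduplication, while "c" is dropped by the score threshold. A real pipeline would replace the fixed weights with ones fitted by the ensemble framework.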

Published

2025-04-11

How to Cite

Xu, J., Song, Y., Wang, D., Zhao, W., Chen, M., Chen, K., & Li, Q. (2025). Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(20), 21761–21769. https://doi.org/10.1609/aaai.v39i20.35481

Section

AAAI Technical Track on Machine Learning VI