RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection

Authors

  • Yixin Yang Peking University
  • Qingxiu Dong Peking University
  • Linli Yao Peking University
  • Fangwei Zhu Peking University
  • Weilin Luo Huawei Noah’s Ark Lab
  • Bin Wang Huawei Noah’s Ark Lab
  • Zhifang Sui Peking University

DOI:

https://doi.org/10.1609/aaai.v40i40.40732

Abstract

Data selection for instruction tuning is crucial for improving the performance of large language models (LLMs) while reducing training costs. In this paper, we propose Refined Contribution Measurement with In-Context Learning (RICo), a novel gradient-free method that quantifies the fine-grained contribution of individual samples to both task-level and global-level model performance. RICo enables more accurate identification of high-contribution data, leading to better instruction tuning. We also introduce a lightweight selection paradigm trained on RICo scores, enabling scalable data selection with strictly linear inference complexity. Extensive experiments on 3 LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of RICo. Remarkably, on LLaMA3.1-8B, models trained on the 15% of data selected by RICo outperform models trained on the full dataset by 5.42 percentage points and exceed the best performance of widely used selection methods by 1.48 percentage points. We further analyze the high-contribution samples selected by RICo, which cover diverse tasks and have appropriate difficulty levels, rather than consisting merely of the most difficult cases.

Published

2026-03-14

How to Cite

Yang, Y., Dong, Q., Yao, L., Zhu, F., Luo, W., Wang, B., & Sui, Z. (2026). RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34349–34357. https://doi.org/10.1609/aaai.v40i40.40732

Section

AAAI Technical Track on Natural Language Processing V