Task Contamination: Language Models May Not Be Few-Shot Anymore

Authors

  • Changmao Li, University of California, Santa Cruz
  • Jeffrey Flanigan, University of California, Santa Cruz

DOI:

https://doi.org/10.1609/aaai.v38i16.29808

Keywords:

NLP: (Large) Language Models, NLP: Interpretability, Analysis, and Evaluation of NLP Models

Abstract

Large language models (LLMs) offer impressive performance on a variety of zero-shot and few-shot tasks. However, their success in zero-shot or few-shot settings may be affected by task contamination, a potential limitation that has not been thoroughly examined. This paper investigates how the zero-shot and few-shot performance of LLMs has changed chronologically, both across datasets released over time and across LLMs released over time. Utilizing GPT-3 series models and several other recent open-source LLMs, and controlling for dataset difficulty, we find that LLMs perform surprisingly better on datasets released before the LLM training data creation date than on datasets released after it. This strongly indicates that, for many LLMs, task contamination affects zero-shot and few-shot evaluation on datasets released prior to the LLMs' training data creation date. Additionally, we apply training data inspection, training data extraction, and a membership inference attack, which reveal further evidence of task contamination. Importantly, we find that for tasks with no possibility of task contamination, LLMs rarely demonstrate statistically significant improvements over simple majority baselines, in both zero- and few-shot settings.
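The chronological analysis described in the abstract can be sketched as follows: partition evaluation datasets by whether they were released before or after an LLM's training data creation date, compare accuracy across the two groups, and test each task's accuracy against its majority-class baseline. The snippet below is a minimal illustration of that comparison under assumed inputs, not the paper's evaluation code; the dataset names, dates, accuracies, and the cutoff date are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean

from scipy.stats import binomtest  # one-sided test vs. majority baseline


@dataclass
class EvalResult:
    name: str
    release_date: date        # dataset release date
    accuracy: float           # model zero-/few-shot accuracy on the dataset
    majority_accuracy: float  # majority-class baseline accuracy
    n_examples: int           # evaluation set size


# Hypothetical results; the paper's actual numbers are not reproduced here.
results = [
    EvalResult("pre_cutoff_task", date(2019, 6, 1), 0.78, 0.52, 500),
    EvalResult("post_cutoff_task", date(2022, 3, 1), 0.55, 0.53, 500),
]

# Assumed LLM training data creation date (model-specific in practice).
TRAINING_DATA_CUTOFF = date(2021, 9, 1)

pre = [r.accuracy for r in results if r.release_date < TRAINING_DATA_CUTOFF]
post = [r.accuracy for r in results if r.release_date >= TRAINING_DATA_CUTOFF]
print(f"mean accuracy, pre-cutoff datasets:  {mean(pre):.3f}")
print(f"mean accuracy, post-cutoff datasets: {mean(post):.3f}")

# For each dataset, test whether the model significantly beats the majority baseline.
for r in results:
    correct = round(r.accuracy * r.n_examples)
    test = binomtest(correct, r.n_examples, p=r.majority_accuracy, alternative="greater")
    print(f"{r.name}: p-value vs. majority baseline = {test.pvalue:.4f}")
```

Under this sketch, a markedly higher mean accuracy on pre-cutoff datasets, together with post-cutoff accuracies that do not significantly exceed the majority baseline, would be consistent with the task contamination effect the paper reports.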

Published

2024-03-24

How to Cite

Li, C., & Flanigan, J. (2024). Task Contamination: Language Models May Not Be Few-Shot Anymore. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 18471-18480. https://doi.org/10.1609/aaai.v38i16.29808

Section

AAAI Technical Track on Natural Language Processing I