Towards Understanding In-Context Learning of Transformers Under Non-I.I.D. Scenarios

Qilu Shen; Yingjie Wang; Jinhai Xiang

doi:10.1609/aaai.v40i30.39724

Authors

Qilu Shen: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, Hubei, China; College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China; Agricultural Bioinformatics Key Laboratory of Hubei Province, Wuhan, Hubei, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, Hubei, China
Yingjie Wang: College of Control Science and Engineering, China University of Petroleum (East China), Qingdao, China
Jinhai Xiang: Key Laboratory of Smart Farming for Agricultural Animals, Wuhan, Hubei, China; College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China; Agricultural Bioinformatics Key Laboratory of Hubei Province, Wuhan, Hubei, China; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, Wuhan, Hubei, China

DOI:

https://doi.org/10.1609/aaai.v40i30.39724

Abstract

Understanding the generalization behavior of in-context learning (ICL) in Transformers remains a fundamental challenge, as most existing theoretical analyses are based on the assumption that data are independently and identically distributed (i.i.d.), an assumption that often does not hold in practice. Motivated by the theoretical insight that ICL operates similarly to gradient-based optimization, we leverage the concept of gradient stability to establish generalization error bounds for ICL under a general non-i.i.d. setting. Our analysis shows that two factors play a central role in ICL generalization: the number of demonstrations in the prompt and their distributional alignment with the query. In particular, increasing the number of demonstrations and improving their alignment with the query distribution lead to better generalization, even without any parameter tuning. Under mild conditions, we further prove that the generalization error can achieve the optimal convergence rate of O(N^(-1/2)), where N is the number of demonstrations. Our empirical evaluations validate the effectiveness of our theoretical findings.
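
The abstract's premise that ICL operates similarly to gradient-based optimization can be illustrated with a small toy sketch. It is not the paper's construction. It assumes a noiseless linear regression task, softmax-free (linear) attention, and a single gradient step from zero initialization; the function names, dimensions, and learning rate are hypothetical choices made for illustration. Under those assumptions, the prediction from one gradient-descent step on the demonstrations coincides with a linear-attention readout in which the demonstrations act as keys and values.

```python
import numpy as np

def gd_one_step_prediction(X, y, x_query, lr):
    """Predict for x_query after one gradient step on least squares from w = 0."""
    n = X.shape[0]
    # The gradient of (1 / 2n) * ||X w - y||^2 at w = 0 is -(1 / n) * X^T y,
    # so one step with learning rate lr gives w = (lr / n) * X^T y.
    w = lr * (X.T @ y) / n
    return float(w @ x_query)

def linear_attention_prediction(X, y, x_query, lr):
    """Softmax-free attention readout: keys X, values y, query x_query."""
    n = X.shape[0]
    scores = X @ x_query  # unnormalized key-query inner products
    return float(lr * (scores @ y) / n)

# Illustrative data (assumed): N demonstrations from a noiseless linear model.
rng = np.random.default_rng(0)
d, n = 5, 64
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

p_gd = gd_one_step_prediction(X, y, x_q, lr=1.0)
p_attn = linear_attention_prediction(X, y, x_q, lr=1.0)
print(np.isclose(p_gd, p_attn))
```

The two functions compute the same quantity because (X^T y) . x_q equals y . (X x_q). The sketch illustrates only the ICL and gradient-descent correspondence; it does not reproduce the paper's non-i.i.d. generalization bounds.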
