Optimization and Robustness-Informed Membership Inference Attacks for LLMs
DOI:
https://doi.org/10.1609/aaai.v40i39.40587

Abstract
The proliferation of Large Language Models (LLMs) has raised concerns over training data privacy. Membership Inference Attacks (MIAs), which aim to identify whether a specific data point was used during training, pose significant privacy risks. However, existing MIA methods struggle with the scale and complexity of modern LLMs. This paper introduces OR-MIA, a novel MIA framework informed by model optimization and input robustness. First, training data points are expected to exhibit smaller gradient norms as a consequence of optimization dynamics. Second, member samples show greater stability: their gradient norms are less sensitive to controlled input perturbations. OR-MIA leverages these principles by perturbing inputs, computing gradient norms, and using the norms as features for a robust classifier that distinguishes members from non-members. Evaluations on LLMs ranging from 70M to 6B parameters across various datasets demonstrate that OR-MIA outperforms existing methods, achieving over 90% accuracy. Our findings highlight a critical vulnerability in LLMs and underscore the need for improved privacy-preserving training paradigms.

Published
2026-03-14
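The abstract's recipe — perturb an input, measure gradient norms, and use the norm and its stability as classifier features — can be illustrated with a toy stand-in. The sketch below uses a small linear model with an analytic gradient in place of an LLM; the function names, perturbation scale, and feature choice are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM: a fixed linear model with squared loss.
# For a sample (x, y), the gradient of the loss w.r.t. the weights
# is 2 * (w @ x - y) * x.
w = rng.normal(size=8)

def grad_norm(x, y):
    """L2 norm of the per-sample loss gradient w.r.t. the parameters."""
    g = 2.0 * (w @ x - y) * x
    return np.linalg.norm(g)

def ormia_features(x, y, n_perturb=16, eps=0.05):
    """Hypothetical OR-MIA-style features for one sample:
    the base gradient norm, plus the spread of gradient norms
    under small random input perturbations (a stability proxy)."""
    base = grad_norm(x, y)
    perturbed = [grad_norm(x + eps * rng.normal(size=x.shape), y)
                 for _ in range(n_perturb)]
    return np.array([base, np.std(perturbed)])

# "Member" sample: the model fits it almost exactly, so the residual
# (and hence the gradient norm) is small.
x_member = rng.normal(size=8)
y_member = w @ x_member + 0.01
# "Non-member" sample: large residual, large gradient norm.
x_non = rng.normal(size=8)
y_non = w @ x_non + 5.0

f_member = ormia_features(x_member, y_member)
f_non = ormia_features(x_non, y_non)
print("member features:", f_member)
print("non-member features:", f_non)
```

In this toy setting the member's base gradient norm comes out far smaller than the non-member's, mirroring the first principle in the abstract; in the actual attack these features would be computed on an LLM's loss gradients and fed to a trained classifier.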
How to Cite
Song, Z., Zhang, Q., Li, M., & Shu, Y. (2026). Optimization and Robustness-Informed Membership Inference Attacks for LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33047–33055. https://doi.org/10.1609/aaai.v40i39.40587
Section
AAAI Technical Track on Natural Language Processing IV