An Invariant Latent Space Perspective on Language Model Inversion
DOI:
https://doi.org/10.1609/aaai.v40i33.40004
Abstract
Language model inversion (LMI), i.e., recovering hidden prompts from model outputs, has emerged as a concrete threat to user privacy and system security. We recast LMI as reuse of the LLM's own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input↔output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can further refine local performance. Across 9 datasets covering user- and system-prompt scenarios, Inv2A outperforms baselines by an average of 4.77% in BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses offer limited protection, underscoring the need for stronger strategies.
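To make the first training stage concrete, the sketch below illustrates one plausible reading of contrastive alignment for source invariance: a lightweight inverse encoder maps sampled outputs into the LLM's latent space, and an InfoNCE-style loss pulls together outputs generated from the same hidden prompt. This is a minimal illustration under stated assumptions, not the paper's released implementation; all names and hyperparameters (InverseEncoder, d_model, tau) are hypothetical.

```python
# Minimal PyTorch sketch of stage 1 (contrastive alignment / source
# invariance), as described in the abstract. Class and hyperparameter
# names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseEncoder(nn.Module):
    """Lightweight encoder mapping output embeddings to a pseudo-representation."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, output_emb: torch.Tensor) -> torch.Tensor:
        # Project an output embedding into a shared latent space and
        # L2-normalize so cosine similarity is a dot product.
        return F.normalize(self.proj(output_emb), dim=-1)

def source_invariance_loss(z: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss. z has shape (B, K, D): K outputs per row were
    sampled from the same hidden prompt, so their representations should
    coincide (source invariance) while differing across prompts."""
    anchors, positives = z[:, 0], z[:, 1]       # two views per prompt
    logits = anchors @ positives.T / tau        # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)  # diagonal = same source
    return F.cross_entropy(logits, labels)

# Toy usage: 4 hidden prompts x 2 sampled outputs, 768-dim embeddings.
enc = InverseEncoder()
emb = torch.randn(4, 2, 768)
z = enc(emb.view(-1, 768)).view(4, 2, -1)
print(source_invariance_loss(z))
```

Because the LLM itself is treated as a frozen, invariant decoder, only the small encoder above would be trained, which is consistent with the abstract's claim of reduced dependence on large inverse corpora.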
Published
2026-03-14
How to Cite
Ye, W., Hu, J., Wang, H., Ti, X., Xiao, Z., Chen, H., … Zhao, J. (2026). An Invariant Latent Space Perspective on Language Model Inversion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 27818–27826. https://doi.org/10.1609/aaai.v40i33.40004
Section
AAAI Technical Track on Machine Learning X