An Invariant Latent Space Perspective on Language Model Inversion

Authors

  • Wentao Ye, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Jiaqi Hu, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Haobo Wang, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Xinpeng Ti, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
  • Zhiqing Xiao, Zhejiang University
  • Hao Chen, Zhejiang University
  • Liyao Li, Zhejiang University
  • Lei Feng, Southeast University
  • Sai Wu, Zhejiang University
  • Junbo Zhao, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v40i33.40004

Abstract

Language model inversion (LMI), i.e., recovering hidden prompts from outputs, has emerged as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM's own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input↔output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can refine local performance. Across 9 datasets covering user- and system-prompt scenarios, Inv2A outperforms baselines by an average of 4.77% BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses provide limited protection, underscoring the need for stronger strategies.
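To make the source-invariance idea in the abstract concrete, the sketch below illustrates a toy inverse encoder trained with a contrastive objective: outputs sampled from the same hidden prompt are pulled toward one pseudo-representation, while outputs from other prompts are pushed away. This is an illustrative assumption, not the authors' Inv2A implementation; the dimensions, the linear encoder, the mean pooling (standing in for the paper's sparse concatenation), and the InfoNCE-style loss are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OUT, D_LAT = 16, 8  # hypothetical output-embedding / latent sizes

# Hypothetical inverse encoder: a single linear map from output
# embeddings into a shared latent space (the real encoder is learned).
W = rng.normal(scale=0.1, size=(D_OUT, D_LAT))

def encode(outputs: np.ndarray) -> np.ndarray:
    """Map output embeddings of shape (n, D_OUT) to one pooled
    pseudo-representation of shape (D_LAT,). Mean pooling stands in
    for the paper's sparse concatenation of multiple outputs."""
    return (outputs @ W).mean(axis=0)

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive alignment (source invariance): representations of
    outputs from the same prompt score high, others score low."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([sim(anchor, positive)]
                      + [sim(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy; positive is at index 0

# Two output samples from the same prompt vs. one from another prompt.
same_a = rng.normal(size=(3, D_OUT))
same_b = same_a + rng.normal(scale=0.01, size=same_a.shape)
other = rng.normal(size=(3, D_OUT))

loss = info_nce(encode(same_a), encode(same_b), [encode(other)])
print(f"contrastive loss: {loss:.4f}")
```

In a full pipeline this loss would update the encoder's parameters; the paper's second stage (supervised reinforcement for cyclic invariance) would then check that decoding the pseudo-representation reproduces the original prompt.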

Published

2026-03-14

How to Cite

Ye, W., Hu, J., Wang, H., Ti, X., Xiao, Z., Chen, H., … Zhao, J. (2026). An Invariant Latent Space Perspective on Language Model Inversion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(33), 27818–27826. https://doi.org/10.1609/aaai.v40i33.40004

Section

AAAI Technical Track on Machine Learning X