Steering Representations, Safeguarding Privacy: A Cross-Modal Privacy Protection Method for Generative AI

Authors

  • Jie Zhang (Shanghai Artificial Intelligence Laboratory)
  • Chenxu Niu (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences)
  • Zhefeng Nan (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences)
  • Yangyan Xu (HiThink Research)
  • Jinta Weng (School of Cyber Security, University of Chinese Academy of Sciences)

DOI:

https://doi.org/10.1609/aaai.v40i41.40773

Abstract

Privacy has long been a critical concern for AI models. With the rapid advancement of generative AI, the privacy awareness of the models themselves has drawn attention, raising new challenges for privacy protection that is independent of specific data and tasks. This paper introduces a novel framework that enhances privacy protection by steering models along a direction in representation space, and that integrates seamlessly with both language models and vision-language models. Specifically, we first construct a comprehensive privacy-related dataset based on Solove's taxonomy of privacy. We then use this dataset to strengthen the model's privacy awareness in representation space, steering it toward protecting privacy during inference. Experiments on 12 models validate the effectiveness and generalizability of our method. Moreover, we demonstrate that privacy-enhanced representations transfer between large language models (LLMs) and vision-language models (VLMs) derived from the same source model, offering a scalable solution for privacy protection in frontier AI models.
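The abstract does not say how the steering direction is computed. As a hedged illustration only, the sketch below uses a common generic technique, difference-of-means activation steering: it estimates a unit direction from the mean hidden states of privacy-related versus neutral inputs, then shifts a hidden state along that direction at inference time (h' = h + alpha * d). The function names, the toy 3-dimensional activations, and the scale `alpha` are hypothetical assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty set of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def steering_direction(privacy_acts: Sequence[Sequence[float]],
                       neutral_acts: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical helper: unit-norm difference of means between two activation sets."""
    mu_p = mean(privacy_acts)
    mu_n = mean(neutral_acts)
    diff = [p - q for p, q in zip(mu_p, mu_n)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Shift a hidden state along the direction: h' = h + alpha * d."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of a hidden state onto a unit direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy activations (hypothetical): privacy-related inputs differ from neutral ones on axis 0.
privacy_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neutral_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

d = steering_direction(privacy_acts, neutral_acts)   # [1.0, 0.0, 0.0]
h = [0.5, 1.0, -1.0]
h_steered = steer(h, d, alpha=4.0)                   # [4.5, 1.0, -1.0]
print(d, h_steered, projection(h, d), projection(h_steered, d))
```

Because `d` has unit norm, steering raises the hidden state's projection onto the privacy direction by exactly `alpha` and leaves orthogonal components unchanged. In a real model the activations would typically be read from, and the shift applied to, a chosen layer during generation; which layer and scale to use are assumptions this sketch leaves open.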

Published

2026-03-14

How to Cite

Zhang, J., Niu, C., Nan, Z., Xu, Y., & Weng, J. (2026). Steering Representations, Safeguarding Privacy: A Cross-Modal Privacy Protection Method for Generative AI. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34719–34727. https://doi.org/10.1609/aaai.v40i41.40773

Issue

Vol. 40 No. 41 (2026)

Section

AAAI Technical Track on Natural Language Processing VI