Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)

Authors

  • Austin L. Davis, University of Central Florida
  • Gita Sukthankar, University of Central Florida

DOI:

https://doi.org/10.1609/aaai.v39i28.35245

Abstract

Probing classifiers are a technique for understanding and modifying the operation of neural networks: a smaller classifier is trained on the model's internal representations to perform a related probing task. Much like a neural electrode array, probing classifiers let researchers both read out and edit the internal representation of a neural network. This paper evaluates the use of probing classifiers to modify the internal hidden state of a chess-playing transformer. We demonstrate that the intervention vector's scale should decay as a negative exponential of the input length to ensure model outputs remain semantically valid after editing the residual stream activations.
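The length-dependent scaling described above can be illustrated with a minimal sketch. The function names, the decay constant, and the use of a normalized probe direction are all illustrative assumptions, not the paper's actual implementation; the sketch only shows the general shape of the intervention: add a probe-derived direction to a residual-stream activation, with magnitude decaying exponentially in the input length.

```python
import numpy as np

def edit_residual_stream(hidden, direction, seq_len, base_scale=1.0, decay=0.05):
    """Add a probe-derived intervention vector to a residual-stream activation.

    The edit magnitude follows base_scale * exp(-decay * seq_len), so longer
    inputs receive proportionally smaller edits. All parameter names and
    default values here are hypothetical.
    """
    # Normalize the probe direction so base_scale alone controls magnitude.
    unit = direction / np.linalg.norm(direction)
    # Negative-exponential scaling in the input length.
    alpha = base_scale * np.exp(-decay * seq_len)
    return hidden + alpha * unit
```

A usage sketch: for a fixed hidden state and probe direction, the edit applied to a 100-token input is strictly smaller in norm than the edit applied to a 10-token input, which is the property the abstract argues preserves semantic validity of the model's outputs.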

Published

2025-04-11

How to Cite

Davis, A. L., & Sukthankar, G. (2025). Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29343-29344. https://doi.org/10.1609/aaai.v39i28.35245