Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract)
DOI:
https://doi.org/10.1609/aaai.v39i28.35245
Abstract
Probing classifiers are a technique for understanding and modifying the operation of neural networks in which a smaller classifier is trained to use the model's internal representation to learn a related probing task. Similar to a neural electrode array, training probing classifiers can help researchers both discern and edit the internal representation of a neural network. This paper presents an evaluation of the use of probing classifiers to modify the internal hidden state of a chess-playing transformer. We demonstrate that intervention vector scaling should follow a negative exponential according to the length of the input to ensure model outputs remain semantically valid after editing the residual stream activations.
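The scaling rule described in the abstract can be illustrated with a minimal sketch. It assumes the edit is applied by adding a unit-normalized probe direction to a residual-stream activation, with a coefficient that decays as a negative exponential of the input length. The function names, the `base_scale` and `decay` parameters, and their default values are hypothetical choices for illustration only; the abstract does not give the authors' functional form or constants.

```python
import numpy as np


def intervention_scale(seq_len, base_scale=1.0, decay=0.05):
    """Hypothetical schedule: the coefficient shrinks as exp(-decay * seq_len).

    base_scale and decay are illustrative assumptions, not published values.
    """
    return base_scale * np.exp(-decay * seq_len)


def edit_residual(hidden, probe_direction, seq_len, base_scale=1.0, decay=0.05):
    """Add a length-scaled intervention vector to one residual-stream activation.

    hidden:          activation vector at the edited position, shape (d_model,)
    probe_direction: direction taken from a trained probe, shape (d_model,)
    seq_len:         number of input tokens (e.g. moves played so far)
    """
    # Normalize so the coefficient alone controls the size of the edit.
    unit = probe_direction / np.linalg.norm(probe_direction)
    return hidden + intervention_scale(seq_len, base_scale, decay) * unit


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=8)
    direction = rng.normal(size=8)
    for n in (5, 20, 80):
        edited = edit_residual(hidden, direction, seq_len=n)
        # The norm of the applied change equals the schedule value for that length.
        print(n, np.linalg.norm(edited - hidden))
```

Under these assumptions, longer inputs receive smaller edits, which matches the abstract's claim that the intervention strength should fall off with input length to keep outputs semantically valid.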
Published
2025-04-11
How to Cite
Davis, A. L., & Sukthankar, G. (2025). Scaling Effects on Latent Representation Edits in GPT Models (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29343-29344. https://doi.org/10.1609/aaai.v39i28.35245
Section
AAAI Student Abstract and Poster Program