Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models
DOI: https://doi.org/10.1609/aaai.v40i39.40604
Abstract
Large language models (LLMs) suffer from a lack of decision-making transparency, limiting their deployment in high-stakes domains such as healthcare. We propose a mechanistic interpretability framework that introduces two novel paradigms: Medical Fine-Tuning with Frozen Attention Layers (FTFA) and Posterior Adaptation Transcoders (PAT). FTFA freezes attention layers while fine-tuning only feed-forward network (FFN) parameters, enabling PAT to efficiently adapt pre-trained transcoders on the same data. This approach achieves over 1000× efficiency improvement compared to training transcoders from scratch. We theoretically justify this methodology and demonstrate its cost-effectiveness for cross-domain transfer. Transcoders are sparse autoencoders that replace MLP layers to provide interpretable feature representations. By substituting MLP layers of both base Gemma2-2b and its medical fine-tuned variant with per-layer transcoders, we enable feature-level attribution analysis. Through systematic pruning and node merging of resulting attribution graphs, we construct human-interpretable decision pathways. Our analysis reveals that LLMs employ two parallel mechanisms for medical diagnosis: pattern matching and multi-hop reasoning, with fine-tuned models demonstrating enhanced correct reasoning patterns. This work provides a practical framework for training transcoders on fine-tuned models at minimal cost, enabling broader application of mechanistic interpretability across domains and potentially guiding model training through transcoder-based analysis.
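To make the two ingredients of the abstract concrete, the following is a minimal PyTorch sketch, not the authors' code: `ToyBlock`, `apply_ftfa`, and `Transcoder` are illustrative names, and the layer sizes are arbitrary. It shows (a) FTFA-style freezing, where attention parameters stop receiving gradients while FFN parameters remain trainable, and (b) a transcoder as a sparse autoencoder that could stand in for an MLP layer, with a wide ReLU bottleneck producing non-negative, sparsely activating features.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    # Minimal stand-in for one transformer layer: attention + FFN (MLP).
    def __init__(self, d=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=2, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

def apply_ftfa(model: nn.Module):
    """FTFA-style split: freeze attention parameters, keep FFN trainable."""
    trainable = []
    for name, p in model.named_parameters():
        if "attn" in name:
            p.requires_grad = False   # frozen: no gradient updates
        else:
            trainable.append(p)       # only these go to the optimizer
    return trainable

class Transcoder(nn.Module):
    # Sparse-autoencoder replacement for an MLP: a wide encoder + ReLU yields
    # sparse, interpretable feature activations; the decoder maps back to the
    # residual-stream dimension. Training would minimize
    # ||Transcoder(x) - MLP(x)||^2 + lambda * ||f||_1  (reconstruction + sparsity).
    def __init__(self, d=8, n_features=64):
        super().__init__()
        self.enc = nn.Linear(d, n_features)
        self.dec = nn.Linear(n_features, d)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # non-negative sparse features
        return self.dec(f), f

block = ToyBlock()
trainable = apply_ftfa(block)         # pass `trainable` to the optimizer
tc = Transcoder()
y, f = tc(torch.zeros(2, 8))          # y: MLP-shaped output, f: feature activations
```

PAT would then adapt a pre-trained `Transcoder` on the same fine-tuning data rather than training one from scratch, which is where the reported efficiency gain comes from; the per-layer feature activations `f` are what the attribution graphs are built over.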
Published
2026-03-14
How to Cite
Tan, Z., Xue, H., Wan, Y., Xiong, R., Chu, X., Li, X., & Liu, J. (2026). Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33196–33204. https://doi.org/10.1609/aaai.v40i39.40604
Section
AAAI Technical Track on Natural Language Processing IV