Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models

Authors

  • Zhouxing Tan National Engineering Research Center for Software Engineering, Peking University
  • Hanlin Xue School of Software and Microelectronics, Peking University
  • Yulong Wan National Engineering Research Center for Software Engineering, Peking University
  • Ruochong Xiong National Engineering Research Center for Software Engineering, Peking University
  • Xu Chu School of Software and Microelectronics, Peking University
  • Xiang Li College of Electrical and Information Engineering, Northeast Agricultural University
  • Junfei Liu National Engineering Research Center for Software Engineering, Peking University

DOI:

https://doi.org/10.1609/aaai.v40i39.40604

Abstract

Large language models (LLMs) lack decision-making transparency, which limits their deployment in high-stakes domains such as healthcare. We propose a mechanistic interpretability framework built on two novel paradigms: Medical Fine-Tuning with Frozen Attention Layers (FTFA) and Posterior Adaptation Transcoders (PAT). Transcoders are sparse autoencoders that replace MLP layers to provide interpretable feature representations. FTFA freezes attention layers and fine-tunes only the feed-forward network (FFN) parameters, which allows PAT to adapt pre-trained transcoders efficiently on the same data; this approach achieves over a 1000× efficiency improvement compared with training transcoders from scratch. We theoretically justify the methodology and demonstrate its cost-effectiveness for cross-domain transfer. By substituting the MLP layers of both the base Gemma2-2b model and its medical fine-tuned variant with per-layer transcoders, we enable feature-level attribution analysis. Through systematic pruning and node merging of the resulting attribution graphs, we construct human-interpretable decision pathways. Our analysis reveals that LLMs employ two parallel mechanisms for medical diagnosis, pattern matching and multi-hop reasoning, with fine-tuned models exhibiting strengthened correct reasoning patterns. This work provides a practical framework for training transcoders on fine-tuned models at minimal cost, enabling broader application of mechanistic interpretability across domains and potentially guiding model training through transcoder-based analysis.
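To make the two paradigms concrete, the PyTorch sketch below illustrates their core ideas under stated assumptions: an FTFA-style freeze that leaves attention weights fixed so fine-tuning updates only FFN parameters, and a minimal per-layer transcoder, i.e., a sparse autoencoder trained to imitate an MLP layer's input-output map. The function and class names, the "attn" name-matching heuristic, and the sparsity coefficient are illustrative assumptions, not the authors' released code.

    # Illustrative sketch only -- not the paper's implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def freeze_attention(model: nn.Module) -> None:
        """FTFA-style freezing: keep attention weights fixed so that
        fine-tuning updates only the feed-forward (FFN) parameters."""
        for name, param in model.named_parameters():
            # Heuristic name match; actual parameter names depend on the model.
            if "attn" in name or "attention" in name:
                param.requires_grad = False

    class Transcoder(nn.Module):
        """A minimal per-layer transcoder: a sparse autoencoder trained to
        map an MLP layer's input to that layer's output through an
        overcomplete, sparsely activated feature dictionary."""
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, x: torch.Tensor):
            feats = F.relu(self.encoder(x))  # sparse, interpretable features
            return self.decoder(feats), feats

    def transcoder_loss(pred, mlp_out, feats, l1_coef=1e-3):
        """Reconstruct the MLP output, with an L1 penalty encouraging sparsity."""
        return F.mse_loss(pred, mlp_out) + l1_coef * feats.abs().mean()

Because FTFA holds the attention pathway fixed, the distribution of inputs reaching each MLP layer shifts far less than it would under full fine-tuning; PAT can therefore plausibly warm-start from the base model's pre-trained transcoders and continue training them briefly on the fine-tuned model's MLP input-output pairs rather than from scratch, which is the kind of warm start from which the reported over-1000× cost saving would follow.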

Published

2026-03-14

How to Cite

Tan, Z., Xue, H., Wan, Y., Xiong, R., Chu, X., Li, X., & Liu, J. (2026). Efficient Transcoder Adaptation for Fine-Tuned Models: Revealing Medical Reasoning Mechanisms in Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33196–33204. https://doi.org/10.1609/aaai.v40i39.40604

Section

AAAI Technical Track on Natural Language Processing IV