Query-Routed Activation Editing with Truth-hierarchical Preference Optimization

Authors

  • Kewei Liao (School of Computer Science and Engineering, Beihang University; State Key Laboratory of Complex & Critical Software Environment)
  • Tianbo Wang (School of Computer Science and Engineering, Beihang University; State Key Laboratory of Complex & Critical Software Environment)
  • Yuqing Ma (Institute of Artificial Intelligence, Beihang University; State Key Laboratory of Complex & Critical Software Environment)
  • Zhange Zhang (Institute of Artificial Intelligence, Beihang University; State Key Laboratory of Complex & Critical Software Environment)
  • Zhicheng Geng (Institute of Artificial Intelligence, Beihang University; State Key Laboratory of Complex & Critical Software Environment)
  • Xiaowei Zhao (Zhongguancun Laboratory)
  • Jiakai Wang (Zhongguancun Laboratory; State Key Laboratory of Complex & Critical Software Environment)
  • Xianglong Liu (School of Computer Science and Engineering, Beihang University; State Key Laboratory of Complex & Critical Software Environment)

DOI:

https://doi.org/10.1609/aaai.v40i38.40468

Abstract

Hallucination, in which Large Language Models (LLMs) generate plausible yet non-factual content, has emerged as a pivotal challenge that significantly impedes trustworthy AI applications in real-world scenarios such as medical diagnosis and autonomous driving. Editing the internal activations of LLMs during inference has proven promising for mitigating hallucinations at minimal cost. However, previous editing approaches neglect query-specific inference pathways, which require tailored truthful steering vectors, resulting in suboptimal hallucination mitigation. To address this issue, we propose the Query-Routed Activation Editing (QRAE) framework, which comprises Divergence-sensitive Head Routing (DHR) and Truth-hierarchical Preference Steering (TPS), to fully leverage query-specific semantics for adaptive activation editing. Specifically, DHR establishes a query-aware head selection criterion, thereby dynamically routing to truth-critical attention heads. Subsequently, TPS introduces a query-specific steering vector calibration policy guided by progressive truth-preferred optimization, enabling precise and adaptive editing for each distinct query. Extensive experiments on the widely recognized TruthfulQA benchmark demonstrate that QRAE outperforms state-of-the-art (SOTA) methods by up to 13.2% in MC1. QRAE also generalizes well to the out-of-distribution TriviaQA and Natural Questions benchmarks.
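
As a rough illustration of the general idea of query-routed activation editing (not the authors' implementation), the sketch below selects the top-k attention heads by a per-query score and shifts only those heads' activations along a steering vector. The abstract does not specify how the divergence score or the steering vectors are computed, so the names route_heads, edit_activations, scores, steer, k, and alpha are all hypothetical, and the inputs are assumed to be given.

```python
from typing import List, Sequence


def route_heads(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k heads with the highest per-query scores.

    `scores` is an assumed per-query relevance score for each head; how such
    a score would be derived is not described in the abstract.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def edit_activations(
    acts: Sequence[Sequence[float]],
    steer: Sequence[Sequence[float]],
    selected: Sequence[int],
    alpha: float,
) -> List[List[float]]:
    """Add alpha * steering vector to the selected heads only.

    `acts[i]` and `steer[i]` are the activation and (assumed, precomputed)
    steering vector for head i. Unselected heads are copied unchanged.
    """
    edited = [list(head) for head in acts]
    for i in selected:
        edited[i] = [a + alpha * s for a, s in zip(acts[i], steer[i])]
    return edited


# Toy usage: four heads with two-dimensional activations.
scores = [0.1, 0.9, 0.4, 0.7]
acts = [[1.0, 1.0] for _ in range(4)]
steer = [[0.5, -0.5] for _ in range(4)]

selected = route_heads(scores, k=2)
edited = edit_activations(acts, steer, selected, alpha=2.0)
```

Because the selected heads depend on the per-query scores, two different queries can edit different subsets of heads, which is the contrast with a fixed, query-independent head set that the abstract draws.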

Published

2026-03-14

How to Cite

Liao, K., Wang, T., Ma, Y., Zhang, Z., Geng, Z., Zhao, X., Wang, J., & Liu, X. (2026). Query-Routed Activation Editing with Truth-hierarchical Preference Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 31979-31987. https://doi.org/10.1609/aaai.v40i38.40468

Issue

Vol. 40 No. 38 (2026)

Section

AAAI Technical Track on Natural Language Processing III