Query-Routed Activation Editing with Truth-hierarchical Preference Optimization

Kewei Liao; Tianbo Wang; Yuqing Ma; Zhange Zhang; Zhicheng Geng; Xiaowei Zhao; Jiakai Wang; Xianglong Liu

doi:10.1609/aaai.v40i38.40468

Authors

Kewei Liao, School of Computer Science and Engineering, Beihang University; State Key Laboratory of Complex & Critical Software Environment
Tianbo Wang, School of Computer Science and Engineering, Beihang University; State Key Laboratory of Complex & Critical Software Environment
Yuqing Ma, Institute of Artificial Intelligence, Beihang University; State Key Laboratory of Complex & Critical Software Environment
Zhange Zhang, Institute of Artificial Intelligence, Beihang University; State Key Laboratory of Complex & Critical Software Environment
Zhicheng Geng, Institute of Artificial Intelligence, Beihang University; State Key Laboratory of Complex & Critical Software Environment
Xiaowei Zhao, Zhongguancun Laboratory
Jiakai Wang, Zhongguancun Laboratory; State Key Laboratory of Complex & Critical Software Environment
Xianglong Liu, School of Computer Science and Engineering, Beihang University; State Key Laboratory of Complex & Critical Software Environment

Abstract

Hallucination, in which Large Language Models (LLMs) generate plausible yet non-factual content, has emerged as a pivotal challenge that significantly impedes trustworthy AI applications in real-world scenarios such as medical diagnosis and autonomous driving. Editing the internal activations of LLMs during inference has proven effective at mitigating hallucinations at minimal cost. However, previous editing approaches neglect query-specific inference pathways, which require tailored truthful steering vectors, and therefore mitigate hallucinations suboptimally. To address this issue, we propose the Query-Routed Activation Editing (QRAE) framework, which comprises Divergence-sensitive Head Routing (DHR) and Truth-hierarchical Preference Steering (TPS) and fully leverages query-specific semantics for adaptive activation editing. Specifically, DHR establishes a query-aware head selection criterion, thereby dynamically routing to truth-critical attention heads. Subsequently, TPS introduces a query-specific steering vector calibration policy guided by progressive truth-preferred optimization, enabling precise and adaptive editing for each distinct query. Extensive experiments on the widely recognized TruthfulQA benchmark demonstrate that QRAE outperforms state-of-the-art (SOTA) methods by up to 13.2% in MC1. QRAE also generalizes well to the out-of-distribution TriviaQA and Natural Questions benchmarks.
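The abstract does not specify how DHR scores heads or how TPS calibrates steering vectors, so the following is only a minimal sketch of the general idea of query-dependent activation editing, not the authors' method. It assumes a per-head relevance score for the current query is already available, picks the top-k heads, and shifts only those heads' activations along a per-head steering vector. The function names, the `scores` input, and the `k` and `alpha` parameters are all hypothetical.

```python
def route_heads(scores, k):
    """Return indices of the k heads with the highest query-specific scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def edit_activations(head_acts, steering, scores, k=2, alpha=1.0):
    """Shift the top-k heads' activations along their steering vectors.

    head_acts: per-head activation vectors (lists of floats)
    steering:  per-head steering vectors, same shapes as head_acts
    scores:    per-head relevance scores for the current query
               (assumed given; how they are computed is not covered here)
    alpha:     editing strength (hypothetical parameter)
    """
    selected = set(route_heads(scores, k))
    edited = []
    for i, (act, vec) in enumerate(zip(head_acts, steering)):
        if i in selected:
            # Selected head: move the activation along its steering direction.
            edited.append([a + alpha * v for a, v in zip(act, vec)])
        else:
            # Unselected head: leave the activation untouched.
            edited.append(list(act))
    return edited


# Toy example with three heads of dimension two.
head_acts = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
steering = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 0.9, 0.5]
edited = edit_activations(head_acts, steering, scores, k=2, alpha=0.5)
```

In this toy example only the two highest-scoring heads are edited. A different query would produce different scores and therefore a different set of edited heads, which is the sense in which the editing is query-routed.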
