WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety

Authors

  • Shilong Pan National University of Defense Technology
  • Zhiliang Tian National University of Defense Technology
  • Wanlong Yu National University of Defense Technology
  • Zhen Huang National University of Defense Technology
  • Qingyu Qiu National University of Defense Technology
  • Zihan Chen National University of Defense Technology
  • Zhonghao Sun National University of Defense Technology
  • Minlie Huang Tsinghua University
  • Dongsheng Li National University of Defense Technology

DOI:

https://doi.org/10.1609/aaai.v40i38.40543

Abstract

Large language models (LLMs) may generate harmful outputs when given malicious inputs. Existing safety methods, including prompt engineering and model editing, rely on hand-crafted templates or target-driven parameter modifications, which limits their generalizability to unseen harmful scenarios. Post-training aims to ensure LLM safety in general domains via supervised fine-tuning (SFT) or reinforcement learning (RL) on diverse malicious inputs. SFT requires annotated refusal samples, while RL learns to refuse risky requests by exploring diverse harmful inputs. However, these methods tend to refuse outright at any possible risk, sacrificing potentially useful information and degrading model utility. We argue that realistic malicious inputs often mix harmful and helpful semantics (i.e., entities and relations), and that LLMs should identify and remove only the harmful relations while preserving the useful ones. In this way, original malicious user inputs can be shifted into safe queries to which LLMs can respond safely and helpfully. In this paper, we propose WALKSAFE, a graph-based risk-aware training framework that enables LLMs to identify the potential risks of key semantics (entities and relations) in user inputs via a graph structure. By filtering out harmful relations, LLMs can respond to the resulting safe queries and generate safe, helpful responses. First, we model all entities and relations in the inputs as a graph. Second, we adopt a risk-aware random walk on the graph to quantify potential risk across multiple entities and relations. Then, we reconstruct safe queries by filtering out harmful relations, encouraging the LLM to answer safely and helpfully rather than refusing outright. Finally, we propose Bi-GRPO to post-train LLMs. Whereas vanilla GRPO conducts only intra-group comparisons, Bi-GRPO performs both intra-group and inter-group comparisons across response groups. The extra inter-group rewards encourage the model to distinguish harmful from safe semantics and thus prefer safe and helpful responses. Experiments on three LLMs show that our models achieve state-of-the-art results.
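The abstract's pipeline (graph construction, risk-aware random walk, harmful-relation filtering) can be sketched in toy form. Everything below is a hypothetical illustration: the triples, the per-relation risk scores, and the uniform walk policy are placeholder assumptions, whereas the paper's actual method quantifies risk through a learned, risk-aware walk.

```python
import random

# Hypothetical toy graph: (head, relation, tail, risk) triples with an
# assumed per-relation risk score in [0, 1]. These scores are illustrative
# placeholders, not values from the paper.
EDGES = [
    ("fertilizer", "used_for", "farming", 0.05),
    ("fertilizer", "contains", "ammonium nitrate", 0.30),
    ("ammonium nitrate", "ingredient_of", "explosive", 0.95),
    ("farming", "improves", "crop yield", 0.02),
]

def random_walk_risk(edges, start, steps=3, n_walks=200, seed=0):
    """Estimate each relation's risk via uniform random walks from `start`
    (a simplification of the paper's risk-aware walk)."""
    rng = random.Random(seed)
    adj = {}
    for h, r, t, risk in edges:
        adj.setdefault(h, []).append((r, t, risk))
    hit_risk = {}
    for _ in range(n_walks):
        node, acc = start, 0.0
        for _ in range(steps):
            if node not in adj:
                break
            r, t, risk = rng.choice(adj[node])
            acc = max(acc, risk)  # track the riskiest relation seen on this path
            key = (node, r, t)
            hit_risk[key] = max(hit_risk.get(key, 0.0), acc)
            node = t
    return hit_risk

def filter_harmful(edges, hit_risk, threshold=0.5):
    """Drop relations whose walk-derived risk exceeds the threshold,
    keeping the helpful remainder for safe-query reconstruction."""
    harmful = {k for k, v in hit_risk.items() if v >= threshold}
    return [(h, r, t) for h, r, t, _ in edges if (h, r, t) not in harmful]

safe_relations = filter_harmful(EDGES, random_walk_risk(EDGES, "fertilizer"))
```

In this toy run, the walk flags the "ingredient_of explosive" relation as harmful and removes it, while the farming-related relations survive and can be used to reconstruct a safe, answerable query.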
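The Bi-GRPO idea, adding an inter-group comparison on top of GRPO's intra-group baseline, can also be illustrated with a toy advantage computation. The specific formulation below (a mean-difference inter-group term weighted by `beta`) is an assumption made for illustration only; the paper's actual objective is not specified in the abstract and may differ.

```python
import statistics

def bi_grpo_advantages(safe_rewards, harmful_rewards, beta=1.0):
    """Toy Bi-GRPO-style advantages (hypothetical formulation).

    Intra-group: each response's reward minus its own group's mean,
    as in vanilla GRPO. Inter-group: a shared bonus/penalty from the
    gap between the two groups' mean rewards, pushing the policy
    toward the safe-and-helpful response group.
    """
    mu_safe = statistics.mean(safe_rewards)
    mu_harm = statistics.mean(harmful_rewards)
    inter = beta * (mu_safe - mu_harm)  # inter-group comparison term
    adv_safe = [r - mu_safe + inter for r in safe_rewards]
    adv_harm = [r - mu_harm - inter for r in harmful_rewards]
    return adv_safe, adv_harm
```

With only the intra-group term, both groups are centered at zero and the model gets no signal that the safe group as a whole is preferable; the inter-group term shifts the safe group's advantages up and the harmful group's down.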

Published

2026-03-14

How to Cite

Pan, S., Tian, Z., Yu, W., Huang, Z., Qiu, Q., Chen, Z., … Li, D. (2026). WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32655–32663. https://doi.org/10.1609/aaai.v40i38.40543

Section

AAAI Technical Track on Natural Language Processing III