CALM: Curiosity-Driven Auditing for Large Language Models

Authors

  • Xiang Zheng, City University of Hong Kong
  • Longxiang Wang, City University of Hong Kong
  • Yi Liu, City University of Hong Kong
  • Xingjun Ma, Fudan University
  • Chao Shen, Xi’an Jiaotong University
  • Cong Wang, City University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v39i26.34991

Abstract

Auditing Large Language Models (LLMs) is a crucial yet challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to fine-tune an LLM as an auditor agent that uncovers potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names in a black-box setting. This work offers a promising direction for auditing black-box LLMs.
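For context, the loop below is a minimal Python sketch of the curiosity-driven auditing procedure the abstract describes; it is not the authors' released code. The auditor sampler, black-box target query, toxicity scorer, and sentence encoder (auditor_sample, target_llm, toxicity_score, embed) are hypothetical placeholders, and only the reward structure, an extrinsic reward for a toxic output elicited by a non-toxic input plus an intrinsic novelty bonus, follows the abstract's high-level description.

```python
import random
import numpy as np

rng = np.random.default_rng(0)
random.seed(0)

# --- Hypothetical stand-ins for the real components (assumptions) ---

def auditor_sample():
    # Stands in for sampling a prompt from the auditor LLM's policy.
    words = ["tell", "me", "about", "that", "famous", "actor"]
    return " ".join(random.choices(words, k=5))

def target_llm(prompt):
    # Stands in for a black-box query to the target LLM's service.
    return "stub response to: " + prompt

def toxicity_score(text):
    # Stands in for a toxicity classifier returning a score in [0, 1].
    return float(rng.random())

def embed(text):
    # Stands in for a pretrained sentence encoder.
    return rng.standard_normal(16)

# --- Curiosity-driven auditing loop sketched from the abstract ---

def curiosity_bonus(emb, memory, k=5):
    """Intrinsic reward: mean distance to the k nearest prompt
    embeddings seen so far, so novel prompts score higher."""
    if not memory:
        return 1.0
    dists = sorted(np.linalg.norm(emb - m) for m in memory)
    return float(np.mean(dists[:k]))

memory, flagged, beta = [], [], 0.1
for step in range(100):
    prompt = auditor_sample()
    response = target_llm(prompt)            # black-box query only
    # Extrinsic reward: a toxic output elicited by a non-toxic input.
    r_ext = toxicity_score(response) * (1.0 - toxicity_score(prompt))
    emb = embed(prompt)
    reward = r_ext + beta * curiosity_bonus(emb, memory)
    memory.append(emb)
    if r_ext > 0.8:                          # keep candidate audit findings
        flagged.append((prompt, response))
    # A real system would now apply an RL (policy-gradient) update to
    # the auditor LLM using `reward`; that step is omitted here.

print(f"flagged {len(flagged)} candidate input-output pairs")
```

The curiosity bonus addresses the scarcity of feasible points noted in the abstract: without it, a reward that is almost always near zero gives the auditor no learning signal, whereas novelty-seeking keeps the search moving through the large discrete prompt space.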

Published

2025-04-11

How to Cite

Zheng, X., Wang, L., Liu, Y., Ma, X., Shen, C., & Wang, C. (2025). CALM: Curiosity-Driven Auditing for Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27757-27764. https://doi.org/10.1609/aaai.v39i26.34991

Section

AAAI Technical Track on AI Alignment