A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

Authors

  • Xiangxiang Dai The Chinese University of Hong Kong
  • Yuejin Xie Huazhong University of Science and Technology
  • Maoli Liu The Chinese University of Hong Kong
  • Xuchuang Wang University of Massachusetts at Amherst
  • Zhuohua Li Guangzhou Institute of Technology, Xidian University; The Chinese University of Hong Kong
  • Huanyu Wang Huawei Technologies Ltd.
  • John C.S. Lui The Chinese University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i44.41064

Abstract

Prompt-based offline methods are commonly used to optimize large language model (LLM) responses, but evaluating these responses is computationally intensive and often fails to accommodate diverse response styles. This study introduces a novel online evaluation framework that employs a multi-agent conversational bandit model to select optimal responses while dynamically aligning with user preferences. To tackle challenges such as high-dimensional features, large response sets, adaptive conversational needs, and multi-device access, we propose MACO (Multi-Agent Conversational Online Learning), which comprises two key components: (1) MACO-A, executed by local agents, employs an online elimination mechanism to filter out low-quality responses; (2) MACO-S, executed by the cloud server, adaptively adjusts selection strategies based on aggregated preference data. An adaptive preference mechanism triggers asynchronous conversations to enhance alignment efficiency. Theoretical analysis demonstrates that MACO achieves near-optimal regret bounds, matching state-of-the-art performance in various degenerate cases. Extensive experiments on real-world datasets with different response styles, using Google and OpenAI text embedding models combined with Llama and GPT-4o, show that MACO consistently outperforms baseline methods by at least 8.29% across varying response set sizes and numbers of agents.
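To illustrate the local-agent and cloud-server split described in the abstract, the following is a minimal sketch of a generic successive-elimination bandit, not the MACO algorithm itself. It assumes Bernoulli user feedback, independent per-response statistics instead of the paper's feature-based model, a Hoeffding-style confidence radius, and no conversational queries. All names (`agent_collect`, `prune`, `run`) and parameters are hypothetical and chosen for illustration only.

```python
import math
import random


def agent_collect(active, true_means, pulls, rng):
    """Hypothetical local agent: gather simulated Bernoulli feedback
    for each surviving candidate response."""
    report = {}
    for a in active:
        wins = sum(rng.random() < true_means[a] for _ in range(pulls))
        report[a] = (pulls, wins)
    return report


def prune(active, counts, sums, delta=0.05):
    """Hypothetical server step: keep responses whose upper confidence
    bound reaches the best lower confidence bound."""
    def radius(a):
        # Hoeffding-style radius (an assumption, not the paper's bound).
        return math.sqrt(math.log(2 * len(active) / delta) / (2 * counts[a]))

    means = {a: sums[a] / counts[a] for a in active}
    best_lower = max(means[a] - radius(a) for a in active)
    return [a for a in active if means[a] + radius(a) >= best_lower]


def run(true_means, num_agents=4, rounds=200, pulls=5, seed=0):
    """Alternate local collection and server-side elimination."""
    rng = random.Random(seed)
    active = list(range(len(true_means)))
    counts = {a: 0 for a in active}
    sums = {a: 0.0 for a in active}
    for _ in range(rounds):
        reports = [agent_collect(active, true_means, pulls, rng)
                   for _ in range(num_agents)]
        # The server pools feedback from every agent before pruning.
        for report in reports:
            for a, (n, s) in report.items():
                counts[a] += n
                sums[a] += s
        active = prune(active, counts, sums)
        if len(active) == 1:
            break
    return active


if __name__ == "__main__":
    # Indices of the responses that survive elimination.
    print(run([0.2, 0.5, 0.9]))
```

In this sketch the agents only gather feedback and the server makes every elimination decision from pooled statistics, which loosely mirrors the division of labor the abstract attributes to MACO-A and MACO-S. The actual components, confidence bounds, and conversation mechanism are specified in the paper.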

Published

2026-03-14

How to Cite

Dai, X., Xie, Y., Liu, M., Wang, X., Li, Z., Wang, H., & Lui, J. C. (2026). A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37323–37331. https://doi.org/10.1609/aaai.v40i44.41064

Issue

Section

AAAI Special Track on AI Alignment