Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

Authors

  • Jianhao Chen (State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai Artificial Intelligence Laboratory)
  • Zishuo Xun (Shanghai Artificial Intelligence Laboratory; University of Auckland)
  • Bocheng Zhou (Shanghai Artificial Intelligence Laboratory)
  • Han Qi (Shanghai Artificial Intelligence Laboratory)
  • Hangfan Zhang (The Pennsylvania State University)
  • Qiaosheng Zhang (Shanghai Artificial Intelligence Laboratory)
  • Yang Chen (Shanghai Artificial Intelligence Laboratory)
  • Wei Hu (State Key Laboratory for Novel Software Technology, Nanjing University)
  • Yuzhong Qu (State Key Laboratory for Novel Software Technology, Nanjing University)
  • Shuyue Hu (Shanghai Artificial Intelligence Laboratory)

DOI:

https://doi.org/10.1609/aaai.v40i24.39094

Abstract

This paper presents a simple, effective, and cost-efficient strategy, named ModelSwitch, to improve LLM performance by scaling test-time compute. ModelSwitch builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using sample consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on seven datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, our strategy requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.

Published

2026-03-14

How to Cite

Chen, J., Xun, Z., Zhou, B., Qi, H., Zhang, H., Zhang, Q., Chen, Y., Hu, W., Qu, Y., & Hu, S. (2026). Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24), 20083-20091. https://doi.org/10.1609/aaai.v40i24.39094

Section

AAAI Technical Track on Machine Learning I