Small Models Exhibit Limited Answer Consistency in Repetition Trials of the Multiple-Choice MMLU-Redux and MedQA Benchmarks

Authors

  • Claudio Pinhanez IBM Research Brazil
  • Paulo Cavalin IBM Research Brazil
  • Cassia Sanctos IBM Research Brazil
  • Marcelo Carpinette Grave IBM Research Brazil

DOI:

https://doi.org/10.1609/aaai.v40i39.40550

Abstract

This work explores the consistency of LLMs when answering the same question multiple times. In particular, we study how well-known open-source LLMs respond to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small (2B-10B parameters) vs. medium (50B-80B) models, fine-tuned vs. base models, and other parameters. The paper also examines how requiring answer consistency across repeated inferences affects accuracy, and the trade-offs involved in deciding which model best provides both, for which we propose some new representations. Results show that the number of questions that can be answered consistently varies widely among models but is typically in the 50%-85% range for small models, and that accuracy among consistent answers correlates with overall accuracy at low inference temperatures. Results for medium-sized models seem to indicate much higher levels of answer consistency.
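The quantities named in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that a question counts as "consistently answered" when all repetitions return the same option; the abstract does not state the paper's exact criterion. The function name `consistency_report` and the input layout (a dict of per-question answer lists and a dict of gold answers) are hypothetical and not taken from the paper.

```python
from collections import Counter


def consistency_report(answers, gold):
    """Summarize repetition trials on a multiple-choice benchmark.

    answers: dict mapping question id -> list of chosen options, one per repetition
    gold:    dict mapping question id -> correct option
    Assumption: a question is "consistent" only if every repetition agrees.
    """
    consistent = {}      # question id -> the single option given in all repetitions
    correct_votes = 0    # number of individual repetitions that were correct
    total_votes = 0      # number of individual repetitions overall

    for qid, reps in answers.items():
        counts = Counter(reps)
        if len(counts) == 1:
            consistent[qid] = reps[0]
        correct_votes += counts[gold[qid]]  # Counter returns 0 for unseen options
        total_votes += len(reps)

    consistent_correct = sum(1 for qid, a in consistent.items() if a == gold[qid])
    return {
        "consistency_rate": len(consistent) / len(answers),
        "overall_accuracy": correct_votes / total_votes,
        "accuracy_among_consistent": (
            consistent_correct / len(consistent) if consistent else None
        ),
    }


# Toy data (made up for illustration, not results from the paper)
toy_answers = {
    "q1": ["A"] * 10,
    "q2": ["B"] * 9 + ["C"],
    "q3": ["D"] * 10,
    "q4": ["A"] * 10,
}
toy_gold = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}
report = consistency_report(toy_answers, toy_gold)
```

On the toy data, three of the four questions are answered identically in all 10 repetitions, and two of those three are correct. This shows how a consistent answer is not necessarily an accurate one.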

Published

2026-03-14

How to Cite

Pinhanez, C., Cavalin, P., Sanctos, C., & Grave, M. C. (2026). Small Models Exhibit Limited Answer Consistency in Repetition Trials of the Multiple-Choice MMLU-Redux and MedQA Benchmarks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 32719–32727. https://doi.org/10.1609/aaai.v40i39.40550

Section

AAAI Technical Track on Natural Language Processing IV