Small Models Exhibit Limited Answer Consistency in Repetition Trials of the Multiple-Choice MMLU-Redux and MedQA Benchmarks

Claudio Pinhanez; Paulo Cavalin; Cassia Sanctos; Marcelo Carpinette Grave

doi:10.1609/aaai.v40i39.40550

Authors

Claudio Pinhanez IBM Research Brazil
Paulo Cavalin IBM Research Brazil
Cassia Sanctos IBM Research Brazil
Marcelo Carpinette Grave IBM Research Brazil

DOI:

https://doi.org/10.1609/aaai.v40i39.40550

Abstract

This work explores the consistency of LLMs when answering the same question multiple times. In particular, we study how known open-source LLMs respond to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small (2B-10B parameters) vs. medium (50B-80B) models, finetuned vs. base models, and other parameters. The paper also examines how requiring answer consistency across repeated inferences affects accuracy, and the trade-offs involved in deciding which model best provides both, for which we propose some new representations. Results show that the number of questions that can be answered consistently varies widely among models but is typically in the 50%-85% range for small models, and that accuracy among consistent answers correlates with overall accuracy at low inference temperatures. Results for medium-sized models seem to indicate much higher levels of answer consistency.
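The quantities named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: a question counts as "consistently answered" only if every repetition selects the same option, and overall accuracy is taken over all individual responses. The function name `consistency_report` and the toy data are hypothetical, and the paper's own definitions may differ.

```python
def consistency_report(answers, gold):
    """Summarize repeated multiple-choice answers.

    answers: dict mapping question id -> list of chosen options, one per repetition
    gold:    dict mapping question id -> correct option

    Assumption (not taken from the paper): a question is "consistent"
    only when all of its repetitions chose the same option.
    """
    consistent = [q for q, reps in answers.items() if len(set(reps)) == 1]

    # Overall accuracy over every individual response.
    total = sum(len(reps) for reps in answers.values())
    correct = sum(r == gold[q] for q, reps in answers.items() for r in reps)

    # Accuracy restricted to the consistently answered questions.
    consistent_correct = sum(answers[q][0] == gold[q] for q in consistent)

    return {
        "consistency_rate": len(consistent) / len(answers),
        "overall_accuracy": correct / total,
        "accuracy_among_consistent": (
            consistent_correct / len(consistent) if consistent else None
        ),
    }


# Toy data: three questions, 10 repetitions each (illustrative only).
answers = {
    "q1": ["A"] * 10,          # consistent and correct
    "q2": ["B"] * 9 + ["C"],   # inconsistent
    "q3": ["D"] * 10,          # consistent but wrong
}
gold = {"q1": "A", "q2": "B", "q3": "C"}

report = consistency_report(answers, gold)
print(report)
```

Separating the consistency rate from the accuracy among consistent answers makes the trade-off mentioned in the abstract visible: a model can be consistent on many questions while still being consistently wrong on some of them.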

Published

2026-03-14

How to Cite

Pinhanez, C., Cavalin, P., Sanctos, C., & Grave, M. C. (2026). Small Models Exhibit Limited Answer Consistency in Repetition Trials of the Multiple-Choice MMLU-Redux and MedQA Benchmarks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 32719–32727. https://doi.org/10.1609/aaai.v40i39.40550

Issue

Vol. 40 No. 39: AAAI-26 Technical Tracks 39

Section

AAAI Technical Track on Natural Language Processing IV
