Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation

Authors

  • Thomas Jean-Michel Valentin, Ecole Normale Supérieure Paris-Saclay
  • Ardi Madadi, Max Planck Institute for Security and Privacy
  • Gaetano Sapia, Max Planck Institute for Security and Privacy
  • Marcel Böhme, Max Planck Institute for Security and Privacy

DOI:

https://doi.org/10.1609/aaai.v40i39.40616

Abstract

Generating code from a natural language programming task is one of the most successful applications of Large Language Models (LLMs). Yet, the generated program may be buggy. Without an oracle, such as an existing, correct implementation or a formal specification, can we somehow estimate how likely it is that the generated program is correct? In this paper, we propose a measure of incorrectness, called incoherence, that can be estimated efficiently in the absence of an oracle and allows us to establish a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. In our experiments, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives for the average task. In fact, an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via incoherence.
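
The abstract does not define incoherence, so the following is only an illustrative sketch under a stated assumption: that an oracle-less signal can be obtained from behavioral disagreement between programs sampled independently for the same task. If two programs produce different outputs on the same input, at least one of them must be incorrect, so by a union bound the probability that a sampled program is incorrect is at least half the probability that two sampled programs disagree. The function names `estimate_disagreement` and `_run`, the toy candidate programs, and the input set are all hypothetical and are not taken from the paper.

```python
import itertools
from typing import Any, Callable, Sequence, Tuple


def _run(program: Callable[[Any], Any], x: Any) -> Tuple[str, Any]:
    """Run a candidate on one input, treating exceptions as observable behavior."""
    try:
        return ("ok", program(x))
    except Exception as exc:  # a crash counts as an outcome to compare
        return ("error", type(exc).__name__)


def estimate_disagreement(
    programs: Sequence[Callable[[Any], Any]],
    inputs: Sequence[Any],
) -> float:
    """Fraction of program pairs that differ on at least one input.

    Assumption (not from the paper): disagreement between sampled
    programs stands in for an oracle-less incorrectness signal.
    """
    pairs = list(itertools.combinations(programs, 2))
    if not pairs:
        raise ValueError("need at least two candidate programs")
    disagreeing = sum(
        1
        for p, q in pairs
        if any(_run(p, x) != _run(q, x) for x in inputs)
    )
    return disagreeing / len(pairs)


# Hypothetical candidates for the task "return the absolute value of x".
candidates = [
    abs,                             # correct
    lambda x: x if x >= 0 else -x,   # correct
    lambda x: x,                     # buggy for negative inputs
]
test_inputs = [-2, 0, 3]

disagreement = estimate_disagreement(candidates, test_inputs)
error_lower_bound = disagreement / 2  # union bound under the stated assumption
```

In this toy setting two of the three program pairs disagree on the input -2, so the disagreement estimate is 2/3 and the resulting lower bound on the error is 1/3, which matches the one buggy candidate out of three. No reference implementation or specification is consulted at any point; only the candidates' outputs are compared with each other.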

Published

2026-03-14

How to Cite

Valentin, T. J.-M., Madadi, A., Sapia, G., & Böhme, M. (2026). Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33305–33313. https://doi.org/10.1609/aaai.v40i39.40616

Section

AAAI Technical Track on Natural Language Processing IV