Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks

Authors

  • Dimitrios Rontogiannis (Max Planck Institute for Software Systems)
  • Maxime Peyrard (Université Grenoble Alpes, CNRS, Grenoble INP, LIG)
  • Nicolas Baldwin (FAIR at Meta)
  • Martin Josifoski (FAIR at Meta)
  • Robert West (EPFL)
  • Dimitrios Gunopulos (Department of Informatics and Telecommunications, National and Kapodistrian University of Athens)

DOI:

https://doi.org/10.1609/aaai.v40i39.40564

Abstract

Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an "interviewer" LLM, aware of the ground-truth solution, provides minimal, targeted hints to an "interviewee" model to help correct errors and fulfill target constraints. This dynamic protocol enables fine-grained diagnostic insights into model behavior, uncovering strengths and systematic weaknesses that static benchmarks fail to measure. We build on DevAI, a benchmark of 55 curated programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints through expert annotation. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
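To make the protocol concrete, the sketch below illustrates one plausible reading of the feedback-driven loop described in the abstract: requirements are visited in dependency order, the interviewee model proposes a candidate, and a ground-truth-aware interviewer returns a minimal hint when the candidate fails. All names here (Requirement, solve, check, hint, max_rounds) are hypothetical placeholders for illustration, not the paper's actual implementation.

from dataclasses import dataclass, field

@dataclass
class Requirement:
    rid: str
    description: str
    depends_on: list[str] = field(default_factory=list)

def topological_order(reqs: dict[str, "Requirement"]) -> list["Requirement"]:
    """Order requirements so each appears after the ones it depends on."""
    visited, order = set(), []
    def visit(r: Requirement) -> None:
        if r.rid in visited:
            return
        visited.add(r.rid)
        for dep in r.depends_on:
            visit(reqs[dep])
        order.append(r)
    for r in reqs.values():
        visit(r)
    return order

def evaluate_task(reqs, solve, check, hint, max_rounds=3):
    """Feedback-driven evaluation of one multi-requirement task.

    solve(req, hints) -> candidate code from the interviewee model
    check(req, code)  -> True if the candidate satisfies the requirement
    hint(req, code)   -> minimal targeted hint from the interviewer model,
                         which has access to the ground-truth solution
    """
    satisfied = {}
    for req in topological_order(reqs):
        hints: list[str] = []
        for _ in range(max_rounds):
            candidate = solve(req, hints)
            if check(req, candidate):
                satisfied[req.rid] = True
                break
            hints.append(hint(req, candidate))  # interviewer feedback for next turn
        else:
            satisfied[req.rid] = False  # requirement unmet after all rounds
    return satisfied

In this reading, per-requirement outcomes (and the number of hints consumed) are what yield the fine-grained diagnostics the abstract contrasts with single-turn, static scoring.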

Published

2026-03-14

How to Cite

Rontogiannis, D., Peyrard, M., Baldwin, N., Josifoski, M., West, R., & Gunopulos, D. (2026). Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 32843–32850. https://doi.org/10.1609/aaai.v40i39.40564

Section

AAAI Technical Track on Natural Language Processing IV