Talk2Code: A Multi-Turn Interaction Benchmark with Dual-Track Evaluation for Code Generation

Authors

  • Weibin Yang Guangzhou Institute of Technology, Xidian University
  • Liangru Xie School of Cyber Engineering, Xidian University
  • Jieyun Cai School of Cyber Engineering, Xidian University
  • Yuxiang Yan School of Cyber Engineering, Xidian University
  • Hong-Ning Dai Department of Computer Science, Hong Kong Baptist University
  • Hao Wang School of Cyber Engineering, Xidian University

DOI:

https://doi.org/10.1609/aaai.v40i40.40730

Abstract

While large language models (LLMs) have demonstrated strong capabilities in code generation, current benchmarks primarily focus on single-turn scenarios, neglecting the complexity of multi-turn interactions and user diversity. To address this gap, we introduce Talk2Code, the first benchmark for user-stratified multi-turn dialogue code generation evaluation across algorithmic problem-solving and backend programming tasks. A distinctive feature of our benchmark is its user-stratified interaction modeling. For identical coding tasks, we construct dialogue trajectories tailored for novice, intermediate, and expert users, capturing their distinct expectations and communication patterns. To facilitate comprehensive evaluation, we propose a multi-dimensional evaluation framework assessing both code quality and interaction experience through a novel Dual-track Evaluation Method. In the Direct Generation Track, the benchmark provides golden dialogue context (excluding the final code) directly to the LLM for code generation. In contrast, the Interactive Dialogue Track simulates realistic multi-turn interactions, prompting the model to proactively clarify instructions and gather requirements before generating solutions. Code quality is evaluated in both tracks by Test Pass Rate and Success Rate, while interaction experience is assessed exclusively within the Interactive Dialogue Track through subjective and alignment indicators. Our benchmark and multi-dimensional indicator system collectively establish a new paradigm for evaluating adaptive, user-aware AI coding assistants.

Published

2026-03-14

How to Cite

Yang, W., Xie, L., Cai, J., Yan, Y., Dai, H.-N., & Wang, H. (2026). Talk2Code: A Multi-Turn Interaction Benchmark with Dual-Track Evaluation for Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34331–34339. https://doi.org/10.1609/aaai.v40i40.40730

Section

AAAI Technical Track on Natural Language Processing V