Talk2Code: A Multi-Turn Interaction Benchmark with Dual-Track Evaluation for Code Generation

Authors

  • Weibin Yang Guangzhou Institute of Technology, Xidian University
  • Liangru Xie School of Cyber Engineering, Xidian University
  • Jieyun Cai School of Cyber Engineering, Xidian University
  • Yuxiang Yan School of Cyber Engineering, Xidian University
  • Hong-Ning Dai Department of Computer Science, Hong Kong Baptist University
  • Hao Wang School of Cyber Engineering, Xidian University

DOI:

https://doi.org/10.1609/aaai.v40i40.40730

Abstract

While large language models (LLMs) have demonstrated strong capabilities in code generation, current benchmarks primarily focus on single-turn scenarios, neglecting the complexity of multi-turn interactions and user diversity. To address this gap, we introduce Talk2Code, the first benchmark for user-stratified multi-turn dialogue code generation evaluation across algorithmic problem-solving and backend programming tasks. A distinctive feature of our benchmark is its user-stratified interaction modeling. For identical coding tasks, we construct dialogue trajectories tailored for novice, intermediate, and expert users, capturing their distinct expectations and communication patterns. To facilitate comprehensive evaluation, we propose a multi-dimensional evaluation framework assessing both code quality and interaction experience through a novel Dual-track Evaluation Method. In the Direct Generation Track, the benchmark provides golden dialogue context (excluding the final code) directly to the LLM for code generation. In contrast, the Interactive Dialogue Track simulates realistic multi-turn interactions, prompting the model to proactively clarify instructions and gather requirements before generating solutions. Code quality is evaluated in both tracks by Test Pass Rate and Success Rate, while interaction experience is assessed exclusively within the Interactive Dialogue Track through subjective and alignment indicators. Our benchmark and multi-dimensional indicator system collectively establish a new paradigm for evaluating adaptive, user-aware AI coding assistants.

Published

2026-03-14

How to Cite

Yang, W., Xie, L., Cai, J., Yan, Y., Dai, H.-N., & Wang, H. (2026). Talk2Code: A Multi-Turn Interaction Benchmark with Dual-Track Evaluation for Code Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(40), 34331–34339. https://doi.org/10.1609/aaai.v40i40.40730

Section

AAAI Technical Track on Natural Language Processing V