ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

Authors

  • Xingwei He, Department of Computer Science, The University of Hong Kong, Hong Kong, China
  • Qianru Zhang, Department of Computer Science, The University of Hong Kong, Hong Kong, China
  • Pengfei Chen, School of Artificial Intelligence, Xidian University, Xi’an, China
  • Guanhua Chen, Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
  • Linlin Yu, Department of Computer Science, Augusta University, Augusta, GA, USA
  • Yuan Yuan, School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Qingdao Research Institute, Beihang University; Hangzhou Innovation Institute, Beihang University
  • Siu-Ming Yiu, Department of Computer Science, The University of Hong Kong, Hong Kong, China

DOI:

https://doi.org/10.1609/aaai.v40i37.40356

Abstract

Instruction-following is a critical capability of Large Language Models (LLMs). While existing work primarily focuses on assessing how well LLMs adhere to user instructions, it often overlooks scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely notify users explicitly about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.

Published

2026-03-14

How to Cite

He, X., Zhang, Q., Chen, P., Chen, G., Yu, L., Yuan, Y., & Yiu, S.-M. (2026). ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37), 30969–30977. https://doi.org/10.1609/aaai.v40i37.40356

Section

AAAI Technical Track on Natural Language Processing II