Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
DOI:
https://doi.org/10.1609/aaai.v40i14.38187
Abstract
Recent diffusion-based image editing methods have made great strides in text-guided tasks but often struggle with complex, indirect instructions. Additionally, current models frequently exhibit poor identity preservation, unintended edits, or rely on manual masks. To overcome these limitations, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that bridges user intent with editing model capabilities. X-Planner uses chain-of-thought reasoning to systematically break down complex instructions into simpler sub-instructions. For each one, X-Planner automatically generates precise edit types and segmentation masks, enabling localized, identity-preserving edits without applying external tools or models during inference. To enable the training of such a planner, we also introduce a fully automated, reproducible pipeline to generate large-scale, high-quality training data. Our complete system achieves state-of-the-art results on both existing and newly proposed complex instruction-based editing benchmarks.
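The decomposition the abstract describes — a complex instruction broken into simpler sub-instructions, each paired with an edit type and a mask — can be pictured with a small schema. This is a hedged illustrative sketch only, not the authors' implementation; all class, function, and field names here are hypothetical, and the mask is reduced to a region label.

```python
from dataclasses import dataclass

# Hypothetical sketch of the kind of plan X-Planner is described as producing:
# each sub-instruction carries a simple edit command, an edit type, and a
# placeholder for the segmentation mask (here just a region label).

@dataclass
class SubInstruction:
    text: str         # simple, directly executable edit instruction
    edit_type: str    # e.g. "object_addition", "object_change", "color_change"
    mask_region: str  # stand-in for a segmentation mask

def plan(complex_instruction: str) -> list[SubInstruction]:
    """Toy stand-in for the MLLM planner: returns a hand-written
    decomposition for one example instruction, to show output structure."""
    if complex_instruction == "make the photo look like winter":
        return [
            SubInstruction("add snow to the ground", "object_addition", "ground"),
            SubInstruction("make the trees bare", "object_change", "trees"),
            SubInstruction("give the sky an overcast tone", "color_change", "sky"),
        ]
    return []

steps = plan("make the photo look like winter")
print(len(steps))           # → 3
print(steps[0].edit_type)   # → object_addition
```

In the actual system, each sub-instruction would drive a localized, identity-preserving edit using the predicted mask; the sketch above only illustrates the planner's output shape.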
Published
2026-03-14
How to Cite
Yeh, C.-H., Wang, Y., Zhao, N., Zhang, R., Li, Y., Ma, Y., & Singh, K. K. (2026). Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11991-11999. https://doi.org/10.1609/aaai.v40i14.38187
Section
AAAI Technical Track on Computer Vision XI