Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Authors

  • Chun-Hsiao Yeh (UC Berkeley, Adobe Research)
  • Yilin Wang (Adobe Research)
  • Nanxuan Zhao (Adobe Research)
  • Richard Zhang (Adobe Research)
  • Yuheng Li (Adobe Research)
  • Yi Ma (UC Berkeley, HKU)
  • Krishna Kumar Singh (Adobe Research)

DOI:

https://doi.org/10.1609/aaai.v40i14.38187

Abstract

Recent diffusion-based image editing methods have made great strides in text-guided tasks but often struggle with complex, indirect instructions. Additionally, current models frequently exhibit poor identity preservation, make unintended edits, or rely on manual masks. To overcome these limitations, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that bridges user intent with editing model capabilities. X-Planner uses chain-of-thought reasoning to systematically break down complex instructions into simpler sub-instructions. For each one, X-Planner automatically generates precise edit types and segmentation masks, enabling localized, identity-preserving edits without relying on external tools or models at inference time. To enable the training of such a planner, we also introduce a fully automated, reproducible pipeline for generating large-scale, high-quality training data. Our complete system achieves state-of-the-art results on both existing and newly proposed complex instruction-based editing benchmarks.
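The abstract describes the planner's output as a decomposition of a complex instruction into sub-instructions, each paired with an edit type and a segmentation mask. The following minimal Python sketch illustrates one plausible shape for such a plan; all names (`EditType`, `SubInstruction`, `EditPlan`, `plan_stub`) are illustrative assumptions, not the paper's actual API, and the stub stands in for the MLLM's chain-of-thought decomposition step.

```python
# Hypothetical data structures for the planning output described in the
# abstract: a complex instruction decomposed into sub-instructions, each
# with a predicted edit type and an associated segmentation mask.
# All identifiers here are assumptions for illustration only.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class EditType(Enum):
    LOCAL = "local"    # mask-guided, localized edit (e.g., recolor one object)
    GLOBAL = "global"  # whole-image edit (e.g., overall style change)

@dataclass
class SubInstruction:
    text: str                # a simple, direct sub-instruction
    edit_type: EditType      # edit type predicted for this step
    # Binary segmentation mask (H x W); empty for global edits.
    mask: List[List[int]] = field(default_factory=list)

@dataclass
class EditPlan:
    instruction: str             # the original complex instruction
    steps: List[SubInstruction]  # ordered sub-instructions to execute

def plan_stub(instruction: str) -> EditPlan:
    """Placeholder decomposition. A real planner would query an MLLM with
    chain-of-thought prompting to produce the steps and masks."""
    return EditPlan(
        instruction=instruction,
        steps=[SubInstruction(text=instruction, edit_type=EditType.LOCAL)],
    )
```

In this sketch, downstream editing code could iterate over `plan.steps`, applying each sub-instruction only within its mask, which is how the abstract frames identity-preserving, localized editing.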

Published

2026-03-14

How to Cite

Yeh, C.-H., Wang, Y., Zhao, N., Zhang, R., Li, Y., Ma, Y., & Singh, K. K. (2026). Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11991-11999. https://doi.org/10.1609/aaai.v40i14.38187

Section

AAAI Technical Track on Computer Vision XI