Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
DOI:
https://doi.org/10.1609/aaai.v40i14.38187
Abstract
Recent diffusion-based image editing methods have made great strides in text-guided tasks but often struggle with complex, indirect instructions. Additionally, current models frequently exhibit poor identity preservation, unintended edits, or rely on manual masks. To overcome these limitations, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that bridges user intent with editing model capabilities. X-Planner uses chain-of-thought reasoning to systematically break down complex instructions into simpler sub-instructions. For each one, X-Planner automatically generates precise edit types and segmentation masks, enabling localized, identity-preserving edits without applying external tools or models during inference. To enable the training of such a planner, we also introduce a fully automated, reproducible pipeline to generate large-scale, high-quality training data. Our complete system achieves state-of-the-art results on both existing and newly proposed complex instruction-based editing benchmarks.
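The decomposition the abstract describes — a complex instruction broken into simpler sub-instructions, each paired with an edit type and a mask — can be pictured with a small schema. This is a hedged illustrative sketch only, not the authors' implementation; all class, function, and field names here are hypothetical, and the mask is reduced to a region label.

```python
from dataclasses import dataclass

# Hypothetical sketch of the kind of plan X-Planner is described as producing:
# each sub-instruction carries a simple edit command, an edit type, and a
# placeholder for the segmentation mask (here just a region label).

@dataclass
class SubInstruction:
    text: str         # simple, directly executable edit instruction
    edit_type: str    # e.g. "object_addition", "object_change", "color_change"
    mask_region: str  # stand-in for a segmentation mask

def plan(complex_instruction: str) -> list[SubInstruction]:
    """Toy stand-in for the MLLM planner: returns a hand-written
    decomposition for one example instruction, to show output structure."""
    if complex_instruction == "make the photo look like winter":
        return [
            SubInstruction("add snow to the ground", "object_addition", "ground"),
            SubInstruction("make the trees bare", "object_change", "trees"),
            SubInstruction("give the sky an overcast tone", "color_change", "sky"),
        ]
    return []

steps = plan("make the photo look like winter")
print(len(steps))           # → 3
print(steps[0].edit_type)   # → object_addition
```

In the actual system, each sub-instruction would drive a localized, identity-preserving edit using the predicted mask; the sketch above only illustrates the planner's output shape.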
Published
2026-03-14
How to Cite
Yeh, C.-H., Wang, Y., Zhao, N., Zhang, R., Li, Y., Ma, Y., & Singh, K. K. (2026). Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14), 11991-11999. https://doi.org/10.1609/aaai.v40i14.38187
Section
AAAI Technical Track on Computer Vision XI