RAA: Achieving Interactive Remove/Add Anything via Fully Synthetic Data
DOI:
https://doi.org/10.1609/aaai.v40i9.37648
Abstract
Precise and controllable image editing, especially object removal and insertion, represents one of the most common demands in image manipulation. However, existing methods suffer from severe limitations. Mask-based inpainting often introduces visual artifacts and semantic inconsistencies, while instruction-based approaches lack accurate spatial control and tend to unintentionally modify background regions. To address these issues, we propose two key contributions. First, we develop a fully automated and self-improving pipeline for synthetic data generation. This pipeline utilizes a Large Language Model (LLM) to generate diverse prompts, a Diffusion Transformer (DiT) fine-tuned evolutionarily to synthesize high-quality images, and a Multimodal LLM (MLLM) combined with an open-set object detector for automated quality control and annotation. This process produces the Remove/Add Dataset (RAD), consisting of over 514,510 high-quality image pairs, each richly annotated with bounding boxes, segmentation masks, and a variety of editing instructions. Second, based on RAD, we introduce Remove/Add Anything (RAA), a novel editing framework with precise spatial control. Built upon a diffusion-based inpainting model, RAA achieves high editing accuracy by conditioning on both textual instructions and an explicitly defined region of interest (ROI), enabling efficient fine-tuning while maintaining global visual coherence. Extensive experiments demonstrate that RAA significantly outperforms existing open-source methods on both addition and removal tasks, and even slightly surpasses costly proprietary models.
Published
2026-03-14
How to Cite
Liu, D., Hou, H., Hou, Z., Han, S., Huang, Z., Zhan, M., … Zhao, Z. (2026). RAA: Achieving Interactive Remove/Add Anything via Fully Synthetic Data. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7123–7131. https://doi.org/10.1609/aaai.v40i9.37648
Section
AAAI Technical Track on Computer Vision VI