RAA: Achieving Interactive Remove/Add Anything via Fully Synthetic Data

Delong Liu; Haotian Hou; Zhaohui Hou; Shihao Han; Zhiyuan Huang; Mingjie Zhan; Fei Su; Zhicheng Zhao

doi:10.1609/aaai.v40i9.37648

Authors

Delong Liu, Beijing University of Posts and Telecommunications
Haotian Hou, Beihang University; SenseTime
Zhaohui Hou, SenseTime
Shihao Han, SenseTime
Zhiyuan Huang, SenseTime
Mingjie Zhan, SenseTime
Fei Su, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture, China; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China
Zhicheng Zhao, Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture, China; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i9.37648

Abstract

Precise and controllable image editing, especially object removal and insertion, represents one of the most common demands in image manipulation. However, existing methods suffer from severe limitations. Mask-based inpainting often introduces visual artifacts and semantic inconsistencies, while instruction-based approaches lack accurate spatial control and tend to unintentionally modify background regions. To address these issues, we propose two key contributions. First, we develop a fully automated and self-improving pipeline for synthetic data generation. This pipeline utilizes a Large Language Model (LLM) to generate diverse prompts, a Diffusion Transformer (DiT) fine-tuned evolutionarily to synthesize high-quality images, and a Multimodal LLM (MLLM) combined with an open-set object detector for automated quality control and annotation. This process produces the Remove/Add Dataset (RAD), consisting of over 514,510 high-quality image pairs, each richly annotated with bounding boxes, segmentation masks, and a variety of editing instructions. Second, based on RAD, we introduce Remove/Add Anything (RAA), a novel editing framework with precise spatial control. Built upon a diffusion-based inpainting model, RAA achieves high editing accuracy by conditioning on both textual instructions and an explicitly defined region of interest (ROI), enabling efficient fine-tuning while maintaining global visual coherence. Extensive experiments demonstrate that RAA significantly outperforms existing open-source methods on both addition and removal tasks, and even slightly surpasses costly proprietary models.
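
The abstract does not spell out how an explicit ROI keeps background regions untouched, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the ROI is given as a bounding box, turns it into a binary mask, and copies edited pixels into the result only where the mask is set. The names `box_to_mask` and `composite_in_roi`, the list-of-rows image representation, and the box convention are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Illustrative types (assumptions, not from the paper):
Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with exclusive upper bounds


def box_to_mask(height: int, width: int, box: Box) -> List[List[int]]:
    """Return a binary mask that is 1 inside the box and 0 elsewhere."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def composite_in_roi(original: Image, edited: Image, mask: List[List[int]]) -> Image:
    """Take edited pixels inside the ROI and original pixels outside it."""
    return [
        [e if m else o for o, e, m in zip(o_row, e_row, m_row)]
        for o_row, e_row, m_row in zip(original, edited, mask)
    ]


# Toy 3x4 example: the "edit" changes every pixel, but only the ROI is kept.
original = [[10, 10, 10, 10] for _ in range(3)]
edited = [[99, 99, 99, 99] for _ in range(3)]
mask = box_to_mask(3, 4, (1, 0, 3, 2))
result = composite_in_roi(original, edited, mask)
```

In this toy case the pixels outside the box are identical to the original image, which is the property the abstract contrasts with instruction-only editors that tend to alter the background.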
