RAA: Achieving Interactive Remove/Add Anything via Fully Synthetic Data

Authors

  • Delong Liu (Beijing University of Posts and Telecommunications)
  • Haotian Hou (Beihang University; SenseTime)
  • Zhaohui Hou (SenseTime)
  • Shihao Han (SenseTime)
  • Zhiyuan Huang (SenseTime)
  • Mingjie Zhan (SenseTime)
  • Fei Su (Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture, China; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China)
  • Zhicheng Zhao (Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture, China; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China)

DOI:

https://doi.org/10.1609/aaai.v40i9.37648

Abstract

Precise and controllable image editing, especially object removal and insertion, represents one of the most common demands in image manipulation. However, existing methods suffer from severe limitations. Mask-based inpainting often introduces visual artifacts and semantic inconsistencies, while instruction-based approaches lack accurate spatial control and tend to unintentionally modify background regions. To address these issues, we propose two key contributions. First, we develop a fully automated and self-improving pipeline for synthetic data generation. This pipeline utilizes a Large Language Model (LLM) to generate diverse prompts, a Diffusion Transformer (DiT) fine-tuned evolutionarily to synthesize high-quality images, and a Multimodal LLM (MLLM) combined with an open-set object detector for automated quality control and annotation. This process produces the Remove/Add Dataset (RAD), consisting of over 514,510 high-quality image pairs, each richly annotated with bounding boxes, segmentation masks, and a variety of editing instructions. Second, based on RAD, we introduce Remove/Add Anything (RAA), a novel editing framework with precise spatial control. Built upon a diffusion-based inpainting model, RAA achieves high editing accuracy by conditioning on both textual instructions and an explicitly defined region of interest (ROI), enabling efficient fine-tuning while maintaining global visual coherence. Extensive experiments demonstrate that RAA significantly outperforms existing open-source methods on both addition and removal tasks, and even slightly surpasses costly proprietary models.
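
The abstract states that each RAD pair carries a bounding box and a segmentation mask, and that RAA edits within an explicitly defined ROI while leaving the background intact. The toy sketch below illustrates only those two ideas on nested-list "images": deriving a box from a binary mask, and restricting a change to that box. It is not the authors' implementation. RAA conditions a diffusion model on the ROI rather than pasting pixels, and the function names `mask_to_bbox` and `composite_in_roi` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]           # toy image or binary mask (rows of ints)
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def mask_to_bbox(mask: Grid) -> BBox:
    """Return the tightest inclusive box around the nonzero mask pixels."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        raise ValueError("mask contains no object pixels")
    return min(xs), min(ys), max(xs), max(ys)


def composite_in_roi(original: Grid, edited: Grid, roi: BBox) -> Grid:
    """Take edited pixels inside the ROI and original pixels everywhere else.

    Illustrative assumption only: this hard paste stands in for the idea
    that changes outside the ROI should not happen.
    """
    x0, y0, x1, y1 = roi
    height, width = len(original), len(original[0])
    return [
        [
            edited[y][x] if (x0 <= x <= x1 and y0 <= y <= y1) else original[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy example: a 3x4 mask, an all-zero original, and an all-nine "edit".
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
original = [[0] * 4 for _ in range(3)]
edited = [[9] * 4 for _ in range(3)]

roi = mask_to_bbox(mask)
result = composite_in_roi(original, edited, roi)
```

In this toy case the box covers columns 1 to 2 and rows 1 to 2, so only those four pixels differ from the original; every pixel outside the ROI is unchanged.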

Published

2026-03-14

How to Cite

Liu, D., Hou, H., Hou, Z., Han, S., Huang, Z., Zhan, M., … Zhao, Z. (2026). RAA: Achieving Interactive Remove/Add Anything via Fully Synthetic Data. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7123–7131. https://doi.org/10.1609/aaai.v40i9.37648

Section

AAAI Technical Track on Computer Vision VI