DiT4Edit: Diffusion Transformer for Image Editing

Authors

  • Kunyu Feng, Peking University
  • Yue Ma, The Hong Kong University of Science and Technology
  • Bingyuan Wang, The Hong Kong University of Science and Technology (Guangzhou)
  • Chenyang Qi, The Hong Kong University of Science and Technology
  • Haozhe Chen, Peking University
  • Qifeng Chen, The Hong Kong University of Science and Technology
  • Zeyu Wang, The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v39i3.32304

Abstract

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) capture long-range dependencies among patches more effectively, leading to higher-quality image generation. In this paper, we propose DiT4Edit, the first Diffusion Transformer-based image editing framework. Specifically, DiT4Edit uses the DPM-Solver inversion algorithm to obtain the inverted latents, which requires fewer steps than the DDIM inversion algorithm commonly used in UNet-based frameworks. Additionally, we design unified attention control and patch merging, both tailored for transformer computation streams. Together, these components allow our framework to generate higher-quality edited images faster. Our design leverages the advantages of DiT, enabling it to surpass UNet structures in image editing, especially on high-resolution and arbitrary-size images. Extensive experiments demonstrate the strong performance of DiT4Edit in various editing scenarios, highlighting the potential of diffusion transformers for image editing.
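
The abstract names patch merging without giving details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a bipartite, cosine-similarity matching scheme in the style of token merging: patch tokens are split into two alternating sets, each token in the first set is paired with its most similar token in the second, and the r most similar pairs are averaged together. The function names `cosine` and `merge_patches` and the parameter `r` are hypothetical and chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_patches(tokens, r):
    """Reduce the token count by merging up to r similar patch tokens.

    Assumed scheme (not taken from the paper): even-indexed tokens form the
    source set, odd-indexed tokens form the destination set. The r source
    tokens with the highest similarity to their best destination are
    averaged into that destination. Original token order is not preserved.
    """
    src = tokens[0::2]
    dst = tokens[1::2]
    if not dst:
        return [list(t) for t in tokens]
    r = max(0, min(r, len(src)))

    # For each source token, find its most similar destination token.
    best = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=sims.__getitem__)
        best.append((sims[j], i, j))

    # Keep only the r most similar (source, destination) pairs.
    best.sort(key=lambda t: t[0], reverse=True)
    assignment = {i: j for _, i, j in best[:r]}

    # Average each merged source token into its destination token.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in assignment.items():
        for k, v in enumerate(src[i]):
            sums[j][k] += v
        counts[j] += 1

    kept_src = [list(s) for i, s in enumerate(src) if i not in assignment]
    merged_dst = [[v / c for v in row] for row, c in zip(sums, counts)]
    return kept_src + merged_dst


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    reduced = merge_patches(patches, r=1)
    print(len(patches), "->", len(reduced), "tokens")
```

With fewer tokens, each attention layer has less work to do, which is the usual motivation for merging in transformer pipelines. A practical editing pipeline would also need to restore the merged tokens to their original positions afterwards; that step is omitted here.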

Published

2025-04-11

How to Cite

Feng, K., Ma, Y., Wang, B., Qi, C., Chen, H., Chen, Q., & Wang, Z. (2025). DiT4Edit: Diffusion Transformer for Image Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 39(3), 2969–2977. https://doi.org/10.1609/aaai.v39i3.32304

Section

AAAI Technical Track on Computer Vision II