Object Fusion via Diffusion Time-step for Customized Image Editing with Single Example

Authors

  • Xue Song, College of Computer Science and Artificial Intelligence, Fudan University
  • Zhongqi Yue, College of Computing and Data Science, Nanyang Technological University
  • Jiequan Cui, School of Computer Science and Information Engineering, Hefei University of Technology
  • Hanwang Zhang, College of Computing and Data Science, Nanyang Technological University
  • Jingjing Chen, College of Computer Science and Artificial Intelligence, Fudan University; Institute of Trustworthy Embodied AI, Fudan University

DOI:

https://doi.org/10.1609/aaai.v40i11.37869

Abstract

We tackle the task of customized image editing using a text-conditioned Diffusion Model (DM). The goal is to fuse the subject of a reference image (e.g., sunglasses) into a source image (e.g., a boy) while preserving the fidelity of both (e.g., the boy wearing the sunglasses). An intuitive approach, called LoRA fusion, first trains a separate DM LoRA for each image to encode its details, then linearly combines the two LoRAs with a weight to generate a fused image. Unfortunately, even when the weight is chosen by careful grid search or learned, this approach still trades off the fidelity of one image against the other. We point out that the root cause lies in the overlooked role of the diffusion time-step in the generation process: a smaller time-step controls the generation of a more fine-grained attribute. For example, a large LoRA weight for the source may help preserve its fine-grained details (e.g., face attributes) at a small time-step, but could overpower the reference subject LoRA and lose the fidelity of the subject's overall shape at a larger time-step. To address this deficiency, we propose TimeFusion, which learns a time-step-specific LoRA fusion weight that resolves the trade-off, i.e., it generates the source and the reference subject in high fidelity given their respective prompts. We can then customize image editing using this weight and a target prompt.
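
The difference between a single fusion weight and a time-step-specific one can be illustrated with a toy sketch. This is not the authors' implementation: the convex form w * source + (1 - w) * reference, the bucketed weight table, the function names, and all numbers are assumptions made purely for illustration, since the abstract only states that the two LoRAs are linearly combined by a weight and that TimeFusion learns that weight per time-step.

```python
# Illustrative sketch only. The convex combination, the bucketed weight
# table, and every name and number below are assumptions, not the paper's code.

def fuse_lora_deltas(delta_src, delta_ref, w):
    """Linearly combine two LoRA weight updates as w * src + (1 - w) * ref."""
    return [
        [w * s + (1.0 - w) * r for s, r in zip(row_s, row_r)]
        for row_s, row_r in zip(delta_src, delta_ref)
    ]

def timestep_weight(t, weights_per_bucket, num_timesteps=1000):
    """Look up the source weight for time-step t in a per-bucket table.

    The table is hand-written here; in a TimeFusion-style method it would
    be learned rather than fixed.
    """
    n = len(weights_per_bucket)
    bucket = min(t * n // num_timesteps, n - 1)
    return weights_per_bucket[bucket]

# Toy 2x2 stand-ins for the source and reference LoRA updates.
delta_src = [[1.0, 0.0], [0.0, 1.0]]
delta_ref = [[0.0, 2.0], [2.0, 0.0]]

# Baseline LoRA fusion: one global weight used at every time-step.
fused_global = fuse_lora_deltas(delta_src, delta_ref, 0.5)

# Time-step-specific fusion: favour the source at small t (fine details)
# and the reference at large t (overall shape).
weights = [0.8, 0.6, 0.4, 0.2]
fused_small_t = fuse_lora_deltas(delta_src, delta_ref, timestep_weight(50, weights))
fused_large_t = fuse_lora_deltas(delta_src, delta_ref, timestep_weight(950, weights))
```

With a single global weight, the same mixture is applied throughout denoising; with a per-time-step table, the mixture can shift between the two LoRAs as generation moves from coarse structure to fine detail.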

Published

2026-03-14

How to Cite

Song, X., Yue, Z., Cui, J., Zhang, H., & Chen, J. (2026). Object Fusion via Diffusion Time-step for Customized Image Editing with Single Example. Proceedings of the AAAI Conference on Artificial Intelligence, 40(11), 9127–9134. https://doi.org/10.1609/aaai.v40i11.37869

Issue

Section

AAAI Technical Track on Computer Vision VIII