Re-Attentional Controllable Video Diffusion Editing

Authors

  • Yuanzhi Wang School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. Department of Content Security, Kuaishou Technology, Beijing, China.
  • Yong Li School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. Department of Computer Science, City University of Hong Kong, Hong Kong, China.
  • Mengyi Liu Department of Content Security, Kuaishou Technology, Beijing, China.
  • Xiaoya Zhang School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China.
  • Xin Liu SeetaCloud, Nanjing, China.
  • Zhen Cui School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China.
  • Antoni B. Chan Department of Computer Science, City University of Hong Kong, Hong Kong, China.

DOI:

https://doi.org/10.1609/aaai.v39i8.32876

Abstract

Editing videos with textual guidance has grown popular because of its streamlined workflow, which requires users only to edit the text prompt corresponding to the source video. Recent studies have explored and exploited large-scale text-to-image diffusion models for text-guided video editing, achieving remarkable video editing capabilities. However, these methods may still suffer from limitations such as mislocated objects or an incorrect number of objects; the controllability of video editing thus remains a formidable challenge. In this paper, we address these limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specifically, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose Re-Attentional Diffusion (RAD), which refocuses the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, yielding a spatially location-aligned and semantically high-fidelity edited video. Furthermore, to faithfully preserve the invariant region content with fewer border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy that mitigates the intrinsic sampling errors w.r.t. the invariant regions at each denoising timestep and constrains the generated content to be harmonized with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior editing performance.
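The abstract describes the two components only at a high level; the sketches below are our own illustrations, not the paper's implementation. First, a minimal, hypothetical PyTorch sketch of the idea behind RAD: cross-attention responses for the edited object's text tokens are amplified inside a user-specified target region and suppressed outside it, then renormalized. The function name, mask format, and gain factors are all illustrative assumptions.

```python
import torch

def refocus_cross_attention(attn_probs, region_mask, token_ids,
                            gain_in=2.0, gain_out=0.1):
    """Illustrative re-attentional refocusing (assumed, not the paper's code).

    attn_probs:  (batch, heads, num_pixels, num_tokens) cross-attention weights
    region_mask: (num_pixels,) binary mask, 1 inside the target object region
    token_ids:   indices of the edited object's tokens in the text prompt
    """
    out = attn_probs.clone()
    inside = region_mask.bool()
    for tok in token_ids:
        out[..., inside, tok] *= gain_in    # boost target-token attention in-region
        out[..., ~inside, tok] *= gain_out  # suppress it elsewhere
    # renormalize so attention over text tokens still sums to 1 per pixel
    return out / out.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```

Second, IRJS constrains the invariant regions during sampling. A common baseline form of this idea (in the spirit of blended/RePaint-style latent compositing, written against a diffusers-style scheduler and UNet API) re-noises the source latents to the current noise level and composites them into the denoised latent at each step; the paper's actual error-mitigation scheme may differ from this sketch.

```python
@torch.no_grad()
def joint_sampling_step(unet, scheduler, z_t, t, text_emb,
                        z0_source, invariant_mask):
    """One denoising step that ties invariant regions to the source video
    latents (a baseline illustration of the IRJS idea; assumed API, shown
    per frame for brevity).

    z_t:            current noisy latent, shape (B, C, H, W)
    z0_source:      clean source-video latents, same shape
    invariant_mask: 1 where content must remain unchanged
    """
    eps = unet(z_t, t, encoder_hidden_states=text_emb).sample  # predict noise
    z_prev = scheduler.step(eps, t, z_t).prev_sample           # one denoise step
    # re-noise the source latents to (approximately) the same noise level and
    # blend; strictly this should target the previous timestep, simplified here
    noise = torch.randn_like(z0_source)
    z_src = scheduler.add_noise(z0_source, noise, t)
    return invariant_mask * z_src + (1 - invariant_mask) * z_prev
```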

Published

2025-04-11

How to Cite

Wang, Y., Li, Y., Liu, M., Zhang, X., Liu, X., Cui, Z., & Chan, A. B. (2025). Re-Attentional Controllable Video Diffusion Editing. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 8123–8131. https://doi.org/10.1609/aaai.v39i8.32876

Section

AAAI Technical Track on Computer Vision VII