CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
DOI:
https://doi.org/10.1609/aaai.v39i10.33203
Abstract
Video inpainting is a crucial task with diverse applications, including fine-grained video editing, video recovery, and video dewatermarking. However, most existing video inpainting methods focus primarily on visual content completion while neglecting text information. Only a limited number of text-guided video inpainting techniques exist, and they struggle to maintain visual quality and exhibit poor semantic representation capabilities. In this paper, we introduce CoCoCo, a text-guided video inpainting diffusion framework. To address the aforementioned challenges, we enhance both the training data and the model structure. Specifically, we devise an instance-aware region selection strategy for masked-area sampling and develop a novel motion block that incorporates efficient 3D full attention and textual cross-attention. Additionally, our CoCoCo framework can be seamlessly integrated with various personalized text-to-image diffusion models through a delicate training-free transfer mechanism. Comprehensive experiments demonstrate that CoCoCo can create high-quality visual content with enhanced temporal consistency, improved text controllability, and better compatibility with personalized image models.
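The motion block described above can be sketched in PyTorch as a joint space-time self-attention ("3D full attention" over all flattened video tokens) followed by cross-attention to text embeddings. This is a minimal illustrative sketch only; the module names, dimensions, and exact wiring are assumptions, not the paper's implementation.

```python
# Hedged sketch of a motion block with 3D full attention (joint
# space-time self-attention) plus textual cross-attention.
# All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class MotionBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # video_tokens: (B, T*H*W, dim) -- frames and spatial positions are
        # flattened together, so self-attention spans all of space-time.
        h = self.norm1(video_tokens)
        x = video_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # Textual cross-attention: video tokens attend to text embeddings.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_tokens, text_tokens,
                                need_weights=False)[0]
        return x


B, T, HW, D = 2, 4, 16, 64          # small toy sizes
video = torch.randn(B, T * HW, D)   # flattened space-time tokens
text = torch.randn(B, 8, D)         # e.g. 8 tokens from a text encoder
out = MotionBlock(dim=D)(video, text)
print(out.shape)  # torch.Size([2, 64, 64])
```

Attending over all space-time tokens at once (rather than per-frame spatial attention plus a separate temporal pass) is one plausible reading of "3D full attention," trading extra compute for stronger temporal consistency.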
Published
2025-04-11
How to Cite
Zi, B., Zhao, S., Qi, X., Wang, J., Shi, Y., Chen, Q., … Zhang, L. (2025). CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 11067–11076. https://doi.org/10.1609/aaai.v39i10.33203
Section
AAAI Technical Track on Computer Vision IX