Fast Multi-view Consistent 3D Editing with Video Priors
DOI:
https://doi.org/10.1609/aaai.v40i4.37286
Abstract
Text-driven 3D editing enables user-friendly editing of 3D objects or scenes through text instructions. Lacking multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. These methods are not only time-consuming but also prone to over-smoothed results, since the iterative process averages the divergent editing signals gathered from different views. In this paper, we propose ViP3DE, an early and pioneering work on generative Video Prior based 3D Editing, which repurposes the temporal consistency priors of pre-trained video generation models to achieve consistent 3D editing within a single forward pass. Our key insight is to condition the video generation model on a single edited view so that it generates other consistently edited views for direct 3D updating, thereby bypassing the iterative editing paradigm. First, since 3D updating requires edited views paired with specific camera poses, we propose motion-preserved noise blending, which enables the video model to generate edited views at predefined camera poses. In addition, we introduce geometrically aware denoising, which further enhances multi-view consistency by integrating 3D geometric priors into the video model. Extensive experiments demonstrate that ViP3DE achieves high-quality 3D editing results within a single forward pass, significantly outperforming existing methods in both editing quality and editing time.
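The single-forward-pass pipeline described above can be sketched at a high level: edit one view with a 2D editor, condition a video model on that edited frame to produce consistent edited views at predefined camera poses, then run one 3D update. The sketch below is a toy illustration of this control flow only; all function names are hypothetical placeholders standing in for the components named in the abstract, not the authors' actual API.

```python
import numpy as np

def edit_single_view(view, instruction):
    # Placeholder for any 2D text-driven editor applied to one view.
    return view + 0.1  # dummy "edit" for illustration

def video_model_generate(cond_frame, poses):
    # Placeholder for a pre-trained video model conditioned on a single
    # edited frame, emitting one edited view per predefined camera pose
    # (where motion-preserved noise blending would steer generation).
    return [cond_frame + 0.01 * i for i, _ in enumerate(poses)]

def update_3d(frames, poses):
    # Placeholder for a single 3D updating pass from posed edited views;
    # a toy aggregate stands in for actual scene optimization.
    return np.mean(frames, axis=0)

views = [np.zeros((4, 4)) for _ in range(5)]   # dummy rendered views
poses = list(range(5))                          # dummy camera poses
edited0 = edit_single_view(views[0], "make it golden")
edited_views = video_model_generate(edited0, poses)
scene = update_3d(edited_views, poses)          # one pass, no 2D-3D-2D loop
```

The point of the sketch is structural: the 3D update consumes all edited views in one pass, rather than averaging editing signals across repeated per-view iterations.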
Published
2026-03-14
How to Cite
Chen, L., Li, R., Zhang, G., Wang, P., & Zhang, L. (2026). Fast Multi-view Consistent 3D Editing with Video Priors. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2948-2956. https://doi.org/10.1609/aaai.v40i4.37286
Section
AAAI Technical Track on Computer Vision I