SceneGenesis: 3D Scene Synthesis via Semantic Structural Priors and Mesh-Guided Video-Geometry Fusion

Authors

  • Yueming Zhao, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Hongyu Yang, School of Artificial Intelligence, Beihang University, Beijing, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
  • Di Huang, School of Computer Science and Engineering, Beihang University, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i16.38333

Abstract

Generating high-quality, controllable, and structurally consistent 3D scenes in complex multi-object environments remains a fundamental challenge. We present SceneGenesis, a unified framework that synthesizes 3D scenes by combining semantic structural priors with mesh-guided video–geometry fusion. SceneGenesis first employs large language models to convert textual descriptions into category-aware object specifications, which are transformed into structured meshes using procedural approximations and pretrained asset generators, enabling precise layout control and scalable scene construction. To obtain rich and style-controllable appearances, SceneGenesis generates multi-view video representations conditioned on the initialized structure. A mesh-guided video–geometry fusion module then consolidates video evidence with mesh priors through mesh-conditioned fragment initialization, progressive geometric refinement, and structure-aware optimization, substantially improving global geometric fidelity and visual realism. Experiments demonstrate that SceneGenesis supports flexible style variation and object-level editing while achieving strong controllability, scalability, and structural quality.

Published

2026-03-14

How to Cite

Zhao, Y., Yang, H., & Huang, D. (2026). SceneGenesis: 3D Scene Synthesis via Semantic Structural Priors and Mesh-Guided Video-Geometry Fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13305–13313. https://doi.org/10.1609/aaai.v40i16.38333

Section

AAAI Technical Track on Computer Vision XIII