Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy

Authors

  • Jingyu Gong School of Computer Science and Technology, East China Normal University, Shanghai, China Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai Development Center of Computer Software Technology, Shanghai, China
  • Kunkun Tong School of Computer Science and Technology, East China Normal University, Shanghai, China
  • Zhuoran Chen School of Computer Science and Technology, East China Normal University, Shanghai, China
  • Chuanhan Yuan College of Computer Science, Chongqing University, Chongqing, China
  • Mingang Chen Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai Development Center of Computer Software Technology, Shanghai, China
  • Zhizhong Zhang School of Computer Science and Technology, East China Normal University, Shanghai, China Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai Development Center of Computer Software Technology, Shanghai, China
  • Xin Tan School of Computer Science and Technology, East China Normal University, Shanghai, China Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China
  • Yuan Xie School of Computer Science and Technology, East China Normal University, Shanghai, China Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China

DOI:

https://doi.org/10.1609/aaai.v40i6.42421

Abstract

Human motion synthesis in 3D scenes relies heavily on scene comprehension, yet current methods focus mainly on scene structure and ignore semantic understanding. In this paper, we propose a human motion synthesis framework, termed SSOMotion, that takes a unified Scene Semantic Occupancy (SSO) as its scene representation. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and a shared linear dimensionality reduction. This strategy captures fine-grained scene semantic structure while significantly reducing redundant computation. We further use these scene hints, together with the movement direction derived from instructions, to control motion via a frame-wise scene query. Extensive experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate cutting-edge performance and validate the method's effectiveness and generalization ability.
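The abstract's pipeline (semantic occupancy grid → CLIP-encoded class features → shared linear reduction → compact tri-plane representation) can be illustrated with a minimal NumPy sketch. Note the assumptions: a random embedding table stands in for the CLIP text encoder, and simple axis-wise mean pooling stands in for the paper's bi-directional tri-plane decomposition, whose exact form is not specified here.

```python
import numpy as np

# Illustrative sketch of compacting a Scene Semantic Occupancy (SSO) grid
# into tri-plane features. The CLIP text encoder is replaced by a random
# embedding table, and mean pooling replaces the paper's bi-directional
# decomposition -- both are assumptions for illustration only.

rng = np.random.default_rng(0)

X, Y, Z = 16, 16, 8                  # voxel grid resolution
num_classes, d_clip, d = 5, 32, 8    # semantic classes, CLIP dim, reduced dim

# Semantic occupancy: each voxel holds a class id (0 = empty)
sso = rng.integers(0, num_classes, size=(X, Y, Z))

# Stand-in for CLIP text embeddings of the class names
clip_embed = rng.normal(size=(num_classes, d_clip))

# Shared linear dimensionality reduction applied to every class embedding
W = rng.normal(size=(d_clip, d))
class_feat = clip_embed @ W          # (num_classes, d)

# Lift the grid into per-voxel features, then pool along each axis
voxel_feat = class_feat[sso]         # (X, Y, Z, d)
plane_xy = voxel_feat.mean(axis=2)   # (X, Y, d)
plane_xz = voxel_feat.mean(axis=1)   # (X, Z, d)
plane_yz = voxel_feat.mean(axis=0)   # (Y, Z, d)

print(plane_xy.shape, plane_xz.shape, plane_yz.shape)
```

The three planes together hold X·Y + X·Z + Y·Z feature vectors instead of X·Y·Z, which is the source of the computational savings the abstract refers to; a motion model would then query these planes per frame rather than the full grid.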

Published

2026-03-14

How to Cite

Gong, J., Tong, K., Chen, Z., Yuan, C., Chen, M., Zhang, Z., … Xie, Y. (2026). Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4248–4256. https://doi.org/10.1609/aaai.v40i6.42421

Section

AAAI Technical Track on Computer Vision III