Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Authors

  • Jingyu Gong School of Computer Science and Technology, East China Normal University, Shanghai, China Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China
  • Chong Zhang School of Computer Science and Technology, East China Normal University, Shanghai, China
  • Fengqi Liu School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
  • Ke Fan School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
  • Qianyu Zhou College of Computer Science and Technology, Jilin University, Jilin, China
  • Xin Tan School of Computer Science and Technology, East China Normal University, Shanghai, China Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China
  • Zhizhong Zhang School of Computer Science and Technology, East China Normal University, Shanghai, China Shanghai Key Laboratory of Computer Software Evaluating and Testing, Shanghai, China
  • Yuan Xie School of Computer Science and Technology, East China Normal University, Shanghai, China Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing, China

DOI:

https://doi.org/10.1609/aaai.v40i6.42422

Abstract

Scene-aware motion synthesis has attracted wide research interest recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, yet models trained on only a few specific scenes generalize poorly to diverse ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, in which paired motion-scene data are no longer necessary. In this paper, we disentangle human-scene interaction from motion synthesis during training, and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so that motion naturalness and interaction plausibility are maintained simultaneously. For long-term motion synthesis, we introduce motion blending in the joint rotation power space. The proposed method is evaluated on scenes synthesized with ShapeNet furniture, as well as real scenes from PROX and Replica. Results show that our framework achieves better motion naturalness and interaction plausibility than cutting-edge methods. These results also indicate the feasibility of applying DIP to motion synthesis in more general tasks and more versatile scenes.
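The abstract's central idea, interleaving diffusion denoising (for motion naturalness) with implicit-policy optimization (for interaction plausibility), can be sketched in a minimal, hypothetical form. The names `eps_model`, `interaction_cost`, and the toy noise schedule below are illustrative assumptions, not the authors' actual implementation; finite differences stand in for autodiff, and a flat vector stands in for a motion sequence.

```python
# Hypothetical sketch of the DIP-style sampling loop: alternate a
# deterministic denoising step with a gradient nudge that lowers a
# scene-interaction cost. All names and schedules here are illustrative.
import math
import random

def denoise_step(x, t, eps_model, alpha):
    """One DDIM-like step: predict noise, estimate x0, step to level t-1."""
    eps = eps_model(x, t)
    x0_hat = [(xi - math.sqrt(1 - alpha[t]) * ei) / math.sqrt(alpha[t])
              for xi, ei in zip(x, eps)]
    if t == 0:
        return x0_hat
    return [math.sqrt(alpha[t - 1]) * x0i + math.sqrt(1 - alpha[t - 1]) * ei
            for x0i, ei in zip(x0_hat, eps)]

def policy_guidance(x, interaction_cost, lr=0.1, h=1e-4):
    """Implicit-policy optimization stand-in: descend a finite-difference
    gradient of the interaction cost (e.g. penetration / contact terms)."""
    base = interaction_cost(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        grad.append((interaction_cost(xp) - base) / h)
    return [xi - lr * gi for xi, gi in zip(x, grad)]

def sample(eps_model, interaction_cost, dim=4, steps=8, seed=0):
    """Iterate denoising and policy optimization from pure noise."""
    rng = random.Random(seed)
    alpha = [(i + 1) / (steps + 1) for i in range(steps)]  # toy schedule
    x = [rng.gauss(0, 1) for _ in range(dim)]
    for t in reversed(range(steps)):
        x = denoise_step(x, t, eps_model, alpha)   # keep motion natural
        x = policy_guidance(x, interaction_cost)   # keep interaction plausible
    return x
```

Because the interaction objective is only ever applied as a guidance term at inference time, the diffusion model itself never needs paired motion-scene data, which is the decoupling the abstract describes.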

Published

2026-03-14

How to Cite

Gong, J., Zhang, C., Liu, F., Fan, K., Zhou, Q., Tan, X., … Xie, Y. (2026). Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(6), 4257–4265. https://doi.org/10.1609/aaai.v40i6.42422

Section

AAAI Technical Track on Computer Vision III