IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes

Authors

  • Wenfeng Song College of Computer Science, Beijing Information Science and Technology University
  • Shi Zheng College of Computer Science, Beijing Information Science and Technology University
  • Xinyu Zhang State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
  • Xingliang Jin School of Computer Science and Technology, East China Normal University
  • Aimin Hao State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
  • Fei Hou Key Laboratory of System Software (CAS) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China
  • Xia Hou College of Computer Science, Beijing Information Science and Technology University
  • Shuai Li State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Zhongguancun Laboratory, China

DOI:

https://doi.org/10.1609/aaai.v40i3.37188

Abstract

Generating human motion in complex 3D scenes from text is a challenging task with broad applications. However, existing methods often overlook realistic physical contact, resulting in visually plausible but physically unrealistic motion, e.g., penetration. To alleviate this, we propose IntentMotion, a novel framework that generates human motion in 3D scenes from natural language instructions by explicitly modeling intent. We first introduce the Intention-Guided Contact Field (IGCF), a differentiable voxel-based representation of contact regions that explicitly aligns parsed language roles with spatial contact regions through a hierarchical attention mechanism. IGCF is jointly trained with a diffusion-based motion generator, allowing contact predictions to adapt dynamically through gradient feedback. To improve controllability and physical plausibility, we further propose an Intention-Aware Diffusion Model (IADM), which decouples high-level semantic planning from low-level contact refinement in a coarse-to-fine process: the optimized contact cues guide the synthesis of a coarse trajectory, and detailed pose sequences are then refined under IGCF supervision. Experiments on the HUMANISE and LINGO datasets demonstrate that IntentMotion outperforms recent baselines in contact accuracy, semantic alignment, and generalization to unseen scenes.
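The paper's actual models are not reproduced here; the following toy sketch (all names and numbers hypothetical, NumPy standing in for the learned components) only illustrates the general idea the abstract describes: a voxelized contact field scores positions in the scene, and its gradient feeds back to nudge a coarse trajectory toward the predicted contact region, analogous to refining poses under IGCF supervision.

```python
import numpy as np

def make_contact_field(shape=(16, 16, 16), target=(8, 8, 2)):
    """Toy 'contact field': value increases toward the desired contact voxel.

    Here it is just the negative Euclidean distance to a single target voxel;
    in the paper this field would be predicted from language and scene geometry.
    """
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"),
                    axis=-1)
    return -np.linalg.norm(grid - np.array(target), axis=-1)

def refine_trajectory(traj, field, steps=60, lr=0.5):
    """Gradient-ascent refinement of coarse waypoints.

    Samples the field's finite-difference gradient (np.gradient) at each
    waypoint's nearest voxel and steps uphill, a crude stand-in for the
    differentiable contact feedback in the coarse-to-fine process.
    """
    grads = np.gradient(field)  # one gradient array per spatial axis
    traj = traj.astype(float).copy()
    hi = np.array(field.shape) - 1
    for _ in range(steps):
        idx = np.clip(np.round(traj).astype(int), 0, hi)
        g = np.stack([grads[a][tuple(idx.T)] for a in range(3)], axis=-1)
        traj += lr * g  # move each waypoint toward higher contact score
    return traj

field = make_contact_field()
coarse = np.array([[2.0, 3.0, 10.0], [12.0, 13.0, 14.0]])  # coarse waypoints
refined = refine_trajectory(coarse, field)
# Both waypoints drift toward the contact target voxel at (8, 8, 2).
```

The design point being illustrated: because the field is a dense voxel grid, its gradient is available everywhere in the scene, so the same signal can supervise both the coarse trajectory and the later pose refinement.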

Published

2026-03-14

How to Cite

Song, W., Zheng, S., Zhang, X., Jin, X., Hao, A., Hou, F., Hou, X., & Li, S. (2026). IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 2065-2073. https://doi.org/10.1609/aaai.v40i3.37188

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems