IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes
DOI:
https://doi.org/10.1609/aaai.v40i3.37188
Abstract
Generating human motion in complex 3D scenes from text is a challenging task with broad applications. However, existing methods often overlook realistic physical contact, resulting in visually plausible but physically unrealistic motion, e.g., penetration. To alleviate this, we propose IntentMotion, a novel framework that generates human motion in 3D scenes from natural language instructions by explicitly modeling intent. We first introduce the Intention-Guided Contact Field (IGCF), a differentiable voxel-based contact-region representation that explicitly aligns parsed language roles with spatial contact regions through a hierarchical attention mechanism. IGCF is jointly trained with a diffusion-based motion generator, allowing contact predictions to adapt dynamically through gradient feedback. To improve controllability and physical awareness, we further propose an Intention-Aware Diffusion Model (IADM), which decouples high-level semantic planning from low-level contact refinement in a coarse-to-fine process. The optimized contact cues guide the synthesis of a coarse trajectory, which is then refined into detailed pose sequences under IGCF supervision. Experiments on the HUMANISE and LINGO datasets demonstrate that IntentMotion outperforms recent baselines in contact accuracy, semantic alignment, and generalization to unseen scenes.
Published
2026-03-14
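The coarse-to-fine idea in the abstract, where a contact field over scene voxels guides a coarse trajectory before pose refinement, can be illustrated with a minimal sketch. All function names, shapes, and the greedy-walk heuristic below are illustrative assumptions for intuition only; they are not the paper's actual IGCF or IADM implementation.

```python
import numpy as np

# Hypothetical sketch (not the paper's API): a voxel grid scores contact
# likelihood per region, and a coarse trajectory is biased toward the
# highest-scoring voxel before any fine-grained pose refinement.

def contact_field(scene_voxels, intent_weight):
    """Weight each voxel's occupancy score by a language-derived intent weight.

    scene_voxels: (D, H, W) array of values in [0, 1].
    intent_weight: scalar standing in for parsed-language role weighting.
    """
    return scene_voxels * intent_weight

def coarse_trajectory(field, steps=5):
    """Interpolate a straight-line path toward the peak-contact voxel."""
    pos = np.array([0, 0, 0])
    target = np.array(np.unravel_index(np.argmax(field), field.shape))
    path = [pos.copy()]
    for t in range(1, steps + 1):
        # Linear interpolation, rounded to integer voxel coordinates.
        path.append(np.round(pos + (target - pos) * t / steps).astype(int))
    return np.stack(path)

rng = np.random.default_rng(0)
voxels = rng.random((4, 4, 4))
field = contact_field(voxels, intent_weight=1.0)
path = coarse_trajectory(field)
print(path.shape)  # (6, 3): start point plus 5 interpolated waypoints
print(path[-1])    # endpoint is the argmax voxel of the contact field
```

In the actual framework, this coarse path would then be refined into detailed pose sequences under IGCF supervision rather than used directly.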
How to Cite
Song, W., Zheng, S., Zhang, X., Jin, X., Hao, A., Hou, F., Hou, X., & Li, S. (2026). IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 2065-2073. https://doi.org/10.1609/aaai.v40i3.37188
Section
AAAI Technical Track on Cognitive Modeling & Cognitive Systems