IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes

Authors

  • Wenfeng Song College of Computer Science, Beijing Information Science and Technology University
  • Shi Zheng College of Computer Science, Beijing Information Science and Technology University
  • Xinyu Zhang State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
  • Xingliang Jin School of Computer Science and Technology, East China Normal University
  • Aimin Hao State Key Laboratory of Virtual Reality Technology and Systems, Beihang University
  • Fei Hou Key Laboratory of System Software (CAS) and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China
  • Xia Hou College of Computer Science, Beijing Information Science and Technology University
  • Shuai Li State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Zhongguancun Laboratory, China

DOI:

https://doi.org/10.1609/aaai.v40i3.37188

Abstract

Generating human motion in complex 3D scenes from text is a challenging task with broad applications. However, existing methods often overlook realistic physical contact, resulting in visually plausible but physically unrealistic motion, e.g., penetration. To alleviate this, we propose IntentMotion, a novel framework that generates human motion in 3D scenes from natural language instructions by explicitly modeling intent. We first introduce the Intention-Guided Contact Field (IGCF), a differentiable voxel-based representation of contact regions that explicitly aligns parsed language roles with spatial contact regions through a hierarchical attention mechanism. IGCF is jointly trained with a diffusion-based motion generator, allowing contact predictions to adapt dynamically through gradient feedback. To improve controllability and physical plausibility, we further propose an Intention-Aware Diffusion Model (IADM), which decouples high-level semantic planning from low-level contact refinement in a coarse-to-fine process: the optimized contact cues guide the synthesis of a coarse trajectory, and detailed pose sequences are then refined under IGCF supervision. Experiments on the HUMANISE and LINGO datasets demonstrate that IntentMotion outperforms recent baselines in contact accuracy, semantic alignment, and generalization to unseen scenes.
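The paper's actual models are not reproduced here; the following toy sketch (all names and numbers hypothetical, NumPy standing in for the learned components) only illustrates the general idea the abstract describes: a voxelized contact field scores positions in the scene, and its gradient feeds back to nudge a coarse trajectory toward the predicted contact region, analogous to refining poses under IGCF supervision.

```python
import numpy as np

def make_contact_field(shape=(16, 16, 16), target=(8, 8, 2)):
    """Toy 'contact field': value increases toward the desired contact voxel.

    Here it is just the negative Euclidean distance to a single target voxel;
    in the paper this field would be predicted from language and scene geometry.
    """
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"),
                    axis=-1)
    return -np.linalg.norm(grid - np.array(target), axis=-1)

def refine_trajectory(traj, field, steps=60, lr=0.5):
    """Gradient-ascent refinement of coarse waypoints.

    Samples the field's finite-difference gradient (np.gradient) at each
    waypoint's nearest voxel and steps uphill, a crude stand-in for the
    differentiable contact feedback in the coarse-to-fine process.
    """
    grads = np.gradient(field)  # one gradient array per spatial axis
    traj = traj.astype(float).copy()
    hi = np.array(field.shape) - 1
    for _ in range(steps):
        idx = np.clip(np.round(traj).astype(int), 0, hi)
        g = np.stack([grads[a][tuple(idx.T)] for a in range(3)], axis=-1)
        traj += lr * g  # move each waypoint toward higher contact score
    return traj

field = make_contact_field()
coarse = np.array([[2.0, 3.0, 10.0], [12.0, 13.0, 14.0]])  # coarse waypoints
refined = refine_trajectory(coarse, field)
# Both waypoints drift toward the contact target voxel at (8, 8, 2).
```

The design point being illustrated: because the field is a dense voxel grid, its gradient is available everywhere in the scene, so the same signal can supervise both the coarse trajectory and the later pose refinement.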

Published

2026-03-14

How to Cite

Song, W., Zheng, S., Zhang, X., Jin, X., Hao, A., Hou, F., Hou, X., & Li, S. (2026). IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes. Proceedings of the AAAI Conference on Artificial Intelligence, 40(3), 2065-2073. https://doi.org/10.1609/aaai.v40i3.37188

Section

AAAI Technical Track on Cognitive Modeling & Cognitive Systems