Multiple Human Motion Understanding

Authors

  • Lei Li, VitaSight; University of Washington
  • Sen Jia, VitaSight; University of Washington
  • Jenq-Neng Hwang, University of Washington, Seattle

DOI:

https://doi.org/10.1609/aaai.v40i8.37556

Abstract

We introduce LLaMMo (Large Language and Multi-Person Motion Assistant), the first instruction-tuned multimodal framework tailored for multi-human motion analysis. LLaMMo incorporates a novel human-centric, social-temporal learner that models and fuses both intra-person dynamics and inter-person dependencies, yielding robust, context-aware representations of complex group behaviors while maintaining low computational overhead. To support LLaMMo, we construct LLaVerse, a large-scale dataset with fine-grained manual annotations covering diverse multi-person activities, spanning daily social interactions and professional team sports. Built on top of LLaVerse, we also propose LLaMI-Bench, a dedicated benchmark for evaluating multi-human behavior understanding across motion and video modalities. Extensive experiments demonstrate that LLaMMo consistently outperforms baselines in understanding multi-person interactions under low-latency settings, with notable gains in both social and sport-specific contexts.
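
To make the factorized intra-/inter-person modeling described in the abstract concrete, the sketch below shows one plausible way to build such a social-temporal block in PyTorch: temporal self-attention within each person's motion-token sequence, followed by attention across persons at each timestep, fused through residual connections. The class name, the (batch, persons, time, dim) tensor layout, and this particular attention factorization are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of a human-centric, social-temporal learner.
# NOT the authors' implementation; shapes and module names are assumed.
import torch
import torch.nn as nn

class SocialTemporalLearner(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-person dynamics: attention over each person's own timeline.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-person dependencies: attention across people per timestep.
        self.social_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, persons P, time T, dim D) per-person motion tokens.
        B, P, T, D = x.shape
        # Intra-person pass: fold persons into the batch, attend over time.
        t = x.reshape(B * P, T, D)
        q = self.norm1(t)
        t = t + self.temporal_attn(q, q, q)[0]
        x = t.reshape(B, P, T, D)
        # Inter-person pass: fold time into the batch, attend over persons.
        s = x.transpose(1, 2).reshape(B * T, P, D)
        q = self.norm2(s)
        s = s + self.social_attn(q, q, q)[0]
        # Restore the (B, P, T, D) layout.
        return s.reshape(B, T, P, D).transpose(1, 2)

tokens = torch.randn(2, 3, 16, 256)  # 2 clips, 3 people, 16 frames
out = SocialTemporalLearner()(tokens)
print(out.shape)  # torch.Size([2, 3, 16, 256])
```

Factorizing attention this way keeps cost at O(T^2) per person plus O(P^2) per frame, rather than O((PT)^2) for joint attention over all tokens, which is consistent with the abstract's low-overhead claim.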

Published

2026-03-14

How to Cite

Li, L., Jia, S., & Hwang, J.-N. (2026). Multiple Human Motion Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6297–6305. https://doi.org/10.1609/aaai.v40i8.37556

Section

AAAI Technical Track on Computer Vision V