Multiple Human Motion Understanding
DOI: https://doi.org/10.1609/aaai.v40i8.37556

Abstract
We introduce LLaMMo (Large Language and Multi-Person Motion Assistant), the first instruction-tuned multimodal framework tailored for multi-human motion analysis. LLaMMo incorporates a novel human-centric, social-temporal learner that models and fuses both intra-person dynamics and inter-person dependencies, yielding robust, context-aware representations of complex group behaviors while keeping computational overhead low. To support LLaMMo, we construct LLaVerse, a large-scale dataset with fine-grained manual annotations covering diverse multi-person activities, from daily social interactions to professional team sports. Built on top of LLaVerse, we also propose LLaMI-Bench, a dedicated benchmark for evaluating multi-human behavior understanding across motion and video modalities. Extensive experiments demonstrate that LLaMMo consistently outperforms baselines in understanding multi-person interactions under low-latency settings, with notable gains in both social and sport-specific contexts.

Published
2026-03-14
How to Cite
Li, L., Jia, S., & Hwang, J.-N. (2026). Multiple Human Motion Understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6297–6305. https://doi.org/10.1609/aaai.v40i8.37556
Section
AAAI Technical Track on Computer Vision V