A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis

Authors

  • Esteve Valls Mascaró, Technische Universität Wien (TU Wien), Vienna, Austria
  • Hyemin Ahn, Ulsan National Institute of Science and Technology (UNIST)
  • Dongheui Lee, Technische Universität Wien (TU Wien), Vienna, Austria; German Aerospace Center (DLR)

DOI:

https://doi.org/10.1609/aaai.v38i6.28333

Keywords:

CV: Motion & Tracking, CV: Vision for Robotics & Autonomous Driving, HAI: Understanding People, Theories, Concepts and Methods, ML: Deep Generative Models & Autoencoders

Abstract

The synthesis of human motion has traditionally been addressed through task-dependent models that focus on specific challenges, such as predicting future motions or filling in intermediate poses conditioned on known key-poses. In this paper, we present a novel task-independent model called UNIMASK-M, which can effectively address these challenges using a unified architecture. Our model achieves performance comparable to or better than the state of the art in each field. Inspired by Vision Transformers (ViTs), our UNIMASK-M model decomposes a human pose into body parts to leverage the spatio-temporal relationships existing in human motion. Moreover, we reformulate various pose-conditioned motion synthesis tasks as a single reconstruction problem, with the task determined by the masking pattern given as input. By explicitly informing the model which joints are masked, UNIMASK-M becomes more robust to occlusions. Experimental results show that our model successfully forecasts human motion on the Human3.6M dataset while achieving state-of-the-art results in motion inbetweening on the LaFAN1 dataset for long transition periods.
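To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of the idea: patchify each pose into body-part tokens, replace masked tokens with a learned embedding, and reconstruct the full sequence with a Transformer. This is not the authors' implementation; the five-part body grouping, module names, and dimensions are illustrative assumptions, and positional encodings are omitted for brevity.

# Minimal sketch (not the authors' code) of masked reconstruction over
# body-part "patch" tokens. All names and sizes are assumptions.
import torch
import torch.nn as nn

class BodyPartMaskedAutoencoder(nn.Module):
    def __init__(self, joints_per_part=(5, 4, 4, 4, 5), joint_dim=3,
                 d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.parts = joints_per_part  # e.g. torso, two arms, two legs (illustrative)
        # One linear "patch embedding" per body part, since parts differ in joint count.
        self.embed = nn.ModuleList(nn.Linear(j * joint_dim, d_model) for j in joints_per_part)
        # Learned token that explicitly marks masked (unknown) body parts.
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decode = nn.ModuleList(nn.Linear(d_model, j * joint_dim) for j in joints_per_part)

    def forward(self, motion, mask):
        # motion: (B, T, J, 3) joint positions; mask: (B, T, P) booleans,
        # True where a (frame, body-part) token is unknown and must be synthesized.
        B, T, J, C = motion.shape
        tokens = []
        for p, part in enumerate(torch.split(motion, self.parts, dim=2)):
            # Patchify: one token per (frame, body part).
            tokens.append(self.embed[p](part.reshape(B, T, -1)))
        x = torch.stack(tokens, dim=2)  # (B, T, P, d_model)
        # Replace masked tokens so the model is told which parts are missing.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x.reshape(B, T * len(self.parts), -1))
        x = x.reshape(B, T, len(self.parts), -1)
        out = [self.decode[p](x[:, :, p]).reshape(B, T, self.parts[p], C)
               for p in range(len(self.parts))]
        return torch.cat(out, dim=2)  # reconstructed motion (B, T, J, 3)

Under this formulation, the masking pattern selects the task: masking the tokens of all future frames yields motion forecasting, while masking an intermediate span of frames yields motion inbetweening.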

Published

2024-03-24

How to Cite

Valls Mascaró, E., Ahn, H., & Lee, D. (2024). A Unified Masked Autoencoder with Patchified Skeletons for Motion Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 5261-5269. https://doi.org/10.1609/aaai.v38i6.28333

Issue

Vol. 38 No. 6 (2024)

Section

AAAI Technical Track on Computer Vision V