Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs

Authors

  • Jingfei Xia (Southern University of Science and Technology; The Chinese University of Hong Kong)
  • Mingchen Zhuge (Southern University of Science and Technology; AI Initiative, King Abdullah University of Science and Technology)
  • Tiantian Geng (Southern University of Science and Technology)
  • Shun Fan (Southern University of Science and Technology)
  • Yuantai Wei (Southern University of Science and Technology)
  • Zhenyu He (Harbin Institute of Technology (Shenzhen))
  • Feng Zheng (Southern University of Science and Technology)

DOI:

https://doi.org/10.1609/aaai.v37i3.25392

Keywords:

CV: Multi-modal Vision, CV: Applications, CV: Video Understanding & Activity Analysis

Abstract

Figure skating scoring is challenging because it requires judging players’ technical moves as well as their coordination with the background music. Most learning-based methods struggle for two reasons: 1) each move in figure skating changes quickly, so simply applying traditional frame sampling loses a great deal of valuable information, especially in videos lasting 3 to 5 minutes; 2) prior methods have rarely considered the critical audio-visual relationship in their models. For these reasons, we introduce a novel architecture named Skating-Mixer. It extends the MLP framework to a multimodal setting and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we collected a high-quality audio-visual dataset, FS1000, which contains over 1000 videos covering 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show that the proposed method achieves state-of-the-art results on all major metrics on the public Fis-V dataset and on our FS1000 dataset. In addition, we include an analysis applying our method to recent competitions at the Beijing 2022 Winter Olympic Games, demonstrating that our method has strong applicability.
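
As a rough illustration of the first point, the arithmetic below estimates how sparse uniform frame sampling becomes on videos of this length. It is a sketch under stated assumptions, not part of the proposed method: the frame rate (25 fps) and the sampling budget (32 frames per video) are hypothetical values chosen only for illustration, and the function names are invented here.

```python
def sampling_gap_seconds(duration_s: float, num_sampled_frames: int) -> float:
    """Average time between consecutive uniformly sampled frames."""
    if num_sampled_frames <= 0:
        raise ValueError("num_sampled_frames must be positive")
    return duration_s / num_sampled_frames


def sampled_fraction(duration_s: float, fps: float, num_sampled_frames: int) -> float:
    """Fraction of all frames in the video that the sampler keeps."""
    total_frames = duration_s * fps
    if total_frames <= 0:
        raise ValueError("duration_s and fps must be positive")
    return min(1.0, num_sampled_frames / total_frames)


if __name__ == "__main__":
    FPS = 25      # assumed frame rate (hypothetical)
    BUDGET = 32   # assumed number of sampled frames per video (hypothetical)

    for minutes in (3, 5):
        duration = minutes * 60
        gap = sampling_gap_seconds(duration, BUDGET)
        frac = sampled_fraction(duration, FPS, BUDGET)
        print(f"{minutes} min video: one frame every {gap:.3f} s, "
              f"{frac:.2%} of frames kept")
```

Under these assumed values, consecutive sampled frames are several seconds apart, so a quickly changing move can fall entirely between two samples.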

Published

2023-06-26

How to Cite

Xia, J., Zhuge, M., Geng, T., Fan, S., Wei, Y., He, Z., & Zheng, F. (2023). Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs. Proceedings of the AAAI Conference on Artificial Intelligence, 37(3), 2901-2909. https://doi.org/10.1609/aaai.v37i3.25392

Section

AAAI Technical Track on Computer Vision III