Action-Agnostic Point-Level Supervision for Temporal Action Detection

Authors

  • Shuhei M. Yoshida — NEC Corporation
  • Takashi Shibata — NEC Corporation
  • Makoto Terao — NEC Corporation
  • Takayuki Okatani — Tohoku University; RIKEN Center for Advanced Intelligence Project
  • Masashi Sugiyama — RIKEN Center for Advanced Intelligence Project; The University of Tokyo

DOI:

https://doi.org/10.1609/aaai.v39i9.33037

Abstract

We propose action-agnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike conventional point-level supervision, which requires annotators to search for every action instance in an untrimmed video, AAPL supervision selects the frames to annotate without human intervention. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on a variety of datasets (THUMOS'14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between annotation cost and detection performance.
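To make the sampling step of the abstract concrete, the snippet below sketches one way to select a small, diverse subset of frames without human intervention: farthest-point sampling over per-frame feature vectors. This is an illustrative heuristic only, assuming precomputed frame features; it is not claimed to be the paper's exact sampling procedure.

```python
import numpy as np


def sample_frames(features: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Pick k diverse frame indices via farthest-point sampling.

    features: (n_frames, feat_dim) array of per-frame features
              (e.g. from a pretrained video backbone -- an assumption here).
    Returns sorted frame indices to present to annotators.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    k = min(k, n)
    # Start from a random frame, then greedily add the frame farthest
    # from all frames selected so far.
    selected = [int(rng.integers(n))]
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return sorted(selected)


# Toy usage: 100 frames with 16-dim features; annotate only 5 of them.
feats = np.random.default_rng(42).normal(size=(100, 16))
picked = sample_frames(feats, k=5)
print(picked)  # 5 distinct frame indices in [0, 100)
```

The selected frames would then be shown to annotators, who label each with its action category (or background), yielding the AAPL labels used for training.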

Published

2025-04-11

How to Cite

Yoshida, S. M., Shibata, T., Terao, M., Okatani, T., & Sugiyama, M. (2025). Action-Agnostic Point-Level Supervision for Temporal Action Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9571–9579. https://doi.org/10.1609/aaai.v39i9.33037

Issue

Section

AAAI Technical Track on Computer Vision VIII