Further Understanding Videos through Adverbs: A New Video Task

Bo Pang; Kaiwen Zha; Yifan Zhang; Cewu Lu

doi:10.1609/aaai.v34i07.6855

Authors

Bo Pang, Shanghai Jiao Tong University
Kaiwen Zha, Shanghai Jiao Tong University
Yifan Zhang, Shanghai Jiao Tong University
Cewu Lu, Shanghai Jiao Tong University

Abstract

Video understanding is a research hotspot in computer vision, and significant progress has recently been made on video action recognition. However, the semantic information contained in actions is not rich enough to build powerful video understanding models. This paper first introduces a new video semantic, the Behavior Adverb (BA), which is more expressive and more difficult, covering subtle and inherent characteristics of human action behavior. To exhaustively decode this semantic, we construct the Videos with Action and Adverb Dataset (VAAD), a large-scale dataset with a semantically complete set of BAs. The dataset will be released to the public with this paper. We benchmark several representative video understanding methods (originally designed for action recognition) on both BA recognition and action recognition. The results show that the BA recognition task is more challenging than conventional action recognition. Accordingly, we propose the BA Understanding Network (BAUN) to address this problem, and the experiments show that BAUN is better suited to BA recognition (11% better than I3D). Furthermore, we find that the two semantics (action and BA) can reinforce each other, improving action recognition results by 3.4% on average on three standard action recognition datasets (UCF-101, HMDB-51, Kinetics).
