DeFB: Decomposed Feature Learning for Real-Time Multi-Person Eyeblink Detection in Untrimmed In-the-Wild Videos

Jinfang Gan; Wenzheng Zeng; Yang Xiao; Xintao Zhang; Chaoyang Zheng; Ran Zhao; Ran Wang; Min Du; Zhiguo Cao

doi:10.1609/aaai.v40i5.37411

Authors

Jinfang Gan Huazhong University of Science and Technology
Wenzheng Zeng Huazhong University of Science and Technology; National University of Singapore
Yang Xiao Huazhong University of Science and Technology
Xintao Zhang Huazhong University of Science and Technology
Chaoyang Zheng Huazhong University of Science and Technology
Ran Zhao Huazhong University of Science and Technology
Ran Wang Huazhong University of Science and Technology
Min Du ByteDance
Zhiguo Cao Huazhong University of Science and Technology

Abstract

Multi-person eyeblink detection in untrimmed in-the-wild videos is a recently emerged and challenging task. Because eyeblinks are far more fine-grained in space and time than general actions, we empirically find that general action detectors, though effective in general domains, struggle with this task (i.e., Blink-AP < 2%). Specialized eyeblink detection methods alleviate this problem through fine-grained spatio-temporal operations. The SOTA method proposes a unified model that combines instance-aware face localization and eyeblink detection through joint multi-task learning and feature sharing. While effective, it exhibits two critical limitations that may contribute to its unsatisfactory performance (i.e., Blink-AP = 10.11%): (1) Face localization and eyeblink detection require distinct spatio-temporal feature granularities, making joint modeling in a unified feature space suboptimal. (2) Training for the eyeblink task can be strongly affected by unstable face-eye feature learning under the joint training paradigm. To address these limitations, we propose DeFB, a decomposed feature learning paradigm with favorable effectiveness and efficiency: (1) We model faces and eyes in granularity-specific feature spaces, which enhances fine-grained perception while reducing computational cost compared to a unified feature space. (2) To mitigate face-eye feature learning instability, we adopt an asynchronous learning mechanism in which eye feature learning refines well-trained coarse face features, with shared queries acting as a bridge between stages to retain the efficient feature sharing of existing unified models. Compared with the SOTA method, DeFB more than doubles the performance (Blink-AP: 24.65% vs. 10.11%) while boosting efficiency by nearly 35%. DeFB can also be integrated as a plug-in to substantially augment the eyeblink detection capabilities of general action detectors.
