EAC-Net: Efficient and Accurate Convolutional Network for Video Recognition
Research for computation-efficient video understanding is of great importance to real-world deployment. However, most of high-performance approaches are too computationally expensive for practical application. Though several efficiency oriented works are proposed, they inevitably suffer degradation of performance in terms of accuracy. In this paper, we explore a new architecture EAC-Net, enjoying both high efficiency and high performance. Specifically, we propose Motion Guided Temporal Encode (MGTE) blocks for temporal modeling, which exploits motion information and temporal relations among neighbor frames. EAC-Net is then constructed by inserting multiple MGTE blocks to common 2D CNNs. Furthermore, we proposed Atrous Temporal Encode (ATE) block for capturing long-term temporal relations at multiple time scales for further enhancing representation power of EAC-Net. Through experiments on Kinetics, our EAC-Nets achieved better results than TSM models with fewer FLOPs. With same 2D backbones, EAC-Nets outperformed Non-Local I3D counterparts by achieving higher accuracy with only about 7× fewer FLOPs. On Something-Something-V1 dataset, EAC-Net achieved 47% top-1 accuracy with 70G FLOPs which is 0.9% more accurate and 8× less FLOPs than that of Non-Local I3D+GCN.