CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Authors

  • Yang Fu University of Illinois at Urbana-Champaign
  • Linjie Yang ByteDance Inc.
  • Ding Liu ByteDance Inc.
  • Thomas S. Huang University of Illinois at Urbana-Champaign
  • Humphrey Shi University of Illinois at Urbana-Champaign & University of Oregon

DOI:

https://doi.org/10.1609/aaai.v35i2.16225

Keywords:

Motion & Tracking, Video Understanding & Activity Analysis

Abstract

Video instance segmentation is a complex task in which we need to detect, segment, and track each object in a given video. Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects, and they suffer in the video scenario due to several distinct challenges such as motion blur and drastic appearance change. To eliminate ambiguities introduced by only using single-frame features, we propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both the frame level and object level with temporal and spatial context information. The aggregation process is carefully designed with a new attention mechanism which significantly increases the discriminative power of the learned features. We further improve the tracking capability of our model through a siamese design by incorporating both feature similarities and spatial similarities. Experiments conducted on the YouTube-VIS dataset validate the effectiveness of the proposed CompFeat.
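As a rough illustration of the two ideas the abstract names (and not the authors' actual implementation), attention-weighted temporal feature aggregation and a siamese-style association score that mixes feature similarity with spatial similarity can be sketched as follows; the function names, the dot-product attention, and the mixing weight `alpha` are all assumptions for this sketch:

```python
import numpy as np

def aggregate_features(ref, neighbors):
    """Attention-weighted aggregation of neighboring-frame features.

    ref: (D,) feature of the current frame; neighbors: (T, D) features
    from nearby frames. Weights come from a softmax over dot-product
    similarities, so frames resembling the current one contribute more.
    (Illustrative only; CompFeat's attention mechanism is more elaborate.)
    """
    sims = neighbors @ ref                      # (T,) similarity scores
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()                    # softmax attention weights
    return weights @ neighbors                  # (D,) aggregated feature

def tracking_score(feat_a, feat_b, box_a, box_b, alpha=0.5):
    """Association score combining cosine feature similarity with spatial
    IoU similarity, in the spirit of a siamese tracking head.
    alpha is a hypothetical mixing weight, not taken from the paper."""
    cos = feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box_a) + area(box_b) - inter)
    return alpha * cos + (1.0 - alpha) * iou
```

In this sketch, objects in adjacent frames would be matched by maximizing `tracking_score`, so that a detection blurred in the current frame can still be associated via its spatial overlap and refined via features aggregated from clearer neighboring frames.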

Published

2021-05-18

How to Cite

Fu, Y., Yang, L., Liu, D., Huang, T. S., & Shi, H. (2021). CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(2), 1361-1369. https://doi.org/10.1609/aaai.v35i2.16225

Section

AAAI Technical Track on Computer Vision I