Robust and Consistent Online Video Instance Segmentation via Instance Mask Propagation

Miran Heo; Seoung Wug Oh; Seon Joo Kim; Joon-Young Lee

doi:10.1609/aaai.v39i4.32361

Authors

Miran Heo, Yonsei University
Seoung Wug Oh, Adobe Research
Seon Joo Kim, Yonsei University
Joon-Young Lee, Adobe Research

Abstract

Recent online Video Instance Segmentation (VIS) methods show notable performance improvements across benchmarks. However, the leading methods in the tracking-by-detection paradigm often produce temporally inconsistent predictions at both the instance level and the pixel level, leading to visually unsatisfactory results. To address these challenges, we propose RoCoVIS, a simple yet effective approach that integrates segmentation and tracking to provide consistent online VIS. Our approach is an end-to-end sequential learning framework in which object queries are propagated through mask predictions, improving the accuracy of temporal instance mapping at the pixel level. Additionally, we propose a new label assignment criterion in harmony with our approach. We also examine the limitations and challenges of the current standard evaluation protocol (AP) and suggest adopting two additional metrics, Tube-Boundary AP and AP_Pool. RoCoVIS demonstrates superior performance on challenging VIS benchmarks with a Swin-L backbone and shows competitive results with a ResNet-50 backbone. When evaluated with Tube-Boundary AP and AP_Pool, which measure mask accuracy and consistency, RoCoVIS outperforms its counterpart, GenVIS, on HQ-YTVIS and VIPSeg.
