Robust and Consistent Online Video Instance Segmentation via Instance Mask Propagation

Authors

  • Miran Heo, Yonsei University
  • Seoung Wug Oh, Adobe Research
  • Seon Joo Kim, Yonsei University
  • Joon-Young Lee, Adobe Research

DOI:

https://doi.org/10.1609/aaai.v39i4.32361

Abstract

Recent advancements in online Video Instance Segmentation (VIS) methods show notable performance improvements across benchmarks. However, the leading methods in the tracking-by-detection paradigm often produce temporally inconsistent predictions at both the instance level and the pixel level, leading to visually unsatisfactory outcomes. To address these challenges, we propose RoCoVIS, a simple yet effective approach that integrates segmentation and tracking to provide consistent online VIS. Our approach is an end-to-end sequential learning scheme in which object queries are propagated through mask predictions, improving the accuracy of temporal instance mapping at the pixel level. Additionally, we propose a new label assignment criterion in harmony with our approach. We also examine the limitations and challenges presented by the current standard evaluation protocol (AP) and suggest adopting additional metrics, Tube-Boundary AP and AP_Pool. RoCoVIS demonstrates superior performance on challenging VIS benchmarks with a Swin-L backbone and shows competitive results when employing a ResNet-50 backbone. When Tube-Boundary AP and AP_Pool are used to measure mask accuracy and consistency, RoCoVIS outperforms its counterpart, GenVIS, on HQ-YTVIS and VIPSeg.
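
The abstract does not define the metrics it names. As background only, the standard VIS AP is commonly built on a spatio-temporal IoU between a predicted and a ground-truth mask tube, where intersections and unions are summed over all frames before dividing. The sketch below illustrates that quantity under stated assumptions: masks are nested lists of 0/1, `None` marks a frame where the instance is absent, and the name `tube_iou` is hypothetical. It is not the paper's Tube-Boundary AP or AP_Pool, and it is not the authors' evaluation code.

```python
from typing import List, Optional

Mask = List[List[int]]  # binary mask for one frame, given as rows of 0/1


def tube_iou(pred: List[Optional[Mask]], gt: List[Optional[Mask]]) -> float:
    """Spatio-temporal IoU between two mask tubes covering the same frames.

    A None entry means the instance is absent in that frame and is treated
    as an empty mask. Illustrative helper, not taken from the paper.
    """
    if len(pred) != len(gt):
        raise ValueError("tubes must cover the same number of frames")

    inter = 0
    union = 0
    for p, g in zip(pred, gt):
        if p is None and g is None:
            continue  # instance absent in both tubes: nothing to count
        if p is None or g is None:
            # Only one tube has a mask here, so all of its pixels add to the union.
            present = g if p is None else p
            union += sum(sum(row) for row in present)
            continue
        for p_row, g_row in zip(p, g):
            for a, b in zip(p_row, g_row):
                inter += 1 if (a and b) else 0
                union += 1 if (a or b) else 0

    # Two entirely empty tubes have no overlap to measure; return 0.0 by convention.
    return inter / union if union else 0.0


# Two-frame example: the prediction keeps a mask in a frame where the
# ground-truth instance is absent.
pred_tube = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gt_tube = [[[1, 0], [0, 0]], None]
score = tube_iou(pred_tube, gt_tube)
```

Because the sums run over the whole tube, a prediction that is accurate in most frames but wrong in a few is penalized only in proportion to the pixels involved.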

Published

2025-04-11

How to Cite

Heo, M., Oh, S. W., Kim, S. J., & Lee, J.-Y. (2025). Robust and Consistent Online Video Instance Segmentation via Instance Mask Propagation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(4), 3483–3490. https://doi.org/10.1609/aaai.v39i4.32361

Section

AAAI Technical Track on Computer Vision III