Beyond Pixel and Object: Part Feature as Reference for Few-Shot Video Object Segmentation

Authors

  • Naisong Luo MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
  • Guoxin Xiong MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
  • Tianzhu Zhang MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i6.32626

Abstract

Few-Shot Video Object Segmentation (FSVOS) aims to accurately segment video sequences given only a limited number of annotated support images. In this work, we analyze the deficiencies inherent in using object prototypes and pixel features as references in previous methods. We then show that part features, which can adapt to appearance variations and resist noise, are advantageous as representative reference features for aligning support images and query videos. Therefore, we propose a Part Agent Learning Network (PALN) that leverages part features in two ways. First, we employ the Optimal Transport algorithm with an equal-partition constraint so that part agents can adaptively divide support objects into diverse parts. Second, we design a dedicated cache mechanism that learns temporal part agents as a lightweight historical target representation to exploit temporal consistency. With the aid of these learned part agents, PALN effectively achieves support-query alignment and temporal alignment for accurate segmentation of query videos. Extensive experimental results on two challenging benchmarks demonstrate that our method performs favorably against state-of-the-art FSVOS methods.
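As a rough illustration of the equal-partition idea described above, the sketch below uses entropic Optimal Transport (Sinkhorn iterations) with uniform marginals to softly assign N support pixel features to K part agents, so each part receives an equal share of the mass. This is a minimal, assumed implementation for intuition only; the function name, cost choice (negative cosine similarity), and hyperparameters are illustrative and are not taken from the paper.

```python
import numpy as np

def sinkhorn_equal_partition(features, agents, eps=0.05, n_iters=200):
    """Softly assign N pixel features to K part agents via entropic OT.

    Uniform row/column marginals enforce the equal-partition constraint:
    every pixel distributes one unit of mass, and every part agent
    receives (approximately) N / K units in total.
    Illustrative sketch; not the paper's exact formulation.
    """
    # Cost matrix: negative cosine similarity between features and agents.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    a = agents / np.linalg.norm(agents, axis=1, keepdims=True)
    cost = -f @ a.T                          # shape (N, K)

    G = np.exp(-cost / eps)                  # Gibbs kernel
    N, K = G.shape
    r = np.full(N, 1.0 / N)                  # each pixel carries equal mass
    c = np.full(K, 1.0 / K)                  # each part receives equal mass

    u = np.ones(N)
    for _ in range(n_iters):                 # alternating marginal scaling
        v = c / (G.T @ u)
        u = r / (G @ v)

    T = u[:, None] * G * v[None, :]          # transport plan
    return T * N                             # rescale so each row sums to 1
```

Reading the rows of the returned plan as soft part-assignment weights, each part agent can then be updated as the weighted average of its assigned pixel features.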

Published

2025-04-11

How to Cite

Luo, N., Xiong, G., & Zhang, T. (2025). Beyond Pixel and Object: Part Feature as Reference for Few-Shot Video Object Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5865-5873. https://doi.org/10.1609/aaai.v39i6.32626

Section

AAAI Technical Track on Computer Vision V