You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation

Authors

  • Dezhuang Li, Dalian University of Technology
  • Ruoqi Li, Dalian University of Technology
  • Lijun Wang, Dalian University of Technology
  • Yifan Wang, Dalian University of Technology
  • Jinqing Qi, Dalian University of Technology
  • Lu Zhang, Dalian University of Technology
  • Ting Liu, Meitu Inc.
  • Qingquan Xu, Meitu Inc.
  • Huchuan Lu, Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v36i2.20017

Keywords:

Computer Vision (CV)

Abstract

We present YOFO (You Only inFer Once), a new paradigm for referring video object segmentation (RVOS) that operates in a one-stage manner. Our key insight is that the language descriptor should serve as target-specific guidance to identify the target object, whereas direct fusion of image and language features increases feature complexity and may therefore be sub-optimal for RVOS. To this end, we propose a meta-transfer module, trained in a learning-to-learn fashion, that transfers the target-specific information from the language domain to the image domain while discarding the uncorrelated complex variations of the language description. To bridge the gap between the image and language domains, we develop a multi-scale cross-modal feature mining block that aggregates all the essential features required by RVOS from both domains and generates regression labels for the meta-transfer module. The whole system can be trained in an end-to-end manner and shows competitive performance against state-of-the-art two-stage approaches.
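This page carries no code, but the core idea stated above (language as target-specific guidance rather than a directly fused feature) can be illustrated with a minimal, hypothetical PyTorch sketch: a sentence embedding is regressed into a per-target depthwise kernel that is convolved with the image features. All names, dimensions (lang_dim, vis_dim), and the kernel-regression design here are assumptions for illustration, not the authors' implementation; the learning-to-learn training and the multi-scale cross-modal feature mining block are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MetaTransferSketch(nn.Module):
        # Hypothetical sketch: regress a depthwise kernel from the
        # sentence embedding and convolve it with the image features,
        # so the language guides the visual features instead of being
        # concatenated or fused with them directly.
        def __init__(self, lang_dim=300, vis_dim=256, ksize=1):
            super().__init__()
            self.vis_dim, self.ksize = vis_dim, ksize
            self.kernel_gen = nn.Linear(lang_dim, vis_dim * ksize * ksize)

        def forward(self, vis_feat, lang_emb):
            # vis_feat: (B, C, H, W) image features; lang_emb: (B, lang_dim)
            b, c, h, w = vis_feat.shape
            # One depthwise kernel per (sample, channel) pair.
            weight = self.kernel_gen(lang_emb).view(b * c, 1, self.ksize, self.ksize)
            # Fold the batch into the channel axis so grouped convolution
            # applies each sample's own kernel to its own features.
            out = F.conv2d(vis_feat.view(1, b * c, h, w), weight,
                           padding=self.ksize // 2, groups=b * c)
            return out.view(b, c, h, w)  # language-conditioned response map

    # Toy usage with made-up shapes:
    module = MetaTransferSketch()
    response = module(torch.randn(2, 256, 64, 64), torch.randn(2, 300))
    print(response.shape)  # torch.Size([2, 256, 64, 64])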

Published

2022-06-28

How to Cite

Li, D., Li, R., Wang, L., Wang, Y., Qi, J., Zhang, L., Liu, T., Xu, Q., & Lu, H. (2022). You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2), 1297-1305. https://doi.org/10.1609/aaai.v36i2.20017

Section

AAAI Technical Track on Computer Vision II