MobileInst: Video Instance Segmentation on the Mobile

Renhong Zhang; Tianheng Cheng; Shusheng Yang; Haoyi Jiang; Shuai Zhang; Jiancheng Lyu; Xin Li; Xiaowen Ying; Dashan Gao; Wenyu Liu; Xinggang Wang

doi:10.1609/aaai.v38i7.28555

Authors

Renhong Zhang Huazhong University of Science and Technology
Tianheng Cheng Huazhong University of Science and Technology
Shusheng Yang Huazhong University of Science and Technology
Haoyi Jiang Huazhong University of Science and Technology
Shuai Zhang Qualcomm AI Research
Jiancheng Lyu Qualcomm AI Research
Xin Li Qualcomm AI Research
Xiaowen Ying Qualcomm AI Research
Dashan Gao Qualcomm AI Research
Wenyu Liu Huazhong University of Science and Technology
Xinggang Wang Huazhong University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v38i7.28555

Keywords:

CV: Segmentation, CV: Video Understanding & Activity Analysis

Abstract

Video instance segmentation on mobile devices is an important yet very challenging edge AI problem. It mainly suffers from (1) heavy computation and memory costs for frame-by-frame pixel-level instance perception and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices. First, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and introduces an efficient query-based dual-transformer instance decoder that produces mask kernels, together with a semantic-enhanced mask decoder, to generate instance segmentation for each frame. Second, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects across frames. Further, we propose temporal query passing to enhance the tracking ability of the kernels. We conduct experiments on the COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst, and we evaluate inference latency on a single CPU core of the Snapdragon 778G Mobile Platform without any other acceleration methods. On the COCO dataset, MobileInst achieves 31.2 mask AP at a latency of 433 ms on the mobile CPU, reducing latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP and 30.1 AP on YouTube-VIS 2019 and 2021, respectively.
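The abstract names kernel association as the tracking mechanism but does not say how kernels are compared or matched. The following is a minimal sketch of one plausible reading, assuming each instance is represented by a mask-kernel vector, that similarity is measured by cosine similarity, and that matching is greedy and one-to-one. The function names `cosine` and `associate_kernels` are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate_kernels(prev_kernels, curr_kernels):
    """Greedily match previous-frame kernels to current-frame kernels.

    Assumption: the most similar kernel pair is the same object.
    Returns a dict {previous_index: current_index}.
    """
    # Score every (previous, current) pair, then visit pairs from most to least similar.
    scored = sorted(
        (
            (cosine(p, c), i, j)
            for i, p in enumerate(prev_kernels)
            for j, c in enumerate(curr_kernels)
        ),
        reverse=True,
    )
    matches = {}
    used_curr = set()
    for _, i, j in scored:
        # Each kernel on either side may be matched at most once.
        if i in matches or j in used_curr:
            continue
        matches[i] = j
        used_curr.add(j)
    return matches


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.0, 1.0]]
    curr = [[0.1, 0.9], [0.9, 0.1]]
    print(associate_kernels(prev, curr))
```

In this toy input the two objects swap positions in the kernel list between frames, and the matching recovers the swap. A real system would also need to handle objects that appear or disappear, which this sketch leaves out.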
