Parallel Vertex Diffusion for Unified Visual Grounding
DOI:
https://doi.org/10.1609/aaai.v38i2.27896
Keywords:
CV: Language and Vision, CV: Object Detection & Categorization, CV: Segmentation
Abstract
Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unifying the modeling of object box and contour prediction, and they provide a text-powered interface to a wide range of related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to error accumulation and heavy computation, especially for high-dimensional sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD), which exploits the parallelizability of diffusion models to generate vertexes accurately and efficiently in a parallel and scalable manner. Since coordinates fluctuate greatly, training diffusion models without geometry constraints typically suffers from slow convergence. We therefore complete our PVD with two critical components, i.e., a center anchor mechanism and an angle summation loss, which normalize coordinates and adopt a differentiable geometric descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference between predicted and labeled vertexes. These innovative designs empower our PVD to achieve state-of-the-art performance across various grounding tasks.
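To make the angle summation idea concrete, below is a minimal, hypothetical sketch of the classical angle-summation (winding-number) descriptor from the point-in-polygon problem: summing the signed angles subtended at a query point by consecutive polygon vertices yields roughly 2π when the point lies inside the polygon and roughly 0 when it lies outside, and the expression is differentiable via atan2. This is not the authors' implementation; the function name, tensor shapes, and usage are illustrative assumptions only.

```python
# Hypothetical sketch of a differentiable angle-summation descriptor
# (point-in-polygon winding number); NOT the paper's official code.
import torch

def angle_summation(vertices: torch.Tensor, point: torch.Tensor) -> torch.Tensor:
    """vertices: (N, 2) ordered polygon vertices; point: (2,) query point."""
    d = vertices - point                       # (N, 2) vectors from point to each vertex
    d_next = torch.roll(d, shifts=-1, dims=0)  # vectors to the next vertex (wraps around)
    # Signed angle between consecutive direction vectors via atan2(cross, dot).
    cross = d[:, 0] * d_next[:, 1] - d[:, 1] * d_next[:, 0]
    dot = (d * d_next).sum(dim=-1)
    return torch.atan2(cross, dot).sum()       # ~2*pi inside, ~0 outside

# Example: unit square with one interior and one exterior query point.
square = torch.tensor([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
print(angle_summation(square, torch.tensor([0.5, 0.5])))  # ~6.28 (inside)
print(angle_summation(square, torch.tensor([2.0, 2.0])))  # ~0.0  (outside)
```

Because the descriptor is differentiable with respect to the vertex coordinates, a loss of this kind can, in principle, compare predicted and ground-truth vertex sets as whole polygons rather than point by point.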
Published
2024-03-24
How to Cite
Cheng, Z., Li, K., Jin, P., Li, S., Ji, X., Yuan, L., Liu, C., & Chen, J. (2024). Parallel Vertex Diffusion for Unified Visual Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 1326-1334. https://doi.org/10.1609/aaai.v38i2.27896
Section
AAAI Technical Track on Computer Vision I