TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval

Jialin Tian; Xing Xu; Fumin Shen; Yang Yang; Heng Tao Shen

doi:10.1609/aaai.v36i2.20136

Authors

Jialin Tian University of Electronic Science and Technology of China
Xing Xu University of Electronic Science and Technology of China
Fumin Shen University of Electronic Science and Technology of China
Yang Yang University of Electronic Science and Technology of China
Heng Tao Shen University of Electronic Science and Technology of China Peng Cheng Laboratory

DOI:

https://doi.org/10.1609/aaai.v36i2.20136

Keywords:

Computer Vision (CV), Machine Learning (ML), Domain(s) Of Application (APP)

Abstract

In this paper, we study the zero-shot sketch-based image retrieval (ZS-SBIR) task, which retrieves natural images related to sketch queries from unseen categories. In the literature, convolutional neural networks (CNNs) have become the de-facto standard and they are either trained end-to-end or used to extract pre-trained features for images and sketches. However, CNNs are limited in modeling the global structural information of objects due to the intrinsic locality of convolution operations. To this end, we propose a Transformer-based approach called Three-Way Vision Transformer (TVT) to leverage the ability of Vision Transformer (ViT) to model global contexts due to the global self-attention mechanism. Going beyond simply applying ViT to this task, we propose a token-based strategy of adding fusion and distillation tokens and making them complementary to each other. Specifically, we integrate three ViTs, which are pre-trained on data of each modality, into a three-way pipeline through the processes of distillation and multi-modal hypersphere learning. The distillation process is proposed to supervise fusion ViT (ViT with an extra fusion token) with soft targets from modality-specific ViTs, which prevents fusion ViT from catastrophic forgetting. Furthermore, our method learns a multi-modal hypersphere by performing inter- and intra-modal alignment without loss of uniformity, which aims to bridge the modal gap between modalities of sketch and image and avoid the collapse in dimensions. Extensive experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, demonstrate the superiority of our TVT method over the state-of-the-art ZS-SBIR methods.

TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval

Authors

DOI:

Keywords:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information