Scene-Level Sketch-Based Image Retrieval with Minimal Pairwise Supervision

Authors

  • Ce Ge, Beijing University of Posts and Telecommunications
  • Jingyu Wang, Beijing University of Posts and Telecommunications
  • Qi Qi, Beijing University of Posts and Telecommunications
  • Haifeng Sun, Beijing University of Posts and Telecommunications
  • Tong Xu, Beijing University of Posts and Telecommunications
  • Jianxin Liao, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v37i1.25141

Keywords:

CV: Scene Analysis & Understanding; CV: Applications; CV: Image and Video Retrieval; CV: Multi-modal Vision; CV: Representation Learning for Vision; ML: Graph-based Machine Learning; ML: Multi-Instance/Multi-View Learning; ML: Multimodal Learning; ML: Representation Learning; ML: Transfer, Domain Adaptation, Multi-Task Learning; ML: Unsupervised & Self-Supervised Learning

Abstract

The sketch-based image retrieval (SBIR) task has long been studied at the instance level, where both query sketches and candidate images are assumed to contain only one dominant object. This strong assumption limits its application, especially given the growing popularity of intelligent terminals and human-computer interaction technology. In this work, a more general scene-level SBIR task is explored, where sketches and images can both contain multiple object instances. The new task is extremely challenging for several reasons: (i) scene-level SBIR inherits the sketch-specific difficulties of instance-level SBIR (e.g., sparsity, abstractness, and diversity), (ii) cross-modal similarity is measured between two partially aligned domains (i.e., not all objects in images are drawn in scene sketches), and (iii) besides instance-level visual similarity, a more complex multi-dimensional scene-level feature matching problem is imposed (including appearance, semantics, layout, etc.). To address these challenges, a novel Conditional Graph Autoencoder model is proposed for scene-level sketch-image retrieval. More importantly, the model can be trained with only pairwise supervision, which distinguishes our study from others in that elaborate instance-level annotations (for example, bounding boxes) are no longer required. Extensive experiments confirm the ability of our model to robustly retrieve multiple related objects at the scene level and show that it outperforms strong competitors.
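
To make the phrase "only pairwise supervision" concrete, the following is a minimal sketch of a training signal that needs nothing beyond the knowledge of which sketch goes with which image. It is not the Conditional Graph Autoencoder and not the authors' loss: the hinge triplet loss with in-batch negatives, the cosine similarity, the function names, and the margin value are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_triplet_loss(sketch_embs, image_embs, margin=0.2):
    """Hinge triplet loss with in-batch negatives (illustrative assumption).

    The only supervision is the pairing itself: sketch_embs[i] is assumed
    to match image_embs[i], and every image_embs[j] with j != i acts as a
    negative for sketch i. No bounding boxes or per-object labels appear.
    """
    n = len(sketch_embs)
    total, count = 0.0, 0
    for i in range(n):
        pos = cosine(sketch_embs[i], image_embs[i])
        for j in range(n):
            if j == i:
                continue
            neg = cosine(sketch_embs[i], image_embs[j])
            # Penalize negatives that score within `margin` of the positive.
            total += max(0.0, margin - pos + neg)
            count += 1
    # Average over all (sketch, negative image) combinations in the batch.
    return total / count if count else 0.0
```

When every sketch embedding is already closer to its paired image than to any other image by at least the margin, this loss is zero; mismatched pairings yield a positive loss. How the embeddings themselves are produced (in the paper, by the proposed graph-based model) is outside the scope of this sketch.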

Published

2023-06-26

How to Cite

Ge, C., Wang, J., Qi, Q., Sun, H., Xu, T., & Liao, J. (2023). Scene-Level Sketch-Based Image Retrieval with Minimal Pairwise Supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 650-657. https://doi.org/10.1609/aaai.v37i1.25141

Section

AAAI Technical Track on Computer Vision I