Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions

Authors

  • Prajwal Gatti, Indian Institute of Technology Jodhpur
  • Kshitij Parikh, Indian Institute of Technology Jodhpur
  • Dhriti Prasanna Paul, Indian Institute of Technology Jodhpur
  • Manish Gupta, Microsoft
  • Anand Mishra, Indian Institute of Technology Jodhpur

DOI:

https://doi.org/10.1609/aaai.v38i3.27956

Keywords:

CV: Image and Video Retrieval, ML: Multimodal Learning

Abstract

Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for "numbats." Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., "numbat digging in the ground." In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of "difficult-to-name but easy-to-draw" objects and text describing the object's "difficult-to-sketch but easy-to-verbalize" attributes or interactions with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ~2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNet (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at: https://vl2g.github.io/projects/cstbir.
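To make the composite-query setup concrete, below is a minimal, hypothetical sketch of the general idea the abstract describes: a sketch+text query is fused into a single embedding and trained against scene-image embeddings with a symmetric contrastive (InfoNCE-style) loss. This is not the authors' STNet implementation; the module names, feature dimensions, and the simple concatenation-based fusion are assumptions for illustration only (STNet additionally uses the sketch for object localization and further training objectives described in the paper).

```python
# Illustrative only: a composite sketch+text query encoder trained against an
# image encoder with a symmetric contrastive loss. All names and dimensions
# below are assumptions, not the authors' architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositeQueryEncoder(nn.Module):
    """Fuses a sketch feature and a text feature into one query embedding."""

    def __init__(self, sketch_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        # Placeholder projections; a real system would sit on top of
        # pretrained sketch and text encoders.
        self.sketch_proj = nn.Linear(sketch_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, sketch_feat, text_feat):
        s = self.sketch_proj(sketch_feat)
        t = self.text_proj(text_feat)
        q = self.fuse(torch.cat([s, t], dim=-1))
        return F.normalize(q, dim=-1)


def contrastive_loss(query_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (query, image) pairs lie on the diagonal."""
    logits = query_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    batch = 8
    query_encoder = CompositeQueryEncoder()
    image_proj = nn.Linear(768, 256)  # stand-in for a scene-image encoder

    sketch_feat = torch.randn(batch, 512)  # features of the hand-drawn sketch
    text_feat = torch.randn(batch, 512)    # features of the interaction text
    image_feat = torch.randn(batch, 768)   # features of candidate scene images

    q = query_encoder(sketch_feat, text_feat)
    v = F.normalize(image_proj(image_feat), dim=-1)
    loss = contrastive_loss(q, v)
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

At retrieval time, under this simplified setup, gallery images would be ranked by the cosine similarity between their embeddings and the fused sketch+text query embedding.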

Published

2024-03-24

How to Cite

Gatti, P., Parikh, K., Paul, D. P., Gupta, M., & Mishra, A. (2024). Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions. Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 1869-1877. https://doi.org/10.1609/aaai.v38i3.27956

Section

AAAI Technical Track on Computer Vision II