Learning Token-Based Representation for Image Retrieval

Authors

  • Hui Wu (CAS Key Laboratory of GIPAS, University of Science and Technology of China)
  • Min Wang (Institute of Artificial Intelligence, Hefei Comprehensive National Science Center)
  • Wengang Zhou (CAS Key Laboratory of GIPAS, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center)
  • Yang Hu (CAS Key Laboratory of GIPAS, University of Science and Technology of China)
  • Houqiang Li (CAS Key Laboratory of GIPAS, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center)

DOI:

https://doi.org/10.1609/aaai.v36i3.20173

Keywords:

Computer Vision (CV)

Abstract

In image retrieval, deep local features learned in a data-driven manner have been shown to improve retrieval performance. To realize efficient retrieval on large image databases, some approaches quantize deep local features with a large codebook and match images with an aggregated match kernel. However, these approaches are computationally complex and have a large memory footprint, which limits their ability to jointly perform feature learning and aggregation. To generate compact global representations while maintaining regional matching capability, we propose a unified framework that jointly learns local feature representation and aggregation. In our framework, we first extract local features using CNNs. Then, we design a tokenizer module to aggregate them into a few visual tokens, each corresponding to a specific visual pattern. This helps to remove background noise and capture more discriminative regions in the image. Next, a refinement block is introduced to enhance the visual tokens with self-attention and cross-attention. Finally, the visual tokens are concatenated to generate a compact global representation. The whole framework is trained end-to-end with image-level labels. Extensive experiments show that our approach outperforms state-of-the-art methods on the Revisited Oxford and Paris datasets.
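
The pipeline described in the abstract (local features, tokenizer, refinement, concatenation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of hypothetical "query" vectors to produce spatial attention, the single-head attention without learned projections, the final L2 normalization, and the random stand-in features are all assumptions made only to show the data flow.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    # Weighted sum of equally sized vectors.
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def tokenize(local_feats, queries):
    # Assumed tokenizer: each hypothetical query scores every location,
    # a softmax over locations gives an attention map, and the token is
    # the attention-weighted sum of the local features.
    return [weighted_sum(softmax([dot(q, f) for f in local_feats]), local_feats)
            for q in queries]

def attend(queries, context):
    # Single-head scaled dot-product attention with a residual connection.
    # Learned projections are omitted (an assumption for brevity).
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in context])
        ctx = weighted_sum(w, context)
        out.append([a + b for a, b in zip(q, ctx)])
    return out

def refine(tokens, local_feats):
    tokens = attend(tokens, tokens)       # self-attention among tokens
    tokens = attend(tokens, local_feats)  # cross-attention to local features
    return tokens

def global_descriptor(tokens):
    # Concatenate tokens, then L2-normalize (normalization is an assumption).
    flat = [v for t in tokens for v in t]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

random.seed(0)
C, N, L = 8, 16, 4  # feature dim, number of locations, number of tokens
# Random values stand in for CNN local features and learned queries.
local_feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(C)] for _ in range(L)]

tokens = refine(tokenize(local_feats, queries), local_feats)
desc = global_descriptor(tokens)
print(len(desc))  # L * C = 32
```

With a descriptor of this form, two images could be compared by a dot product of their descriptors, which amounts to summing the similarities of corresponding tokens.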

Published

2022-06-28

How to Cite

Wu, H., Wang, M., Zhou, W., Hu, Y., & Li, H. (2022). Learning Token-Based Representation for Image Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3), 2703-2711. https://doi.org/10.1609/aaai.v36i3.20173

Section

AAAI Technical Track on Computer Vision III