D²-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation

Authors

  • Zheyuan Zhang, School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
  • Jiwei Zhang, School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
  • Boyu Zhou, School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
  • Linzhimeng Duan, School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications
  • Hong Chen, School of Computer Science, Beijing University of Posts and Telecommunications; Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v40i15.38303

Abstract

Visual Place Recognition (VPR) aims to determine the geographic location of a query image by retrieving its most visually similar counterpart from a geo-tagged reference database. Recently, the emergence of the powerful visual foundation model DINOv2, trained in a self-supervised manner on massive datasets, has significantly improved VPR performance. This improvement stems from DINOv2’s exceptional feature generalization capabilities, but it comes with increased model complexity and computational overhead that impede deployment on resource-constrained devices. To address this challenge, we propose D²-VPR, a Distillation- and Deformable-based framework that retains the strong feature extraction capabilities of visual foundation models while significantly reducing model parameters, achieving a more favorable performance-efficiency trade-off. Specifically, we first employ a two-stage training strategy that integrates knowledge distillation and fine-tuning, and introduce a Distillation Recovery Module (DRM) to better align the feature spaces of the teacher and student models, thereby minimizing knowledge transfer loss. Second, we design a Top-Down-attention-based Deformable Aggregator (TDDA) that leverages global semantic features to dynamically and adaptively adjust the Regions of Interest (ROIs) used for aggregation, improving adaptability to irregular structures. Extensive experiments demonstrate that our method achieves performance competitive with state-of-the-art approaches while reducing the parameter count by approximately 64.2% compared to CricaVPR.
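
To make the distillation stage concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a small recovery projection (a hypothetical stand-in for the DRM) maps student features into the teacher's feature space before a distillation loss is computed. The module sizes, names, and the choice of a cosine loss are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RecoveryModule(nn.Module):
    """Hypothetical DRM stand-in: projects student features to the
    teacher's dimensionality so the two feature spaces can be compared."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.GELU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def distillation_loss(student_feats, teacher_feats, recovery):
    """Cosine-based feature distillation after the recovery projection."""
    aligned = recovery(student_feats)
    return 1.0 - F.cosine_similarity(aligned, teacher_feats, dim=-1).mean()

if __name__ == "__main__":
    # Toy usage with random stand-ins for patch-token features.
    B, N, d_s, d_t = 2, 196, 384, 1024  # batch, tokens, student/teacher dims
    recovery = RecoveryModule(d_s, d_t)
    s = torch.randn(B, N, d_s, requires_grad=True)  # student backbone output
    t = torch.randn(B, N, d_t)                      # frozen DINOv2 teacher output
    loss = distillation_loss(s, t.detach(), recovery)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")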
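The deformable-aggregation idea can be sketched in a similar spirit: a global descriptor (the top-down signal) predicts per-anchor 2D offsets, and local features are bilinearly sampled at the shifted locations, so the regions used for aggregation adapt to the image content. The regular grid of reference points, the tanh offset parameterization, and the mean-pooling readout below are assumptions for illustration, not the exact TDDA design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAggregator(nn.Module):
    """Sketch of top-down deformable aggregation (not the paper's TDDA)."""
    def __init__(self, dim: int, num_anchors: int = 16):
        super().__init__()
        # Fixed reference points on a regular grid in [-1, 1] x [-1, 1].
        side = int(num_anchors ** 0.5)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, side), torch.linspace(-1, 1, side),
            indexing="ij",
        )
        self.register_buffer("ref", torch.stack([xs, ys], -1).view(-1, 2))
        # Top-down branch: global feature -> per-anchor (dx, dy) offsets.
        self.offset_head = nn.Linear(dim, num_anchors * 2)

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) dense local features from the backbone.
        B, C, H, W = feat_map.shape
        global_desc = feat_map.mean(dim=(2, 3))               # (B, C)
        offsets = self.offset_head(global_desc)               # (B, A*2)
        offsets = 0.25 * torch.tanh(offsets).view(B, -1, 2)   # bounded shifts
        loc = (self.ref.unsqueeze(0) + offsets).clamp(-1, 1)  # (B, A, 2)
        # Bilinearly sample features at the deformed anchor locations.
        sampled = F.grid_sample(
            feat_map, loc.unsqueeze(2), align_corners=False,
        )                                                      # (B, C, A, 1)
        return sampled.squeeze(-1).mean(dim=-1)                # (B, C) descriptor

if __name__ == "__main__":
    agg = DeformableAggregator(dim=384, num_anchors=16)
    desc = agg(torch.randn(2, 384, 14, 14))
    print(desc.shape)  # torch.Size([2, 384])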

Published

2026-03-14

How to Cite

Zhang, Z., Zhang, J., Zhou, B., Duan, L., & Chen, H. (2026). D²-VPR: A Parameter-efficient Visual-foundation-model-based Visual Place Recognition Method via Knowledge Distillation and Deformable Aggregation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 13034–13042. https://doi.org/10.1609/aaai.v40i15.38303

Issue

Vol. 40 No. 15 (2026)

Section

AAAI Technical Track on Computer Vision XII