Dynamic Position-aware Network for Fine-grained Image Recognition

Authors

  • Shijie Wang, International School of Information Science & Engineering, Dalian University of Technology, China; Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, China
  • Haojie Li, International School of Information Science & Engineering, Dalian University of Technology, China; Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, China
  • Zhihui Wang, International School of Information Science & Engineering, Dalian University of Technology, China; Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, China
  • Wanli Ouyang, SenseTime Computer Vision Research Group, The University of Sydney, Australia

Keywords:

Object Detection & Categorization

Abstract

Most weakly supervised fine-grained image recognition (WFGIR) approaches focus on learning discriminative details, which contain both visual variances and position clues. The position clues can be learned indirectly by exploiting the context of discriminative visual content. However, this causes the selected discriminative regions to contain non-discriminative information introduced by the position clues. This analysis motivates us to incorporate position clues directly into the visual content so that the model can concentrate on visual variances alone, achieving more precise discriminative region localization. Though important, position modeling usually requires extensive pixel/region annotations and is therefore labor-intensive. To address this issue, we propose an end-to-end Dynamic Position-aware Network (DP-Net) that directly incorporates position clues into visual content and dynamically aligns them without extra annotations, eliminating the effect of position information on the visual variances of subcategories. In particular, DP-Net consists of: 1) a Position Encoding Module, which learns a set of position-aware parts by directly adding learnable position information to the horizontal/vertical visual content of images; 2) a Position-vision Aligning Module, which dynamically aligns visual content and learnable position information by performing graph convolution on the position-aware parts; and 3) a Position-vision Reorganization Module, which projects the aligned position clues and visual content into Euclidean space to construct position-aware feature maps. Finally, the position-aware feature maps, which implicitly encode the aligned visual content and position clues, are used for more accurate discriminative region localization. Extensive experiments verify that DP-Net yields the best performance under the same settings as the most competitive approaches on the CUB Bird, Stanford-Cars, and FGVC Aircraft datasets.
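The abstract's three-module pipeline can be illustrated with a minimal numerical sketch. The code below is an assumption-laden toy reconstruction from the abstract alone, not the authors' implementation: the axis-pooled "horizontal/vertical visual content", the fully connected part adjacency, and the broadcast-based reorganization are all stand-ins (in the paper these components and their parameters are learned end-to-end).

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature map from a backbone CNN: (channels, height, width).
C, H, W = 8, 4, 4
feat = rng.standard_normal((C, H, W))

# 1) Position Encoding Module (sketch): pool visual content along each
#    axis and add learnable position embeddings (randomly initialized
#    here) to obtain H + W "position-aware parts", each a C-dim vector.
row_content = feat.mean(axis=2).T            # (H, C) horizontal content
col_content = feat.mean(axis=1).T            # (W, C) vertical content
pos_embed = rng.standard_normal((H + W, C))  # learnable position clues
parts = np.concatenate([row_content, col_content], axis=0) + pos_embed

# 2) Position-vision Aligning Module (sketch): one graph-convolution
#    step over the parts, X' = D^{-1/2} (A + I) D^{-1/2} X W_g, with a
#    fully connected adjacency standing in for the learned part graph.
N = H + W
A_hat = np.ones((N, N)) + np.eye(N)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
W_g = rng.standard_normal((C, C)) * 0.1      # graph-conv weight (learnable)
aligned = np.maximum(A_norm @ parts @ W_g, 0.0)  # ReLU activation

# 3) Position-vision Reorganization Module (sketch): broadcast the
#    aligned row/column parts back onto the spatial grid to recover a
#    position-aware feature map with the original (C, H, W) shape.
row_part, col_part = aligned[:H], aligned[H:]                # (H, C), (W, C)
pos_aware = row_part.T[:, :, None] + col_part.T[:, None, :]  # (C, H, W)

print(pos_aware.shape)  # → (8, 4, 4)
```

The reorganized `pos_aware` map would then feed the discriminative-region localization step; in practice each stage's weights are trained jointly rather than fixed as here.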


Published

2021-05-18

How to Cite

Wang, S., Li, H., Wang, Z., & Ouyang, W. (2021). Dynamic Position-aware Network for Fine-grained Image Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 35(4), 2791-2799. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16384

Section

AAAI Technical Track on Computer Vision III