Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

Authors

  • Yimei Zhang, Zhejiang University of Technology; Zhejiang Key Laboratory of Visual Information Intelligent Processing
  • Guojiang Shen, Zhejiang University of Technology; Zhejiang Key Laboratory of Visual Information Intelligent Processing
  • Kaili Ning, Zhejiang University of Technology; Zhejiang Key Laboratory of Visual Information Intelligent Processing
  • Tongwei Ren, Nanjing University
  • Xuebo Qiu, Zhejiang University of Technology
  • Mengmeng Wang, Zhejiang University of Technology; Zhejiang Key Laboratory of Visual Information Intelligent Processing
  • Xiangjie Kong, Zhejiang University of Technology; Zhejiang Key Laboratory of Visual Information Intelligent Processing

DOI:

https://doi.org/10.1609/aaai.v40i19.38678

Abstract

Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.
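To make the momentum-based self-distillation idea concrete, the sketch below shows one common way such a mechanism can be realized: an EMA ("momentum") copy of the image and text encoders produces soft pseudo-targets that are blended with the standard contrastive targets, so that noisy caption pairs contribute softer supervision. This is a minimal, hedged illustration under assumed design choices, not the authors' implementation; all names and hyperparameters (ema_decay, tau, alpha) are hypothetical.

```python
# Minimal sketch: EMA-teacher self-distillation for noisy image-caption alignment.
# All module and parameter names here are illustrative assumptions.
import copy
import torch
import torch.nn.functional as F
from torch import nn


class MomentumDistillAligner(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 ema_decay: float = 0.995, tau: float = 0.07, alpha: float = 0.4):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Momentum (teacher) copies are updated by EMA, never by gradients.
        self.image_encoder_m = copy.deepcopy(image_encoder)
        self.text_encoder_m = copy.deepcopy(text_encoder)
        for p in self.image_encoder_m.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder_m.parameters():
            p.requires_grad_(False)
        self.ema_decay, self.tau, self.alpha = ema_decay, tau, alpha

    @torch.no_grad()
    def _ema_update(self):
        for student, teacher in ((self.image_encoder, self.image_encoder_m),
                                 (self.text_encoder, self.text_encoder_m)):
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.data.mul_(self.ema_decay).add_(ps.data, alpha=1 - self.ema_decay)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Student embeddings (L2-normalized).
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(captions), dim=-1)

        with torch.no_grad():
            self._ema_update()
            # Teacher embeddings provide stable soft pseudo-targets.
            img_m = F.normalize(self.image_encoder_m(images), dim=-1)
            txt_m = F.normalize(self.text_encoder_m(captions), dim=-1)
            soft_i2t = F.softmax(img_m @ txt_m.t() / self.tau, dim=-1)
            soft_t2i = F.softmax(txt_m @ img_m.t() / self.tau, dim=-1)

        logits_i2t = img @ txt.t() / self.tau
        logits_t2i = txt @ img.t() / self.tau
        hard = torch.arange(images.size(0), device=logits_i2t.device)

        # Blend hard one-hot contrastive targets with the teacher's soft targets,
        # which down-weights supervision from mismatched or noisy captions.
        loss_hard = (F.cross_entropy(logits_i2t, hard) +
                     F.cross_entropy(logits_t2i, hard)) / 2
        loss_soft = (-(soft_i2t * F.log_softmax(logits_i2t, dim=-1)).sum(-1).mean()
                     - (soft_t2i * F.log_softmax(logits_t2i, dim=-1)).sum(-1).mean()) / 2
        return (1 - self.alpha) * loss_hard + self.alpha * loss_soft
```

In this kind of setup, the teacher's slowly-evolving parameters make its pseudo-targets change smoothly across training steps, which is what makes them usable as a stabilizer under noisy caption supervision.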

Published

2026-03-14

How to Cite

Zhang, Y., Shen, G., Ning, K., Ren, T., Qiu, X., Wang, M., & Kong, X. (2026). Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16397–16405. https://doi.org/10.1609/aaai.v40i19.38678

Issue

Vol. 40 No. 19 (2026)

Section

AAAI Technical Track on Data Mining & Knowledge Management III