UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction

Authors

  • Xixuan Hao, The Hong Kong University of Science and Technology (Guangzhou)
  • Wei Chen, The Hong Kong University of Science and Technology (Guangzhou)
  • Yibo Yan, The Hong Kong University of Science and Technology (Guangzhou)
  • Siru Zhong, The Hong Kong University of Science and Technology (Guangzhou)
  • Kun Wang, National University of Singapore
  • Qingsong Wen, Squirrel Ai Learning
  • Yuxuan Liang, The Hong Kong University of Science and Technology (Guangzhou)

DOI:

https://doi.org/10.1609/aaai.v39i27.35024

Abstract

Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. First, concentrating solely on macro-level patterns from satellite data may introduce bias, because such data lacks nuanced micro-level details, such as the architectural details of a place. Second, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, resulting in unreliable quality. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. UrbanVLP integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.

Published

2025-04-11

How to Cite

Hao, X., Chen, W., Yan, Y., Zhong, S., Wang, K., Wen, Q., & Liang, Y. (2025). UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 28061–28069. https://doi.org/10.1609/aaai.v39i27.35024