Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment

Authors

  • Yang Liu — College of Computer Science, Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education
  • Mengyuan Liu — State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School
  • Shudong Huang — College of Computer Science, Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education
  • Jiancheng Lv — College of Computer Science, Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education

DOI:

https://doi.org/10.1609/aaai.v39i6.32605

Abstract

Learning visual semantic similarity is a critical challenge in bridging the gap between images and texts. However, there are inherent variations between vision and language data, such as information density: an image can contain information corresponding to multiple different textual views, which makes it difficult to compute the similarity between the two modalities accurately and efficiently. In this paper, we propose a novel framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. To capture information from different views of the image, we design a radial bias sampling module that samples image patches and obtains image features from various views. Furthermore, AVSE introduces a novel module for efficient computation of visual semantic similarity between asymmetric image and text embeddings. Central to this module is the presumption of foundational semantic units within the embeddings, denoted as "meta-semantic embeddings." It segments all embeddings into meta-semantic embeddings of the same dimension and calculates visual semantic similarity by finding the optimal match between the meta-semantic embeddings of the two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
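The abstract's matching scheme can be illustrated with a minimal sketch: both embeddings are cut into equal-dimension "meta-semantic" chunks, and similarity is scored by matching each text chunk to its best image chunk. The chunk dimension `d`, the greedy max-over-image-chunks matching, and the function name `avse_similarity` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def avse_similarity(img_emb, txt_emb, d=2):
    """Chunk-wise similarity between asymmetric embeddings (illustrative sketch).

    Each embedding is reshaped into meta-semantic chunks of dimension d
    (so the two embeddings may have different total lengths); similarity
    is the average, over text chunks, of the best-matching image chunk.
    """
    def chunks(x):
        x = np.asarray(x, dtype=float)
        if x.size % d != 0:
            raise ValueError("embedding length must be a multiple of d")
        c = x.reshape(-1, d)                          # (K, d) meta-semantic units
        return c / np.linalg.norm(c, axis=1, keepdims=True)

    vi, vt = chunks(img_emb), chunks(txt_emb)         # (Ki, d), (Kt, d)
    sim = vt @ vi.T                                   # cosine sim between all chunk pairs
    return float(sim.max(axis=1).mean())              # best image chunk per text chunk

# The image embedding may hold more chunks (views) than the text embedding:
score = avse_similarity([1, 0, 0, 1], [1, 0], d=2)    # text view matches first image view
```

Because the text side only needs its own chunks matched, a short caption can align perfectly with one "view" of a richer image embedding, which is the intuition behind the asymmetric design.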

Published

2025-04-11

How to Cite

Liu, Y., Liu, M., Huang, S., & Lv, J. (2025). Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 39(6), 5676-5684. https://doi.org/10.1609/aaai.v39i6.32605

Section

AAAI Technical Track on Computer Vision V