UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

Authors

  • Tiancheng Gu MiroMind AI, The University of Sydney
  • Kaicheng Yang M.R.L Team
  • Kaichen Zhang MiroMind AI, LMMs-Lab Team
  • Xiang An M.R.L Team
  • Ziyong Feng M.R.L Team
  • Yueyi Zhang MiroMind AI
  • Weidong Cai The University of Sydney
  • Jiankang Deng Imperial College London
  • Lidong Bing MiroMind AI

DOI:

https://doi.org/10.1609/aaai.v40i26.39284

Abstract

Universal multimodal embedding models are essential in various tasks. Existing approaches typically use in-batch mining to identify hard negatives by measuring the similarity of query-candidate pairs. However, these methods often struggle to capture subtle semantic differences among candidates and lack diversity in negative samples. Moreover, the embeddings exhibit limited discriminative ability in distinguishing false and hard negatives. In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning, and present a novel Universal Multimodal Embedding (UniME-V2) model. Our approach first constructs a potential hard negative set through global retrieval. We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs and generate soft semantic matching scores. These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives. Furthermore, the semantic matching scores are used as soft labels to mitigate the rigid one-to-one mapping constraint. By aligning the similarity matrix with the soft semantic matching score matrix, the model learns semantic distinctions among candidates, significantly enhancing its discriminative capacity. To further improve performance, we propose UniME-V2-Reranker, a reranking model trained on our mined hard negatives through a joint pairwise and listwise optimization approach. We conduct comprehensive experiments on the MMEB benchmark and multiple retrieval tasks, demonstrating that our method achieves state-of-the-art performance across all tasks.
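
The two uses of the judge scores described in the abstract (filtering likely false negatives before picking hard negatives, and aligning model similarities with soft labels) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the selection rule or the alignment objective, so the threshold-based filter, the KL divergence, the temperature, and all function and parameter names below are assumptions made for illustration only.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_hard_negatives(judge_scores, positive_index,
                          false_negative_threshold, k):
    """Pick up to k hard negatives for one query (hypothetical rule).

    Candidates whose judge score is at or above the threshold are treated
    as likely false negatives and dropped. Among the rest, the k candidates
    with the highest judge scores are kept, since they are the ones most
    easily confused with the positive.
    """
    kept = [
        (score, idx)
        for idx, score in enumerate(judge_scores)
        if idx != positive_index and score < false_negative_threshold
    ]
    # Highest score first; ties broken by candidate index for determinism.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in kept[:k]]


def soft_alignment_loss(similarities, judge_scores, temperature=1.0):
    """KL(target || predicted) for one query (assumed objective).

    The target distribution comes from the judge scores, the predicted
    distribution from the embedding similarities.
    """
    target = softmax(judge_scores, temperature)
    predicted = softmax(similarities, temperature)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted))


# Toy example: candidate 0 is the labelled positive, candidate 1 is a
# likely false negative because the judge scores it almost as high.
judge_scores = [0.95, 0.90, 0.60, 0.10]
hard_negatives = select_hard_negatives(
    judge_scores, positive_index=0, false_negative_threshold=0.8, k=2
)
loss = soft_alignment_loss([0.7, 0.2, 0.6, 0.1], judge_scores)
```

In this toy setup the loss is zero when the similarity row reproduces the judge score row and grows as the two distributions diverge, which is the sense in which soft labels relax a strict one-to-one query-candidate mapping.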

Published

2026-03-14

How to Cite

Gu, T., Yang, K., Zhang, K., An, X., Feng, Z., Zhang, Y., … Bing, L. (2026). UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 40(26), 21378–21386. https://doi.org/10.1609/aaai.v40i26.39284

Issue

Vol. 40 No. 26 (2026)

Section

AAAI Technical Track on Machine Learning III