Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective
Keywords: Speech & Natural Language Processing (SNLP), Computer Vision (CV), Machine Learning (ML)
Abstract
In recent years, joint text-image embeddings have improved significantly thanks to the development of transformer-based Vision-Language models. Despite these advances, the representations produced by these models remain poorly understood. In this paper, we compare pre-trained and fine-tuned representations at the vision, language, and multimodal levels. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models, and introduce new datasets designed specifically for multimodal probing. These datasets are carefully constructed to cover a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models appear to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that these models are able to pinpoint fine-grained multimodal differences. Finally, we also observe that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate the experiments.
How to Cite
Salin, E., Farah, B., Ayache, S., & Favre, B. (2022). Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11248-11257. https://doi.org/10.1609/aaai.v36i10.21375
AAAI Technical Track on Speech and Natural Language Processing