Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective
Keywords: Speech & Natural Language Processing (SNLP), Computer Vision (CV), Machine Learning (ML)
Abstract
In recent years, joint text-image embeddings have improved significantly thanks to the development of transformer-based Vision-Language models. Despite these advances, the representations produced by these models remain poorly understood. In this paper, we compare pre-trained and fine-tuned representations at the vision, language, and multimodal levels. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models, and introduce new datasets designed specifically for multimodal probing. These datasets are carefully constructed to cover a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models appear to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that these models are able to pinpoint fine-grained multimodal differences. Finally, we also observe that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate the experiments.
How to Cite
Salin, E., Farah, B., Ayache, S., & Favre, B. (2022). Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11248-11257. https://doi.org/10.1609/aaai.v36i10.21375
AAAI Technical Track on Speech and Natural Language Processing