Global Fusion Attention for Vision and Language Understanding (Student Abstract)

Zixin Guo; Chen Liang; Ziyu Wan; Yang Bai

doi:10.1609/aaai.v35i18.17891

Authors

Zixin Guo, Aalto University
Chen Liang, City University of Hong Kong
Ziyu Wan, City University of Hong Kong
Yang Bai, China University of Geosciences, Beijing

DOI:

https://doi.org/10.1609/aaai.v35i18.17891

Keywords:

Vision And Language Understanding, Attention Mechanism, Multimodal

Abstract

We extend the popular Transformer architecture to a multimodal model that processes both visual and textual inputs. We propose a new attention mechanism for Transformer-based architectures on joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which could better capture the internal relationships. Experiments on the benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also observe improvements in the sample efficiency of reinforcement learning in experiments on the grounded language understanding tasks of the BabyAI platform.
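
The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way a weighted fusion of multi-level image-text attention could look. It assumes that each "level" is a pair of text and image feature matrices (for example, taken from different encoder layers), that text tokens attend to image regions with plain scaled dot-product attention, and that the per-level outputs are combined with softmax-normalized scalar weights. The function names, shapes, and the `level_logits` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(text, image):
    """Scaled dot-product attention: text tokens (queries) attend to image regions."""
    d = text.shape[-1]
    scores = text @ image.T / np.sqrt(d)      # shape (n_text, n_image)
    return softmax(scores, axis=-1) @ image   # shape (n_text, d)

def weighted_level_fusion(text_levels, image_levels, level_logits):
    """Fuse per-level cross-attention outputs with softmax-normalized weights.

    Assumption: one scalar logit per level; in a trained model these would be learned.
    """
    weights = softmax(np.asarray(level_logits, dtype=float))
    outputs = [cross_attention(t, v) for t, v in zip(text_levels, image_levels)]
    fused = sum(w * o for w, o in zip(weights, outputs))
    return fused, weights

# Toy inputs: 3 levels, 4 text tokens, 6 image regions, feature size 8.
rng = np.random.default_rng(0)
n_levels, n_text, n_image, d = 3, 4, 6, 8
text_levels = [rng.normal(size=(n_text, d)) for _ in range(n_levels)]
image_levels = [rng.normal(size=(n_image, d)) for _ in range(n_levels)]

# Equal logits give every level the same weight.
fused, weights = weighted_level_fusion(text_levels, image_levels, np.zeros(n_levels))
```

With equal logits the fusion reduces to a plain average over levels; unequal logits would let some levels contribute more than others, which is the sense of "weighted" assumed in this sketch.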

Published

2021-05-18

How to Cite

Guo, Z., Liang, C., Wan, Z., & Bai, Y. (2021). Global Fusion Attention for Vision and Language Understanding (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 35(18), 15789-15790. https://doi.org/10.1609/aaai.v35i18.17891

Issue

Vol. 35 No. 18: AAAI-21 Student Papers and Demonstrations

Section

AAAI Student Abstract and Poster Program
