HybridPrompt: Bridging Language Models and Human Priors in Prompt Tuning for Visual Question Answering

Zhiyuan Ma; Zhihuan Yu; Jianjun Li; Guohui Li

doi:10.1609/aaai.v37i11.26569

Authors

Zhiyuan Ma, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Zhihuan Yu, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Jianjun Li, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Guohui Li, School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China

DOI:

https://doi.org/10.1609/aaai.v37i11.26569

Keywords:

SNLP: Question Answering

Abstract

Visual Question Answering (VQA) aims to answer a natural-language question about a given image by understanding multimodal content. However, the answer quality of most existing visual-language pre-training (VLP) methods is still limited, mainly for two reasons: (1) Incompatibility. Upstream pre-training tasks are generally incompatible with downstream question answering tasks, so the knowledge in the language model does not transfer well to downstream tasks, which greatly limits performance in few-shot scenarios. (2) Under-fitting. These methods generally do not integrate human priors to complement the universal knowledge from language models, which would help them fit the challenging VQA problem and generate reliable answers. To address these issues, we propose HybridPrompt, a cloze- and verify-style hybrid prompt framework that bridges language models and human priors in prompt tuning for VQA. Specifically, we first convert the input questions into cloze-style prompts to narrow the gap between the upstream pre-training tasks and the downstream VQA task, which ensures that the universal knowledge in the language model can be better transferred to the subsequent human prior-guided prompt tuning. Then, imitating the cognitive process of the human brain, we introduce topic- and sample-related priors to construct a dynamic learnable prompt template for human prior-guided prompt learning. Finally, we add fixed-length learnable free parameters to further enhance the generalizability and scalability of prompt learning in the VQA model. Experimental results verify the effectiveness of HybridPrompt, showing that it achieves competitive performance against previous methods on the widely used VQAv2 dataset and obtains new state-of-the-art results. Our code is released at: https://github.com/zhizhi111/hybrid.
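
To make the idea of a cloze-style prompt concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a `[MASK]` slot in the style of masked language models, two toy rewrite rules, a textual topic prior, and placeholder tokens `[P0]`, `[P1]`, ... standing in for the fixed-length learnable free parameters. The function names `to_cloze` and `build_prompt` are hypothetical and do not come from the released code.

```python
import re

MASK = "[MASK]"  # assumed mask token, as used by masked language models


def to_cloze(question: str, mask: str = MASK) -> str:
    """Rewrite a question as a statement with a mask slot (toy rules only)."""
    q = question.strip().rstrip("?").strip()
    # Toy rule 1: "What <attr> is <thing>" -> "The <attr> of <thing> is [MASK]."
    m = re.match(r"(?i)^what (\w+) is (.+)$", q)
    if m:
        return f"The {m.group(1).lower()} of {m.group(2)} is {mask}."
    # Toy rule 2: "How many <noun> are <rest>" -> "There are [MASK] <noun> <rest>."
    m = re.match(r"(?i)^how many (\w+) are (.+)$", q)
    if m:
        return f"There are {mask} {m.group(1)} {m.group(2)}."
    # Fallback: keep the question and append an answer slot.
    return f"Question: {q}? Answer: {mask}."


def build_prompt(question: str, topic: str, n_free: int = 4) -> list[str]:
    """Assemble a hybrid prompt: free slots, a topic prior, then the cloze text.

    The [P*] placeholders are an assumption: in a real model they would map
    to trainable embedding vectors rather than to literal strings.
    """
    free_slots = [f"[P{i}]" for i in range(n_free)]
    return free_slots + [f"Topic: {topic}."] + [to_cloze(question)]


if __name__ == "__main__":
    print(to_cloze("What color is the cat?"))
    print(build_prompt("How many dogs are in the park?", topic="counting"))
```

In this sketch a masked language model would fill the `[MASK]` position, which is the same kind of prediction it made during pre-training. That is the sense in which a cloze-style prompt narrows the gap between the upstream and downstream tasks.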
