SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)

Authors

  • Atharva Naik, Carnegie Mellon University
  • Yash Parag Butala, Carnegie Mellon University
  • Navaneethan Vaikunthan, Carnegie Mellon University
  • Raghav Kapoor, Carnegie Mellon University

DOI:

https://doi.org/10.1609/aaai.v38i21.30486

Keywords:

Computer Vision, Machine Learning, Machine Perception, Applications Of AI, Natural Language Processing, Multi-modal Vision, Multimodal Learning, Language And Vision, Language Grounding & Multi-modal NLP, Visual Question Answering, Question Answering

Abstract

When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally combining them effectively. We propose a novel method, applied to the VizWiz VQA task, that predicts the visual skills needed to answer a question and leverages expert modules to produce intermediate outputs that are fused in a skill-aware manner. Unlike prior work in visual question answering (VQA) that uses intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach explicitly guides the model on what to focus on via a skill embedding. While our results show that skill-aware fusion outperforms skill-unaware models only on a subset of questions, we believe they point to interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research.
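The skill-aware fusion idea described in the abstract can be illustrated with a short PyTorch sketch. Everything in it (the three-skill setup, layer names such as skill_head, and all dimensions) is an assumption made for illustration, not the authors' released code: a skill classifier predicts per-skill weights from the joint CLIP question-image embedding, and those weights gate the features produced by expert modules (e.g., object detection, OCR) before answer prediction.

    # Illustrative sketch of skill-aware fusion; names, dimensions, and the
    # three-skill taxonomy are assumptions, not the authors' released code.
    import torch
    import torch.nn as nn

    class SkillAwareFusion(nn.Module):
        """Predicts per-skill weights from a joint question-image embedding and
        uses them to gate expert features before answer prediction."""

        def __init__(self, clip_dim=512, expert_dim=256, num_skills=3, num_answers=3000):
            super().__init__()
            # Skill classifier: maps the joint CLIP embedding to skill scores.
            self.skill_head = nn.Linear(2 * clip_dim, num_skills)
            # One projection per expert so all expert outputs share a common dimension.
            self.expert_proj = nn.ModuleList(
                [nn.Linear(expert_dim, expert_dim) for _ in range(num_skills)]
            )
            # Answer head over the CLIP features plus skill-weighted expert features.
            self.answer_head = nn.Linear(2 * clip_dim + expert_dim, num_answers)

        def forward(self, img_emb, txt_emb, expert_feats):
            # img_emb, txt_emb: (B, clip_dim) CLIP image/question embeddings.
            # expert_feats: list of num_skills tensors, each (B, expert_dim).
            joint = torch.cat([img_emb, txt_emb], dim=-1)
            skill_weights = torch.softmax(self.skill_head(joint), dim=-1)  # (B, num_skills)
            projected = torch.stack(
                [proj(f) for proj, f in zip(self.expert_proj, expert_feats)], dim=1
            )  # (B, num_skills, expert_dim)
            # Skill-aware fusion: weight each expert's features by its predicted skill score.
            fused_experts = (skill_weights.unsqueeze(-1) * projected).sum(dim=1)
            return self.answer_head(torch.cat([joint, fused_experts], dim=-1))

    # Example usage with random tensors (batch of 2).
    model = SkillAwareFusion()
    img, txt = torch.randn(2, 512), torch.randn(2, 512)
    experts = [torch.randn(2, 256) for _ in range(3)]
    print(model(img, txt, experts).shape)  # torch.Size([2, 3000])

In this sketch the skill weights act as a soft attention over expert modules, which is one natural reading of "skill-aware fusion"; the released code mentioned in the abstract should be consulted for the authors' actual design.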

Published

2024-03-24

How to Cite

Naik, A., Butala, Y. P., Vaikunthan, N., & Kapoor, R. (2024). SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23592-23593. https://doi.org/10.1609/aaai.v38i21.30486