SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)


  • Atharva Naik Carnegie Mellon University
  • Yash Parag Butala Carnegie Mellon University
  • Navaneethan Vaikunthan Carnegie Mellon University
  • Raghav Kapoor Carnegie Mellon University



Computer Vision, Machine Learning, Machine Perception, Applications Of AI, Natural Language Processing, Multi-modal Vision, Multimodal Learning, Language And Vision, Language Grounding & Multi-modal NLP, Visual Question Answering, Question Answering


When humans are posed with a difficult problem, they often approach it by identifying key skills, honing them, and finally combining them effectively. We propose a novel method, applied to the VizWiz VQA task, that predicts the visual skills needed to answer a question and leverages expert modules to produce intermediary outputs, fusing them in a skill-aware manner. Unlike prior work in visual question answering (VQA) that uses intermediate outputs such as detected objects and Optical Character Recognition (OCR), our approach uses a skill embedding to explicitly guide the model's focus. While our results show that skill-aware fusion outperforms skill-unaware models for only a subset of questions, we believe they suggest interesting directions for future work. We also release our code, model, and illustrative demonstrations for future research purposes.
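To make the idea of skill-aware fusion concrete, here is a minimal, hypothetical sketch, not the authors' implementation: a skill predictor scores the relevance of each expert module (e.g., OCR, object detection) for a given question, and the experts' outputs are fused as a weighted combination. All names, shapes, and the weighting scheme below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over skill scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def skill_aware_fusion(question_emb, expert_outputs, skill_proj):
    """Fuse expert outputs weighted by predicted skill relevance.

    question_emb:   (d,) embedding of the question (assumed given).
    expert_outputs: list of (d,) feature vectors, one per expert module
                    (e.g., an OCR expert and an object-detection expert).
    skill_proj:     (n_experts, d) learned projection scoring each skill;
                    here it is just a placeholder matrix.
    """
    logits = skill_proj @ question_emb          # one relevance score per skill
    weights = softmax(logits)                   # skill-relevance distribution
    fused = sum(w * out for w, out in zip(weights, expert_outputs))
    return fused, weights

# Illustrative usage with random placeholder features.
rng = np.random.default_rng(0)
d, n_experts = 8, 2
q = rng.normal(size=d)
experts = [rng.normal(size=d) for _ in range(n_experts)]
proj = rng.normal(size=(n_experts, d))
fused, weights = skill_aware_fusion(q, experts, proj)
```

In this sketch the fusion is a soft mixture, so a question predicted to need reading text would upweight the OCR expert's features; the actual SkillCLIP architecture may fuse modalities differently.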




How to Cite

Naik, A., Butala, Y. P., Vaikunthan, N., & Kapoor, R. (2024). SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23592-23593.