CLIP-MSM: A Multi-Semantic Mapping Brain Representation for Human High-Level Visual Cortex

Authors

  • Guoyuan Yang (Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China; School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Mufan Xue (Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China)
  • Ziming Mao (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Haofang Zheng (School of Integrated Circuits and Electronics, Beijing Institute of Technology, Beijing 100081, China)
  • Jia Xu (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Dabin Sheng (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Ruotian Sun (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Ruoqi Yang (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
  • Xuesong Li (School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)

DOI

https://doi.org/10.1609/aaai.v39i9.32994

Abstract

Prior work combining deep neural networks (DNNs) with explainability techniques has identified category-selective representations in human visual cortex. However, constructing high-performing encoding models that accurately capture brain responses to multiple coexisting semantics remains elusive. Here, we combine CLIP models with CLIP Dissection to establish a multi-semantic mapping framework (CLIP-MSM) for hypothesis-free analysis of human high-level visual cortex. First, we use CLIP models to construct voxel-wise encoding models that predict visual cortical responses to natural scene images. Then, we apply CLIP Dissection and normalize the semantic mapping scores to map individual brain voxels to multiple semantics. Our findings indicate that, applied to DNNs modeling human high-level visual cortex, CLIP Dissection achieves better interpretability accuracy than Network Dissection. In addition, to demonstrate how our method enables fine-grained discovery in hypothesis-free analysis, we quantify how accurately brain activation reconstructed by CLIP-MSM in response to faces, bodies, places, words, and food matches ground-truth brain activation, and we show that CLIP-MSM predicts visual responses more accurately than CLIP Dissection. Our results are validated on two large natural image datasets: the Natural Scenes Dataset (NSD) and the Natural Object Dataset (NOD).
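
The two-stage pipeline outlined in the abstract lends itself to a short illustration. Below is a minimal sketch, assuming ridge regression for the voxel-wise encoder and cosine similarity between voxel weight vectors and CLIP text embeddings as a stand-in for the CLIP Dissection scoring step; the concept list, variable names, and row-wise normalization are illustrative assumptions, not the authors' released implementation.

    # Minimal sketch of the pipeline described in the abstract.
    # Assumptions: ridge regression encoder; cosine similarity to CLIP
    # text embeddings as a simplified proxy for CLIP Dissection.
    import numpy as np
    import torch
    import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Stage 1: voxel-wise encoding model.
    # X: CLIP image embeddings for stimulus images, shape (n_images, d)
    # Y: fMRI responses, shape (n_images, n_voxels)
    def fit_encoding_model(X, Y, alpha=1.0):
        d = X.shape[1]
        # Closed-form ridge solution: (X^T X + alpha I) W = X^T Y
        W = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)
        return W  # (d, n_voxels): each column is one voxel's weight vector

    # Stage 2: multi-semantic mapping.
    # Score each voxel against candidate concepts, then normalize per
    # voxel so one voxel maps to multiple semantics with comparable weights.
    concepts = ["face", "body", "place", "word", "food"]  # illustrative set

    @torch.no_grad()
    def semantic_mapping(W, concepts):
        tokens = clip.tokenize(concepts).to(device)
        T = model.encode_text(tokens).float()
        T = T / T.norm(dim=-1, keepdim=True)           # (n_concepts, d)
        V = torch.from_numpy(W.T).float().to(device)
        V = V / V.norm(dim=-1, keepdim=True)           # (n_voxels, d)
        scores = V @ T.T                               # cosine similarities
        # Shift each voxel's scores to be non-negative, then normalize
        # them to sum to 1 (an illustrative normalization choice).
        scores = scores - scores.min(dim=1, keepdim=True).values
        return scores / scores.sum(dim=1, keepdim=True)

Each row of the returned matrix assigns one voxel a distribution over the candidate concepts, which is the sense in which a single voxel is mapped to multiple semantics rather than a single winning category.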

Published

2025-04-11

How to Cite

Yang, G., Xue, M., Mao, Z., Zheng, H., Xu, J., Sheng, D., … Li, X. (2025). CLIP-MSM: A Multi-Semantic Mapping Brain Representation for Human High-Level Visual Cortex. Proceedings of the AAAI Conference on Artificial Intelligence, 39(9), 9184–9192. https://doi.org/10.1609/aaai.v39i9.32994

Section

AAAI Technical Track on Computer Vision VIII