FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

Authors

  • Hongyang Wang (Shijiazhuang Tiedao University; Shijiazhuang Key Laboratory of Artificial Intelligence)
  • Yichen Shi (Shanghai Jiao Tong University; Eastern Institute of Technology)
  • Zhuofu Tao (Eastern Institute of Technology; University of California, Los Angeles)
  • Yuhao Gao (Shijiazhuang Tiedao University; Shijiazhuang Key Laboratory of Artificial Intelligence)
  • Liepiao Zhang (GRGBanking)
  • Xun Lin (Great Bay University)
  • Jun Feng (Shijiazhuang Tiedao University; Shijiazhuang Key Laboratory of Artificial Intelligence)
  • Xiaochen Yuan (Macao Polytechnic University)
  • Zitong Yu (Great Bay University; Dongguan Key Laboratory for Intelligence and Information Technology; Guangdong Provincial Key Laboratory of Intelligent Information Processing & Shenzhen Key Laboratory of Media Security, Shenzhen University)
  • Xiaochun Cao (Sun Yat-sen University)

DOI:

https://doi.org/10.1609/aaai.v40i12.37945

Abstract

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods treated this task as a classification problem, offering little interpretability or reasoning behind their predictions. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for the FAS task. To address this gap, we propose FaceShield, an MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use a prompt-guided vision token masking (PVTM) strategy to randomly mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization.
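The abstract's prompt-guided vision token masking (PVTM) could be realized in several ways; the paper's exact formulation is not given on this page. Below is a minimal, hypothetical sketch (NumPy only) of one plausible reading: score each vision token by its similarity to the prompt embedding, then randomly drop a fixed fraction of tokens, biasing the sampling toward low-relevance tokens. All function and variable names here are illustrative, not the authors' implementation.

```python
import numpy as np

def prompt_guided_token_masking(vision_tokens, prompt_embedding,
                                mask_ratio=0.3, rng=None):
    """Hypothetical PVTM sketch: randomly mask vision tokens,
    with tokens less relevant to the prompt masked more often.

    vision_tokens: (N, D) array of vision token embeddings
    prompt_embedding: (D,) pooled prompt embedding
    Returns the surviving tokens and a boolean keep-mask.
    """
    rng = rng or np.random.default_rng(0)
    # Cosine similarity between each vision token and the prompt.
    v = vision_tokens / (np.linalg.norm(vision_tokens, axis=1, keepdims=True) + 1e-8)
    p = prompt_embedding / (np.linalg.norm(prompt_embedding) + 1e-8)
    sim = v @ p                                # (N,)
    # Turn similarities into masking probabilities:
    # low-relevance tokens get higher probability of being dropped.
    probs = np.exp(-sim)
    probs /= probs.sum()
    n_mask = int(mask_ratio * len(vision_tokens))
    masked_idx = rng.choice(len(vision_tokens), size=n_mask,
                            replace=False, p=probs)
    keep = np.ones(len(vision_tokens), dtype=bool)
    keep[masked_idx] = False
    return vision_tokens[keep], keep
```

Because the masking is stochastic rather than a fixed top-k pruning, the model sees a different token subset each step, which is one common way such masking can act as a regularizer and improve generalization.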

Published

2026-03-14

How to Cite

Wang, H., Shi, Y., Tao, Z., Gao, Y., Zhang, L., Lin, X., Feng, J., Yuan, X., Yu, Z., & Cao, X. (2026). FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(12), 9811-9819. https://doi.org/10.1609/aaai.v40i12.37945

Section

AAAI Technical Track on Computer Vision IX