Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Authors

  • Linhao Huang (Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory, Shenzhen, Guangdong, China; Southern University of Science and Technology)
  • Xue Jiang (Southern University of Science and Technology; TMLR Group, Hong Kong Baptist University)
  • Zhiqiang Wang (Hong Kong University of Science and Technology)
  • Wentao Mo (Shenzhen International Graduate School, Tsinghua University; Southern University of Science and Technology)
  • Xi Xiao (Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory, Shenzhen, Guangdong, China)
  • Yong-Jie Yin (China Electronics Corporation)
  • Bo Han (TMLR Group, Hong Kong Baptist University)
  • Feng Zheng (Southern University of Science and Technology)

DOI:

https://doi.org/10.1609/aaai.v40i7.37420

Abstract

Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models—a common and practical real-world scenario—remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples. Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. Additionally, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method generates adversarial examples with strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as a surrogate model) achieve competitive performance, with average attack success rates (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for zero-shot VideoQA tasks.
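The core idea sketched in the abstract—perturbing sampled frames so their latent features drift away from the clean ones, then propagating the perturbation across all frames to cope with unknown frame-sampling strategies—can be illustrated with a toy NumPy example. This is a conceptual sketch only, not the authors' implementation: the linear `encode` function, the hypothetical `i2v_style_attack` helper, and the mean-based propagation are all illustrative assumptions standing in for a real image-encoder surrogate and the paper's actual propagation technique.

```python
import numpy as np

def encode(frames, W):
    # Toy stand-in for an image encoder: flatten each frame, project linearly.
    return frames.reshape(frames.shape[0], -1) @ W

def i2v_style_attack(frames, W, eps=8/255, alpha=2/255, steps=10,
                     key_idx=(0,), seed=0):
    """Illustrative sketch (not the paper's method): maximize latent
    feature distance of sampled key frames under an L-inf budget, then
    share the perturbation across every frame of the video."""
    key = list(key_idx)
    clean_feat = encode(frames[key], W)
    rng = np.random.default_rng(seed)
    # Random start inside the budget so the initial gradient is nonzero.
    delta = rng.uniform(-eps, eps, size=frames[key].shape)
    for _ in range(steps):
        adv_feat = encode(frames[key] + delta, W)
        # Gradient of ||f(x+d) - f(x)||^2 w.r.t. the flattened input.
        grad = 2.0 * (adv_feat - clean_feat) @ W.T
        # Sign-gradient ascent step, projected back into the L-inf ball.
        delta = np.clip(delta + alpha * np.sign(grad.reshape(delta.shape)),
                        -eps, eps)
    # "Perturbation propagation": apply the averaged key-frame
    # perturbation to all frames, so any sampled subset is affected.
    shared = delta.mean(axis=0)
    return np.clip(frames + shared, 0.0, 1.0)
```

In a real attack the gradient would come from backpropagating through the surrogate I-MLLM's vision encoder rather than a linear map, but the structure (feature-space objective, budget projection, cross-frame propagation) is the same.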

Published

2026-03-14

How to Cite

Huang, L., Jiang, X., Wang, Z., Mo, W., Xiao, X., Yin, Y.-J., … Zheng, F. (2026). Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5067–5075. https://doi.org/10.1609/aaai.v40i7.37420

Section

AAAI Technical Track on Computer Vision IV