SurgPub-Video: A Comprehensive Surgical Video Framework for Enhanced Surgical Intelligence in Vision-Language Model

Authors

  • Yaoqian Li Department of Computer Science and Engineering, The Chinese University of Hong Kong
  • Xikai Yang Department of Computer Science and Engineering, The Chinese University of Hong Kong
  • Dunyuan Xu Department of Computer Science and Engineering, The Chinese University of Hong Kong
  • Yang YU Department of Computer Science and Engineering, The Chinese University of Hong Kong
  • Litao Zhao Department of Computer Science and Engineering, The Chinese University of Hong Kong
  • Xiaowei Hu School of Future Technology, South China University of Technology
  • Jinpeng Li Department of Computer Science and Engineering, The Chinese University of Hong Kong
  • Pheng-Ann Heng Department of Computer Science and Engineering, The Chinese University of Hong Kong; Institute of Medical Intelligence and XR, The Chinese University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i8.37593

Abstract

Vision-Language Models (VLMs) have shown significant potential in surgical scene analysis, yet existing models are limited by frame-level datasets and lack high-quality video data with procedural surgical knowledge. To address these challenges, we make the following contributions: (i) SurgPub-Video, a comprehensive dataset of over 3,000 surgical videos and 25 million annotated frames across 11 specialities, sourced from peer-reviewed clinical journals; (ii) SurgLLaVA-Video, a specialized VLM for surgical video understanding, built upon the TinyLLaVA-Video architecture that supports both video-level and frame-level inputs; and (iii) a video-level surgical Visual Question Answering (VQA) benchmark covering 11 diverse surgical specialities, such as vascular, cardiology, and thoracic. Extensive experiments on the proposed benchmark and three additional surgical downstream tasks (action recognition, skill assessment, and triplet recognition) show that SurgLLaVA-Video, with only three billion parameters, significantly outperforms both general-purpose and surgical-specific VLMs.

Published

2026-03-14

How to Cite

Li, Y., Yang, X., Xu, D., YU, Y., Zhao, L., Hu, X., … Heng, P.-A. (2026). SurgPub-Video: A Comprehensive Surgical Video Framework for Enhanced Surgical Intelligence in Vision-Language Model. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6628–6635. https://doi.org/10.1609/aaai.v40i8.37593

Section

AAAI Technical Track on Computer Vision V