Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion

Authors

  • Meng Wei, King's College London
  • Kun Yuan, Technical University of Munich; University of Strasbourg; Munich Center for Machine Learning
  • Shi Li, University of Strasbourg
  • Yue Zhou, Technical University of Munich
  • Long Bai, The Chinese University of Hong Kong
  • Nassir Navab, Technical University of Munich
  • Hongliang Ren, The Chinese University of Hong Kong
  • Hong Joo Lee, Technical University of Munich; Munich Center for Machine Learning
  • Tom Vercauteren, King's College London
  • Nicolas Padoy, University of Strasbourg

DOI

https://doi.org/10.1609/aaai.v40i13.38027

Abstract

Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.

Published

2026-03-14

How to Cite

Wei, M., Yuan, K., Li, S., Zhou, Y., Bai, L., Navab, N., Ren, H., Lee, H. J., Vercauteren, T., & Padoy, N. (2026). Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(13), 10548-10556. https://doi.org/10.1609/aaai.v40i13.38027

Section

AAAI Technical Track on Computer Vision X