IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Authors

  • Xiaoya Lu (School of Integrated Circuits, Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China)
  • Zeren Chen (Shanghai Artificial Intelligence Laboratory, China; School of Software, Beihang University, China)
  • Xuhao Hu (Shanghai Artificial Intelligence Laboratory, China; Fudan University, China)
  • Yijin Zhou (Shanghai Artificial Intelligence Laboratory, China)
  • Weichen Zhang (Shanghai Artificial Intelligence Laboratory, China)
  • Dongrui Liu (Shanghai Artificial Intelligence Laboratory, China)
  • Lu Sheng (School of Software, Beihang University, China)
  • Jing Shao (Shanghai Artificial Intelligence Laboratory, China)

DOI:

https://doi.org/10.1609/aaai.v40i42.40880

Abstract

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, termination-oriented evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
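The process-oriented check mentioned in the abstract can be illustrated with a minimal sketch. This is not the IS-Bench implementation: the `OrderingConstraint` type, the `check_process_safety` function, and all action names are hypothetical. The only idea taken from the abstract is that a mitigation action must be performed before or after a specific risk-prone step, and that a plan is judged on its intermediate steps rather than only on its final state.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class OrderingConstraint:
    """Hypothetical rule: `mitigation` must occur before/after `risky_step`."""
    mitigation: str
    risky_step: str
    when: Literal["before", "after"]


def constraint_satisfied(plan: List[str], c: OrderingConstraint) -> bool:
    """Check one ordering constraint against an ordered list of action names."""
    if c.risky_step not in plan:
        # Assumption: if the risk-prone step never runs, the risk never arises.
        return True
    if c.when == "before":
        # Mitigation must appear before the first occurrence of the risky step.
        first_risky = plan.index(c.risky_step)
        return c.mitigation in plan[:first_risky]
    # "after": mitigation must appear after the last occurrence of the risky step.
    last_risky = len(plan) - 1 - plan[::-1].index(c.risky_step)
    return c.mitigation in plan[last_risky + 1:]


def check_process_safety(plan: List[str],
                         constraints: List[OrderingConstraint]) -> bool:
    """A plan is process-safe only if every ordering constraint holds."""
    return all(constraint_satisfied(plan, c) for c in constraints)


# Hypothetical household scenario with one "before" and one "after" rule.
constraints = [
    OrderingConstraint("move_towel_away", "turn_on_stove", "before"),
    OrderingConstraint("turn_off_stove", "cook_food", "after"),
]
safe_plan = ["move_towel_away", "turn_on_stove", "cook_food", "turn_off_stove"]
unsafe_plan = ["turn_on_stove", "move_towel_away", "cook_food", "turn_off_stove"]
```

Both plans above end in the same state, so a check on the final state alone would not separate them. The ordering check accepts `safe_plan` and rejects `unsafe_plan`, because the towel is moved only after the stove is already on.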

Published

2026-03-14

How to Cite

Lu, X., Chen, Z., Hu, X., Zhou, Y., Zhang, W., Liu, D., Sheng, L., & Shao, J. (2026). IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42), 35680-35688. https://doi.org/10.1609/aaai.v40i42.40880

Section

AAAI Technical Track on Philosophy and Ethics of AI