IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu; Zeren Chen; Xuhao Hu; Yijin Zhou; Weichen Zhang; Dongrui Liu; Lu Sheng; Jing Shao

doi:10.1609/aaai.v40i42.40880

Authors

Xiaoya Lu: School of Integrated Circuits, Shanghai Jiao Tong University, China; Shanghai Artificial Intelligence Laboratory, China
Zeren Chen: Shanghai Artificial Intelligence Laboratory, China; School of Software, Beihang University, China
Xuhao Hu: Shanghai Artificial Intelligence Laboratory, China; Fudan University, China
Yijin Zhou: Shanghai Artificial Intelligence Laboratory, China
Weichen Zhang: Shanghai Artificial Intelligence Laboratory, China
Dongrui Liu: Shanghai Artificial Intelligence Laboratory, China
Lu Sheng: School of Software, Beihang University, China
Jing Shao: Shanghai Artificial Intelligence Laboratory, China

Abstract

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, termination-oriented evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
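
To make the idea of a process-oriented check concrete, the following is a minimal sketch of how an ordering constraint between a mitigation action and a risk-prone step might be verified over an agent's action trace. It is an illustration only, not the benchmark's implementation: the names `OrderConstraint`, `constraint_satisfied`, and `process_is_safe`, the string-based action trace, and the example action names are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OrderConstraint:
    """Hypothetical constraint: `mitigation` must occur before or after `risky_step`."""
    mitigation: str
    risky_step: str
    relation: str  # "before" or "after"


def constraint_satisfied(trace: List[str], c: OrderConstraint) -> bool:
    """Check one ordering constraint against a flat list of action names."""
    if c.risky_step not in trace:
        # Assumption: if the risk-prone step never happens, the risk is not triggered.
        return True
    risky_idx = trace.index(c.risky_step)  # first occurrence of the risky step
    if c.relation == "before":
        return c.mitigation in trace[:risky_idx]
    if c.relation == "after":
        return c.mitigation in trace[risky_idx + 1:]
    raise ValueError(f"unknown relation: {c.relation!r}")


def process_is_safe(trace: List[str], constraints: List[OrderConstraint]) -> bool:
    """A trace counts as safe only if every ordering constraint holds."""
    return all(constraint_satisfied(trace, c) for c in constraints)


# Hypothetical household example: the stove must be turned off after cooking,
# and a spill must be wiped before walking across the floor.
constraints = [
    OrderConstraint("turn_off_stove", "cook_food", "after"),
    OrderConstraint("wipe_spill", "walk_across_floor", "before"),
]
safe_trace = ["wipe_spill", "walk_across_floor", "cook_food", "turn_off_stove"]
unsafe_trace = ["walk_across_floor", "wipe_spill", "cook_food"]
```

Under these assumptions, `safe_trace` passes both constraints, while `unsafe_trace` fails both: the spill is wiped only after the floor is crossed, and the stove is never turned off. A check of this kind inspects intermediate steps rather than only the final state, which is the distinction the abstract draws against termination-oriented evaluation.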
