VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

Authors

  • Qimao Chen, Tsinghua University
  • Fang Li, University of Macau; Xiaomi EV
  • Shaoqing Xu, University of Macau; Xiaomi EV
  • Zhiyi Lai, Xiaomi EV
  • Zixun Xie, Peking University
  • Yuechen Luo, Tsinghua University
  • Shengyin Jiang, Xiaomi EV
  • Hanbing Li, Xiaomi EV
  • Long Chen, Xiaomi EV
  • Bing Wang, Xiaomi EV
  • Yi Zhang, Tsinghua University
  • Zhi-Xin Yang, University of Macau

DOI:

https://doi.org/10.1609/aaai.v40i4.37290

Abstract

The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, in which rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions, including safety-critical scenario generation and closed-loop learning, often rely on rule-based heuristics, resampling methods, and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision-Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such a two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extends beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly its ability to navigate critical long-tail events.

Published

2026-03-14

How to Cite

Chen, Q., Li, F., Xu, S., Lai, Z., Xie, Z., Luo, Y., Jiang, S., Li, H., Chen, L., Wang, B., Zhang, Y., & Yang, Z.-X. (2026). VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2984-2992. https://doi.org/10.1609/aaai.v40i4.37290

Section

AAAI Technical Track on Computer Vision I