Data-Efficient and Contact-Rich Manipulation Through Diffusion Augmentation and Vision-Language Models

Authors

  • Daniel Seita, University of Southern California

DOI:

https://doi.org/10.1609/aaai.v40i47.41353

Abstract

Recent progress in robot learning has produced impressive results, yet many systems still require large datasets of demonstrations and are less effective in clutter or with highly deformable objects. This talk presents work on data-efficient manipulation using (i) diffusion-based augmentation that synthesizes geometrically consistent images and action labels to reduce demonstration requirements and (ii) Vision-Language Models (VLMs) that inject high-level semantics for contact-rich motion planning in clutter. We will also introduce ManipBench, which evaluates VLMs’ abilities for low-level manipulation. Together, these efforts show how the community can move toward robot manipulators that learn and operate with fewer demonstrations in cluttered and real-world environments.

Published

2026-03-14

How to Cite

Seita, D. (2026). Data-Efficient and Contact-Rich Manipulation Through Diffusion Augmentation and Vision-Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(47), 39830–39830. https://doi.org/10.1609/aaai.v40i47.41353