Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Authors

  • Evgeniia Vu Constructor University
  • Andrei Boiarov Constructor Tech
  • Dmitry Vetrov Constructor University

DOI:

https://doi.org/10.1609/aaai.v40i31.39807

Abstract

Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce a novel framework for streaming gesture generation that extends Rolling Diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. Our framework is universally compatible with existing diffusion-based gesture generation models, transforming them into streaming methods capable of continuous generation without requiring post-processing. We evaluate our framework on ZEGGS and BEAT, strong benchmarks for real-world applicability. Applied to state-of-the-art baselines on both datasets, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time co-speech gesture synthesis. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that employs a ladder-based noise scheduling strategy to simultaneously denoise multiple frames. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 4× speedup with high visual fidelity and temporal coherence in our experiments. Comprehensive user studies further validate our framework’s ability to generate realistic, diverse gestures closely synchronized with the audio input.
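
The abstract does not spell out the noise schedules, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes a sliding window in which later frames sit on higher noise "rungs", and a ladder variant in which blocks of `step` frames share a rung so that one denoiser call finishes several frames at once. The function names `ladder_rungs` and `denoiser_calls`, the integer rung encoding, and the window sizes are all hypothetical choices made for illustration.

```python
def ladder_rungs(window: int, step: int = 1) -> list[int]:
    """Assign each frame in the window an integer noise rung (assumed layout).

    Rung k out of `window // step` stands for noise level k / (window // step).
    With step=1 every frame has its own rung (progressive schedule);
    with step>1 consecutive blocks of `step` frames share a rung (ladder).
    """
    if step < 1 or window % step != 0:
        raise ValueError("window must be a positive multiple of step")
    return [i // step + 1 for i in range(window)]


def denoiser_calls(num_frames: int, window: int, step: int = 1) -> int:
    """Count the denoiser calls this toy loop needs to emit `num_frames` frames."""
    top_rung = window // step          # rung that represents pure noise
    rungs = ladder_rungs(window, step)
    emitted = 0
    calls = 0
    while emitted < num_frames:
        # One (hypothetical) denoiser call lowers every frame by one rung.
        rungs = [r - 1 for r in rungs]
        calls += 1
        # The leading `step` frames are now clean: emit them and
        # append `step` fresh pure-noise frames at the end of the window.
        assert all(r == 0 for r in rungs[:step])
        rungs = rungs[step:] + [top_rung] * step
        emitted += step
    return calls


if __name__ == "__main__":
    print(ladder_rungs(8, step=1))
    print(ladder_rungs(8, step=4))
    print(denoiser_calls(64, window=8, step=1))
    print(denoiser_calls(64, window=8, step=4))
```

In this toy count, a ladder step of 4 needs a quarter of the denoiser calls of the per-frame schedule for the same number of frames. Whether this bookkeeping matches the mechanism behind the reported speedup is an assumption; the paper itself should be consulted for the actual RDLA schedule.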

Published

2026-03-14

How to Cite

Vu, E., Boiarov, A., & Vetrov, D. (2026). Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26054–26061. https://doi.org/10.1609/aaai.v40i31.39807

Section

AAAI Technical Track on Machine Learning VIII