One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow
DOI:
https://doi.org/10.1609/aaai.v40i31.39885

Abstract
We introduce a one-step generative policy for offline reinforcement learning that maps *noise* directly to *actions* via a *residual reformulation* of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to reformulate MeanFlow to enable *direct noise-to-action generation* by integrating the velocity field and noise-to-action transformation into a single policy network, eliminating the need for separate velocity estimation. We explore several reformulation variants and identify an effective *residual formulation* that supports expressive and stable policy learning. Our method offers three key advantages: 1) efficient one-step noise-to-action generation, 2) expressive modelling of multimodal action distributions, and 3) efficient and stable policy learning via Q-learning in a single-stage training setup. Extensive experiments on 73 tasks across the OGBench and D4RL benchmarks demonstrate that our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.
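To make the abstract's central idea concrete, below is a minimal PyTorch sketch of a residual one-step noise-to-action policy trained with a single-stage Q-learning objective. The class and variable names (`ResidualOneStepPolicy`, `policy_loss`, `alpha`), the exact residual form `a = z + f_theta(s, z)`, and the simple squared-error matching term are illustrative assumptions, not the authors' implementation; in particular, the matching term here merely stands in for the paper's MeanFlow-based loss.

```python
# Sketch only: names and loss forms are assumptions, not the paper's exact method.
import torch
import torch.nn as nn

class ResidualOneStepPolicy(nn.Module):
    """One-step noise-to-action policy: a = z + f_theta(s, z).

    The residual network f_theta folds the velocity field and the
    noise-to-action transport into a single forward pass, so sampling
    an action needs no ODE integration and no distilled teacher.
    """
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.action_dim = action_dim
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Draw Gaussian noise and map it to an action in one step.
        z = torch.randn(state.size(0), self.action_dim, device=state.device)
        return z + self.f(torch.cat([state, z], dim=-1))

def policy_loss(policy, critic, state, dataset_action, alpha: float = 1.0):
    """Single-stage objective (sketch): matching term + Q-maximization term."""
    a = policy(state)
    match_term = ((a - dataset_action) ** 2).mean()  # placeholder for the MeanFlow matching loss
    q_term = -critic(state, a).mean()                # standard Q-learning policy improvement
    return match_term + alpha * q_term
```

Because the policy is a single forward pass, the Q-term gradient flows directly through the generated action, which is what allows training in one stage rather than pretraining a flow model and distilling it.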
Published
2026-03-14
How to Cite
Wang, Z., Li, D., Chen, Y., Shi, Y., Bai, L., Yu, T., & Fu, Y. (2026). One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 26751–26759. https://doi.org/10.1609/aaai.v40i31.39885
Issue
Vol. 40 No. 31 (2026)
Section
AAAI Technical Track on Machine Learning VIII