Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative samplin...

Source

http://arxiv.org/abs/2605.21282v1