How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preferenc...

Source

http://arxiv.org/abs/2605.21266v1