BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic...

Source

http://arxiv.org/abs/2605.27293v1