Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generative model. In this wo...

Source

http://arxiv.org/abs/2606.16729v1