Wang, S., Blanchet, J., & Glynn, P. (2023). Optimal Sample Complexity for Average Reward Markov Decision Processes. arXiv. https://arxiv.org/abs/2310.08833
Abstract
We settle the sample complexity of policy learning for the maximization of the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde{O}(|S||A| t_{\mathrm{mix}}^{2} \epsilon^{-2})$ and a lower bound of $\Omega(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces, respectively, $t_{\mathrm{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $\epsilon$ signifies the error tolerance. Therefore, a notable gap of $t_{\mathrm{mix}}$ still remains to be bridged. Our primary contribution is to establish an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$, effectively reaching the lower bound in the literature. This is achieved by combining algorithmic ideas in Jin and Sidford (2021) with those of Li et al. (2020).
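For quick reference, the three rates discussed in the abstract, restated side by side (here $\widetilde{O}(\cdot)$ suppresses polylogarithmic factors):

Prior upper bound:  $\widetilde{O}(|S||A|\, t_{\mathrm{mix}}^{2}\, \epsilon^{-2})$
Known lower bound:  $\Omega(|S||A|\, t_{\mathrm{mix}}\, \epsilon^{-2})$
This paper:         $\widetilde{O}(|S||A|\, t_{\mathrm{mix}}\, \epsilon^{-2})$  (matches the lower bound, closing the remaining factor-$t_{\mathrm{mix}}$ gap)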
Authors
Shengbo Wang, Jose Blanchet, Peter Glynn
Publication date
2023/10/13
Journal
arXiv preprint arXiv:2310.08833