Jin, Y., Gummadi, R., Zhou, Z. & Blanchet, J. (2024). Feasible Q-Learning for Average Reward Reinforcement Learning. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1630-1638. Available from https://proceedings.mlr.press/v238/jin24b.html.

Abstract

Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e., the long-run average reward) of continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward Q-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted Q-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is ε-close to optimal, (ii) estimate the optimal average reward with ε-accuracy, and (iii) estimate the bias function (similar to the Q-function in the discounted case) with ε-accuracy. We show that, with carefully designed adaptation schemes, each of (i), (ii), and (iii) can be achieved with a finite number of samples, with bounds depending on the mixing time t_mix and an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in t_mix for a feasible variant of Q-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community's best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input.
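To make the core idea concrete, the following is a minimal illustrative sketch (not the authors' exact scheme or adaptation schedule, whose details are in the paper): synchronous tabular Q-learning where the discount factor γ is pushed toward 1 across epochs, so the effective horizon 1/(1-γ) grows and the scaled greedy values (1-γ)·max_a Q(s,a) serve as a proxy for the optimal average reward. The epoch counts, step sizes, and the halving schedule for 1-γ are hypothetical choices for illustration.

```python
import numpy as np

def feasible_q_learning(P, r, epochs=3, iters_per_epoch=500, gamma0=0.9):
    """Hedged sketch of discount-adapted synchronous Q-learning.

    P: transition tensor of shape (S, A, S); r: reward matrix of shape (S, A).
    Uses exact expectations (P[s, a] @ V) for simplicity; a sample-based
    variant would instead draw s' ~ P[s, a] for every (s, a) pair.
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    gamma = gamma0
    for _ in range(epochs):
        for t in range(iters_per_epoch):
            alpha = 1.0 / (t + 2)          # diminishing step size (illustrative)
            V = Q.max(axis=1)              # greedy value estimate
            # Synchronous update: every (s, a) pair is updated each step.
            Q = (1 - alpha) * Q + alpha * (r + gamma * (P @ V))
        # Crude average-reward proxy from the current effective horizon.
        rho_hat = (1 - gamma) * Q.max(axis=1).mean()
        gamma = 1 - (1 - gamma) / 2        # push the effective horizon out
    return Q, rho_hat
```

The key design point mirrored from the abstract is that no mixing time is supplied: the schedule for γ is fixed in advance, and the growing horizon progressively trades discounted accuracy for average-reward accuracy.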

Authors
Ying Jin, Ramki Gummadi, Zhengyuan Zhou, Jose Blanchet
Publication date
2024/4/18
Conference
International Conference on Artificial Intelligence and Statistics
Pages
1630-1638
Publisher
PMLR