Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J. & Glynn, P. (2021). Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3331-3339. Available from https://proceedings.mlr.press/v130/zhou21d.html.


Abstract

While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness, or the lack thereof, remains an important issue that is inadequately addressed. In this paper, we provide a distributionally robust formulation of offline policy learning in tabular RL that aims to learn, from historical data collected by some other behavior policy, a policy that is robust to a future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value of any given policy (i.e., how well it performs in a perturbed environment) and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves O(1/√n) regret, meaning that with high probability, the policy learned from n training data points will be O(1/√n)-close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach over non-robust RL algorithms.
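To make the distributionally robust idea concrete, the sketch below runs robust value iteration on a toy tabular MDP, taking the worst case over a KL-divergence ball of radius delta around an empirically estimated transition model. This is only an illustrative assumption of the setup, not the paper's exact algorithm or uncertainty set; the inner minimization uses the standard dual form inf_{q: KL(q||p̂)≤δ} E_q[v] = sup_{β>0} (−β log E_{p̂}[exp(−v/β)] − βδ), and all function names and constants here are invented for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_expectation(v, p_hat, delta):
    """Worst-case expectation of v over the KL ball of radius delta around p_hat (dual form)."""
    m = v.min()  # shift for numerical stability of the log-sum-exp
    def neg_dual(beta):
        return -(m - beta * np.log(np.dot(p_hat, np.exp(-(v - m) / beta))) - beta * delta)
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

def robust_value_iteration(P_hat, R, gamma=0.9, delta=0.1, n_iter=200):
    """P_hat: (S, A, S) estimated transition probabilities, R: (S, A) rewards."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                # robust Bellman backup: pessimistic next-state value over the KL ball
                Q[s, a] = R[s, a] + gamma * robust_expectation(V, P_hat[s, a], delta)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)  # robust value and greedy robust policy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 2
    # stand-in for a transition model estimated from offline data collected by a behavior policy
    P_hat = rng.dirichlet(np.ones(S), size=(S, A))
    R = rng.uniform(size=(S, A))
    V, pi = robust_value_iteration(P_hat, R)
    print("robust values:", np.round(V, 3), "robust policy:", pi)
```

Setting delta = 0 recovers standard (non-robust) value iteration on the estimated model; increasing delta makes the learned policy more conservative toward perturbations of the training environment.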

Authors: Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, Peter Glynn
Publication date: 2021/3/18
Conference: International Conference on Artificial Intelligence and Statistics
Pages: 3331-3339
Publisher: PMLR