Abstract
Among the obstacles to applying reinforcement learning (RL) to real-world problems, two factors are critical: limited data and the mismatch between the test environment (the real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper addresses both issues simultaneously with offline distributionally robust RL, in which a distributionally robust policy is learned from historical data collected in the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and design a novel linear function approximation framework that robustifies the latent space. The framework is instantiated in two settings: one where the dataset is well-explored and one where the dataset has weaker coverage. In addition, we introduce a value shift algorithmic technique tailored to the distributionally robust nature of the problem, which contributes to both our improved theoretical results and our empirical performance. Sample complexity bounds are established for each setting, constituting the first non-asymptotic results in these settings; the bounds are expressed in terms of the dimension $d$ of the linear function space and the number of trajectories $N$ in the dataset. Diverse experiments corroborate our theoretical findings and demonstrate the superiority of our algorithms over their non-robust counterpart.
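For concreteness, the following is a minimal sketch of the distributionally robust objective described above, written for a finite-horizon setting with a Kullback-Leibler uncertainty set of radius $\rho$ around a nominal transition kernel $P^o$; the specific divergence, radius notation, and horizon convention here are illustrative assumptions, not taken verbatim from the paper:
\[
V^{\pi}_{\mathrm{rob}}(s) \;=\; \inf_{P \in \mathcal{U}_\rho(P^o)} \mathbb{E}_{P,\pi}\!\left[\sum_{h=1}^{H} r_h(s_h, a_h) \,\middle|\, s_1 = s\right],
\qquad
\mathcal{U}_\rho(P^o) \;=\; \bigl\{P : D_{\mathrm{KL}}\!\bigl(P(\cdot\mid s,a)\,\big\|\,P^o(\cdot\mid s,a)\bigr) \le \rho \ \ \forall (s,a)\bigr\},
\]
and the learner seeks $\pi^{\star} \in \arg\max_{\pi} V^{\pi}_{\mathrm{rob}}(s_1)$ using only offline trajectories generated under the nominal kernel $P^o$.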
Authors
Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Mingwen Liu, Jiheng Zhang, Zhengyuan Zhou