Accidental exploration through value predictors
Publication date: 2018
Schedae Informaticae, 2018, Volume 27, pp. 107 - 127
https://doi.org/10.4467/20838476SI.18.009.10414
Infinite trajectory length is an almost universal assumption in the theoretical foundations of reinforcement learning. In practice, learning occurs on finite trajectories. In this paper we examine a specific consequence of this disparity, namely a strong bias of the time-bounded every-visit Monte Carlo value estimator. This bias manifests as a vastly different learning dynamic for algorithms that use value predictors, and it can either encourage or discourage exploration. We investigate these claims theoretically for a one-dimensional random walk and empirically on a number of simple environments. We use generalized advantage estimation (GAE) as an algorithm involving a value predictor and evolution strategies as a reference point.
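The bias described in the abstract can be illustrated with a small toy simulation. The sketch below is not the paper's experimental setup: the symmetric random walk on the integers, the constant reward per step, the horizon of 50 steps, the undiscounted return, and the function name `every_visit_mc` are all assumptions chosen for illustration. With an infinite horizon, every state of such a walk would have the same value. With episodes cut off after a fixed number of steps, states far from the start can only be visited late, so their observed returns-to-go are shorter.

```python
import random
from collections import defaultdict


def every_visit_mc(reward_per_step, horizon, episodes, gamma=1.0, seed=0):
    """Every-visit Monte Carlo value estimates for a symmetric 1-D random
    walk whose episodes are cut off after `horizon` steps (toy assumption)."""
    rng = random.Random(seed)
    totals = defaultdict(float)  # sum of observed returns per state
    counts = defaultdict(int)    # number of visits per state
    for _ in range(episodes):
        position = 0
        visited = []
        for _ in range(horizon):
            visited.append(position)
            position += rng.choice((-1, 1))
        # Walk backwards so the truncated return-to-go accumulates step by step.
        ret = 0.0
        for state in reversed(visited):
            ret = reward_per_step + gamma * ret
            totals[state] += ret
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# A penalty of -1 per step: distant states look better than the start state,
# purely because they are reached late in a time-bounded episode.
values = every_visit_mc(reward_per_step=-1.0, horizon=50, episodes=2000)
for state in (0, 5, 10):
    if state in values:
        print(state, round(values[state], 2))
```

Under these assumptions, a per-step penalty makes distant states appear more valuable than the start state, which would reward moving away from it, while flipping the sign of `reward_per_step` reverses the effect. Both are artifacts of the time bound rather than properties of the environment.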
Article type: Original article
Faculty of Mathematics and Computer Science, Jagiellonian University, Krakow, Poland
Article status: Open
Licence: CC BY-NC-ND
Publication languages: English