TY - JOUR
TI - Accidental exploration through value predictors
AU - Kisielewski, Tomasz
AU - Leśniak, Damian
AB - Infinite length of trajectories is an almost universal assumption in the theoretical foundations of reinforcement learning. In practice, learning occurs on finite trajectories. In this paper we examine a specific result of this disparity, namely a strong bias of the time-bounded Every-visit Monte Carlo value estimator. This manifests as a vastly different learning dynamic for algorithms that use value predictors, including encouraging or discouraging exploration. We investigate these claims theoretically for a one-dimensional random walk, and empirically on a number of simple environments. We use GAE as an algorithm involving a value predictor and evolution strategies as a reference point.
PY - 2018
VL - 27
SN - 1732-3916
C1 - 2083-8476
SP - 107
EP - 127
DO - 10.4467/20838476SI.18.009.10414
UR - https://ejournals.eu/en/journal/schedae-informaticae/article/accidental-exploration-through-value-predictors
KW - reinforcement learning
KW - value predictors
KW - exploration
ER - 