Schedae Informaticae

Accidental exploration through value predictors

Publication date: 2018

Schedae Informaticae, 2018, Volume 27, pp. 107 - 127

https://doi.org/10.4467/20838476SI.18.009.10414

Authors

Tomasz Kisielewski
Faculty of Mathematics and Computer Science, Jagiellonian University, Krakow, Poland
Damian Leśniak
Faculty of Mathematics and Computer Science, Jagiellonian University, Krakow, Poland

Abstract

The theoretical foundations of reinforcement learning almost universally assume trajectories of infinite length. In practice, learning occurs on finite trajectories. In this paper we examine a specific consequence of this disparity, namely a strong bias of the time-bounded every-visit Monte Carlo value estimator. This bias manifests as a vastly different learning dynamic for algorithms that use value predictors, and it can either encourage or discourage exploration. We investigate these claims theoretically for a one-dimensional random walk and empirically on a number of simple environments. We use Generalized Advantage Estimation (GAE) as an algorithm involving a value predictor and evolution strategies as a reference point.
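
The bias described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the setup (a symmetric random walk on the integers starting at 0, a constant reward on every step, no terminal states) and all names and default values, including `truncated_every_visit_mc`, are assumptions made for illustration. Under these assumptions the true infinite-horizon value of every state is `reward / (1 - gamma)`, while the time-bounded every-visit Monte Carlo estimate only sums rewards up to the horizon.

```python
import random
from collections import defaultdict


def truncated_every_visit_mc(num_episodes=2000, horizon=50, gamma=0.99,
                             reward=-1.0, seed=0):
    """Every-visit Monte Carlo value estimates from time-bounded trajectories.

    Illustrative setup (an assumption, not the paper's experiment): a
    symmetric random walk on the integers starting at 0, with the same
    reward on every step and no terminal states.
    """
    rng = random.Random(seed)
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)

    for _ in range(num_episodes):
        # Roll out one trajectory of exactly `horizon` steps.
        position = 0
        visited = []
        for _ in range(horizon):
            visited.append(position)
            position += rng.choice((-1, 1))

        # Walk backwards, accumulating the discounted return. The return
        # is cut off at the horizon, which is the source of the bias.
        truncated_return = 0.0
        for state in reversed(visited):
            truncated_return = reward + gamma * truncated_return
            return_sums[state] += truncated_return
            visit_counts[state] += 1

    return {s: return_sums[s] / visit_counts[s] for s in return_sums}


estimates = truncated_every_visit_mc()
true_value = -1.0 / (1.0 - 0.99)  # identical for every state in this setup
```

In this sketch, states far from the origin can only be visited late in a trajectory, when few steps remain before the horizon, so their truncated returns are closer to zero than those of the starting state. With a negative per-step reward, distant states therefore look better than they are, which would favour exploration; with a positive reward the effect reverses.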

Information

Article type: Original research article

Article status: Open

Licence: CC BY-NC-ND

Percentage share of authors:

Tomasz Kisielewski (Author) - 50%
Damian Leśniak (Author) - 50%

Article corrections:

-

Publication languages:

English
