Uncorrected Least-Squares Temporal Difference with Lambda-Return
Temporal difference learning, TD(λ), is a foundation of reinforcement learning and is also of interest in its own right for prediction tasks. Recently, true online TD(λ) has been shown to closely approximate the “forward view” at every step, whereas conventional TD(λ) does so only at the end of an episode. We re-examine least-squares temporal difference, LSTD(λ), which has been derived from conventional TD(λ). We design Uncorrected LSTD(λ) so that, when λ = 1, Uncorrected LSTD(1) is equivalent to the least-squares method for the linear regression of the Monte Carlo (MC) return at every step, whereas conventional LSTD(1) has this equivalence only at the end of an episode, since the MC return is corrected to be unbiased. We prove that Uncorrected LSTD(λ) can have smaller variance than conventional LSTD(λ), which allows Uncorrected LSTD(λ) to sometimes outperform conventional LSTD(λ) in practice. When λ = 0, however, Uncorrected LSTD(0) is not equivalent to LSTD. We thus also propose Mixed LSTD(λ), which mixes the two LSTD(λ)s so that it matches conventional LSTD(λ) at λ = 0 and Uncorrected LSTD(λ) at λ = 1. In numerical experiments, we study how the three LSTD(λ)s behave under limited training data.
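To make the end-of-episode equivalence for conventional LSTD(1) concrete, the following is a minimal sketch, not the Uncorrected or Mixed variant proposed here. It assumes the standard textbook form of conventional LSTD(λ) with accumulating eligibility traces and a terminal state whose feature vector is zero. All function names, the discount factor, and the random episode are hypothetical choices made only for illustration. The sketch accumulates the LSTD(1) statistics over one episode and compares them with the normal equations of a least-squares regression of the MC return on the features.

```python
import random


def lstd_lambda_stats(features, rewards, gamma, lam):
    """Accumulate conventional LSTD(lambda) statistics A and b over one episode.

    features[t] is the feature vector of the state visited at step t, and
    rewards[t] is the reward received after leaving that state. The state
    reached after the last step is assumed terminal (zero feature vector).
    """
    d = len(features[0])
    T = len(features)
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    z = [0.0] * d  # accumulating eligibility trace
    for t in range(T):
        phi = features[t]
        phi_next = features[t + 1] if t + 1 < T else [0.0] * d
        z = [gamma * lam * z[i] + phi[i] for i in range(d)]
        diff = [phi[j] - gamma * phi_next[j] for j in range(d)]
        for i in range(d):
            b[i] += z[i] * rewards[t]
            for j in range(d):
                A[i][j] += z[i] * diff[j]
    return A, b


def mc_regression_stats(features, rewards, gamma):
    """Normal equations (X^T X, X^T G) for regressing the MC return G on features."""
    d = len(features[0])
    T = len(features)
    returns = [0.0] * T
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g  # discounted return from step t onward
        returns[t] = g
    XtX = [[sum(features[t][i] * features[t][j] for t in range(T))
            for j in range(d)] for i in range(d)]
    Xty = [sum(features[t][i] * returns[t] for t in range(T)) for i in range(d)]
    return XtX, Xty


def max_abs_diff(M, N):
    """Largest absolute entrywise difference between two equally shaped matrices."""
    return max(abs(x - y) for rm, rn in zip(M, N) for x, y in zip(rm, rn))


# Hypothetical episode: 6 steps, 3 random features per state, random rewards.
rng = random.Random(0)
gamma = 0.9
features = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(6)]
rewards = [rng.uniform(-1.0, 1.0) for _ in range(6)]

A, b = lstd_lambda_stats(features, rewards, gamma, lam=1.0)
XtX, Xty = mc_regression_stats(features, rewards, gamma)

print("max |A - X^T X| =", max_abs_diff(A, XtX))
print("max |b - X^T G| =", max_abs_diff([b], [Xty]))
```

Under these assumptions the two pairs of statistics agree up to floating-point rounding once the episode has ended, so solving Aθ = b gives the same weights as the MC regression. The sketch only checks the end-of-episode case and says nothing about intermediate steps.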