Temporal-Difference Search in Computer Go

Authors

  • David Silver, University College London
  • Richard Sutton, University of Alberta
  • Martin Mueller, University of Alberta

DOI:

https://doi.org/10.1609/icaps.v23i1.13578

Keywords:

Reinforcement Learning, Simulation-Based Planning

Abstract

Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning: TD search, which combines TD learning with simulation-based search. Like Monte-Carlo tree search, it updates value estimates by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We applied TD search to the game of 9x9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search given the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9x9 Computer Go Server.
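
To make the idea concrete, the core of TD search can be sketched as a TD(0) update of a linear value function over binary features, applied to episodes of simulated self-play from the current position. This is a minimal sketch, not the paper's implementation: the simulator interface (simulate_policy, features, is_terminal, outcome) and all parameter names are hypothetical stand-ins.

import numpy as np

def td_search(root_state, simulate_policy, features, num_features,
              n_simulations=1000, alpha=0.1):
    """Illustrative TD(0) search with linear value function approximation.

    Hypothetical interfaces (assumptions, not from the paper):
      - simulate_policy(state, theta, features): selects a move using the
        current value estimates and returns the resulting state
      - features(state): array of indices of the binary features active in state
      - state.is_terminal() / state.outcome(): terminal test and final reward
    """
    theta = np.zeros(num_features)  # one weight per binary pattern feature

    def value(state):
        # Linear value estimate: sum of the weights of the active features
        return theta[features(state)].sum()

    for _ in range(n_simulations):
        state = root_state
        while not state.is_terminal():
            next_state = simulate_policy(state, theta, features)
            # Bootstrapped TD target: final outcome if the simulation has
            # ended, otherwise the current estimate of the successor state
            target = (next_state.outcome() if next_state.is_terminal()
                      else value(next_state))
            # TD(0) update applied only to the features active in `state`
            delta = target - value(state)
            theta[features(state)] += alpha * delta
            state = next_state

    return theta

Because the update bootstraps from the value of the successor state rather than waiting for the simulation's outcome, and because each weight is shared by every position in which its pattern appears, experience from one simulation generalises to related states, which is the contrast with a vanilla Monte-Carlo tree search drawn in the abstract.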

Published

2013-06-02

How to Cite

Silver, D., Sutton, R., & Mueller, M. (2013). Temporal-Difference Search in Computer Go. Proceedings of the International Conference on Automated Planning and Scheduling, 23(1), 486-487. https://doi.org/10.1609/icaps.v23i1.13578