Temporal-Difference Search in Computer Go

Authors

  • David Silver, University College London
  • Richard Sutton, University of Alberta
  • Martin Mueller, University of Alberta

DOI:

https://doi.org/10.1609/icaps.v23i1.13578

Keywords:

Reinforcement Learning, Simulation-Based Planning

Abstract

Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning: TD search, which combines TD learning with simulation-based search. Like Monte-Carlo tree search, it updates value estimates by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We applied TD search to the game of 9x9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search given the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9x9 Computer Go Server.
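
To make the idea concrete, the core of TD search can be sketched as a TD(0) update of a linear value function over binary features, applied to episodes of simulated self-play from the current position. This is a minimal sketch, not the paper's implementation: the simulator interface (simulate_policy, features, is_terminal, outcome) and all parameter names are hypothetical stand-ins.

import numpy as np

def td_search(root_state, simulate_policy, features, num_features,
              n_simulations=1000, alpha=0.1):
    """Illustrative TD(0) search with linear value function approximation.

    Hypothetical interfaces (assumptions, not from the paper):
      - simulate_policy(state, theta, features): selects a move using the
        current value estimates and returns the resulting state
      - features(state): array of indices of the binary features active in state
      - state.is_terminal() / state.outcome(): terminal test and final reward
    """
    theta = np.zeros(num_features)  # one weight per binary pattern feature

    def value(state):
        # Linear value estimate: sum of the weights of the active features
        return theta[features(state)].sum()

    for _ in range(n_simulations):
        state = root_state
        while not state.is_terminal():
            next_state = simulate_policy(state, theta, features)
            # Bootstrapped TD target: final outcome if the simulation has
            # ended, otherwise the current estimate of the successor state
            target = (next_state.outcome() if next_state.is_terminal()
                      else value(next_state))
            # TD(0) update applied only to the features active in `state`
            delta = target - value(state)
            theta[features(state)] += alpha * delta
            state = next_state

    return theta

Because the update bootstraps from the value of the successor state rather than waiting for the simulation's outcome, and because each weight is shared by every position in which its pattern appears, experience from one simulation generalises to related states, which is the contrast with a vanilla Monte-Carlo tree search drawn in the abstract.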

Published

2013-06-02

How to Cite

Silver, D., Sutton, R., & Mueller, M. (2013). Temporal-Difference Search in Computer Go. Proceedings of the International Conference on Automated Planning and Scheduling, 23(1), 486-487. https://doi.org/10.1609/icaps.v23i1.13578