I’m referring to figure 1a on page 4 and the explanation below it. I can’t be sure, but self-play should be contributing a large part of the training, and it can keep improving the algorithm even if the expert database stays fixed.
They spent three weeks training the supervised policy and one day training the reinforcement learning policy (initialized from the supervised policy), plus an additional week extracting the value function from the reinforcement learning policy (pages 25-26).
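Roughly, the pipeline described there looks like the sketch below. This is a toy illustration of the data flow only, based on my reading of pages 25-26; the 9-point "board", the move-counting "policy" and the constant "value net" are placeholders I made up, nothing like the real convolutional networks.

```python
import random

# Toy sketch of the three training stages (pp. 25-26). Everything concrete
# here (the 9-point "board", the move-count "policy", the constant "value
# net") is an illustrative stand-in, not the paper's architecture.

def train_sl_policy(expert_moves):
    """Stage 1 (~3 weeks): supervised learning to predict expert moves."""
    counts = [1 + expert_moves.count(m) for m in range(9)]
    return lambda board: random.choices(range(9), weights=counts)[0]

def self_play(policy):
    """One toy self-play game: a few moves, then a random winner."""
    moves = [policy([0] * 9) for _ in range(5)]
    return moves, random.choice([+1, -1])

def train_rl_policy(sl_policy, games=100):
    """Stage 2 (~1 day): policy-gradient RL, initialized from the SL weights
    and improved purely by self-play; the expert data is no longer needed."""
    rl_policy = sl_policy
    for _ in range(games):
        moves, outcome = self_play(rl_policy)
        # a REINFORCE-style update towards the winner's moves would go here
    return rl_policy

def train_value_net(rl_policy, games=100):
    """Stage 3 (~1 week): regress position -> self-play outcome under the
    RL policy; this is the only RL-derived piece of the final system."""
    mean = sum(self_play(rl_policy)[1] for _ in range(games)) / games
    return lambda board: mean

sl = train_sl_policy(expert_moves=[4, 4, 2, 6, 4])
rl = train_rl_policy(sl)
value_net = train_value_net(rl)
print(value_net([0] * 9))
```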
In the final system, the only part that depends on RL is the value function. According to figure 4, if the value function is taken out, the system still plays better than any other Go program, though worse than the human champion.
Therefore I would say that the system heavily depends on supervised training on a human-generated dataset. RL was needed to achieve the final performance, but it was not the most important ingredient.
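To be concrete about where that RL-derived value function sits in the final system: it only enters the leaf evaluation of the search, mixed with a fast rollout. A minimal sketch, assuming the paper’s mixing V(s) = (1 − λ)·v_θ(s) + λ·z; v_theta and fast_rollout below are random placeholders, not the real networks.

```python
import random

# Leaf evaluation in the final system: mix the value network's estimate
# with the outcome of a fast rollout. v_theta and fast_rollout are
# placeholders standing in for the trained networks.

def v_theta(position):
    """Value network: trained on RL self-play outcomes (the only RL part)."""
    return random.uniform(-1, 1)

def fast_rollout(position):
    """Play the position out with the fast rollout policy; return +1 or -1."""
    return random.choice([+1, -1])

def evaluate_leaf(position, lam=0.5):
    """V(s) = (1 - lambda) * v_theta(s) + lambda * z_rollout.
    Taking the value function out (the figure 4 comparison above) roughly
    corresponds to lam = 1.0 here; value-net-only would be lam = 0.0."""
    return (1 - lam) * v_theta(position) + lam * fast_rollout(position)

print(evaluate_leaf(position=None))
```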
Cite? They use the supervised network for policy selection (i.e. tree pruning), which is a critical part of the system.
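To spell out what I mean by pruning: the supervised network’s output is used as a prior over moves inside the tree, so moves it considers unlikely are almost never explored. A rough sketch, assuming my reading of the in-tree selection rule; the c_puct value and the toy numbers are made up.

```python
import math

# In-tree move selection steered by the SL policy's priors. edges maps a
# move to (Q, N, P): mean value, visit count, and prior probability
# P = p_sigma(move | state) from the supervised policy network.

def select_move(edges, c_puct=5.0):
    """Pick argmax of Q + u, where u grows with the prior P and shrinks
    with the visit count N, so low-prior moves are effectively pruned."""
    total_visits = sum(n for _, n, _ in edges.values())

    def score(edge):
        q, n, p = edge
        return q + c_puct * p * math.sqrt(total_visits) / (1 + n)

    return max(edges, key=lambda move: score(edges[move]))

# Toy example: with equal Q values, the higher-prior move gets searched.
edges = {"D4": (0.0, 10, 0.40), "Q16": (0.0, 10, 0.35), "A1": (0.0, 1, 0.01)}
print(select_move(edges))
```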