I’m referring to figure 1a on page 4 and the explanation below it. I can’t be sure, but self-play should be contributing a large part of the training, and it can keep improving the algorithm even if the expert database stays fixed.
They spent three weeks training the supervised policy and one day training the reinforcement learning policy (initialized from the supervised policy), plus an additional week extracting the value function from the reinforcement learning policy (pages 25-26).
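Roughly, the pipeline described there looks like the sketch below. This is a toy illustration of the data flow only, based on my reading of pages 25-26; the 9-point "board", the move-counting "policy" and the constant "value net" are placeholders I made up, nothing like the real convolutional networks.

```python
import random

# Toy sketch of the three training stages (pp. 25-26). Everything concrete
# here (the 9-point "board", the move-count "policy", the constant "value
# net") is an illustrative stand-in, not the paper's architecture.

def train_sl_policy(expert_moves):
    """Stage 1 (~3 weeks): supervised learning to predict expert moves."""
    counts = [1 + expert_moves.count(m) for m in range(9)]
    return lambda board: random.choices(range(9), weights=counts)[0]

def self_play(policy):
    """One toy self-play game: a few moves, then a random winner."""
    moves = [policy([0] * 9) for _ in range(5)]
    return moves, random.choice([+1, -1])

def train_rl_policy(sl_policy, games=100):
    """Stage 2 (~1 day): policy-gradient RL, initialized from the SL weights
    and improved purely by self-play; the expert data is no longer needed."""
    rl_policy = sl_policy
    for _ in range(games):
        moves, outcome = self_play(rl_policy)
        # a REINFORCE-style update towards the winner's moves would go here
    return rl_policy

def train_value_net(rl_policy, games=100):
    """Stage 3 (~1 week): regress position -> self-play outcome under the
    RL policy; this is the only RL-derived piece of the final system."""
    mean = sum(self_play(rl_policy)[1] for _ in range(games)) / games
    return lambda board: mean

sl = train_sl_policy(expert_moves=[4, 4, 2, 6, 4])
rl = train_rl_policy(sl)
value_net = train_value_net(rl)
print(value_net([0] * 9))
```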
In the final system, the only part that depends on RL is the value function. According to figure 4, if the value function is taken out, the system still plays better than any other Go program, though worse than the human champion.
Therefore I would say that the system heavily depends on supervised training on a human-generated dataset. RL was needed to achieve the final performance, but it was not the most important ingredient.
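To be concrete about where that RL-derived value function sits in the final system: it only enters the leaf evaluation of the search, mixed with a fast rollout. A minimal sketch, assuming the paper’s mixing V(s) = (1 − λ)·v_θ(s) + λ·z; v_theta and fast_rollout below are random placeholders, not the real networks.

```python
import random

# Leaf evaluation in the final system: mix the value network's estimate
# with the outcome of a fast rollout. v_theta and fast_rollout are
# placeholders standing in for the trained networks.

def v_theta(position):
    """Value network: trained on RL self-play outcomes (the only RL part)."""
    return random.uniform(-1, 1)

def fast_rollout(position):
    """Play the position out with the fast rollout policy; return +1 or -1."""
    return random.choice([+1, -1])

def evaluate_leaf(position, lam=0.5):
    """V(s) = (1 - lambda) * v_theta(s) + lambda * z_rollout.
    Taking the value function out (the figure 4 comparison above) roughly
    corresponds to lam = 1.0 here; value-net-only would be lam = 0.0."""
    return (1 - lam) * v_theta(position) + lam * fast_rollout(position)

print(evaluate_leaf(position=None))
```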
Cite? They use the supervised network for policy selection (i.e. tree pruning), which is a critical part of the system.
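To spell out what I mean by pruning: the supervised network’s output is used as a prior over moves inside the tree, so moves it considers unlikely are almost never explored. A rough sketch, assuming my reading of the in-tree selection rule; the c_puct value and the toy numbers are made up.

```python
import math

# In-tree move selection steered by the SL policy's priors. edges maps a
# move to (Q, N, P): mean value, visit count, and prior probability
# P = p_sigma(move | state) from the supervised policy network.

def select_move(edges, c_puct=5.0):
    """Pick argmax of Q + u, where u grows with the prior P and shrinks
    with the visit count N, so low-prior moves are effectively pruned."""
    total_visits = sum(n for _, n, _ in edges.values())

    def score(edge):
        q, n, p = edge
        return q + c_puct * p * math.sqrt(total_visits) / (1 + n)

    return max(edges, key=lambda move: score(edges[move]))

# Toy example: with equal Q values, the higher-prior move gets searched.
edges = {"D4": (0.0, 10, 0.40), "Q16": (0.0, 10, 0.35), "A1": (0.0, 1, 0.01)}
print(select_move(edges))
```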