For the SL phase, they ran 340 million training updates with a batch size of 16, so about 5.4 billion position-updates. However, the database had only 29 million unique positions, which works out to roughly 190 gradient updates touching each unique position.
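A quick back-of-envelope check of that arithmetic, using only the figures quoted above (the variable names are just for illustration):

```python
# Rough arithmetic for the SL policy-network phase (numbers from the text above).
sl_updates = 340e6        # gradient updates
sl_batch = 16             # positions per minibatch
unique_positions = 29e6   # unique positions in the database

position_updates = sl_updates * sl_batch                  # ~5.4e9 position-updates
passes_per_position = position_updates / unique_positions
print(f"{position_updates:.2e} position-updates, "
      f"~{passes_per_position:.0f} passes over each unique position")
# -> 5.44e+09 position-updates, ~188 passes over each unique position
```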
The self-play RL phase of AlphaGo consisted of 10,000 minibatches of 128 games each, so roughly 1.3 million games in total. They trained that stage for only a day.
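The same kind of sanity check for the self-play RL phase, again using only the counts stated above:

```python
# Rough arithmetic for the self-play RL policy phase.
rl_minibatches = 10_000
games_per_minibatch = 128

total_games = rl_minibatches * games_per_minibatch
print(f"~{total_games / 1e6:.2f} million self-play games")
# -> ~1.28 million self-play games
```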
They spent more time training the value network: 50 million minibatches of 32 board positions, so about 1.6 billion position-updates. That's still only about a third of the SL phase's 5.4 billion.
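And for the value network, including the comparison against the SL phase's total (a sketch assuming only the numbers given in the text):

```python
# Rough arithmetic for the value-network phase, compared with the SL phase.
vn_minibatches = 50e6
vn_batch = 32
sl_position_updates = 340e6 * 16   # SL-phase total from above

vn_positions = vn_minibatches * vn_batch   # ~1.6e9 position-updates
ratio = vn_positions / sl_position_updates
print(f"{vn_positions:.1e} value-network position-updates, "
      f"{ratio:.0%} of the SL phase's total")
# -> 1.6e+09 value-network position-updates, 29% of the SL phase's total
```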