It’s a big deal for Go, but I don’t think it’s a very big deal for AGI.
Conceptually, Go is like Chess or Checkers: all are fully deterministic, perfect-information, two-player games.
Go is more challenging for computers because the search space (and in particular the average branching factor) is larger and known position evaluation heuristics are not as good, so traditional alpha-beta minimax search becomes infeasible.
The first big innovation, already put into use by most Go programs for a decade (although the idea is older), was Monte Carlo tree search (MCTS), which addresses the high-branching-factor issue: while traditional search either does not expand a node or expands it and recursively evaluates all its children, MCTS stochastically evaluates nodes, with a probability that depends on how promising they look according to some heuristic.
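For concreteness, here is a minimal UCT-style sketch of that idea (my own illustration in Python, not AlphaGo’s code; the GameState interface with legal_moves/play/is_terminal/winner/player_to_move is hypothetical, the rollout policy is uniform random, and the bias towards promising nodes comes from accumulated win statistics via the UCB term rather than from an external heuristic):

```python
# Minimal UCT-style Monte Carlo tree search sketch (illustrative only).
# `GameState` is a hypothetical interface with .legal_moves(), .play(move),
# .is_terminal(), .winner() and .player_to_move().
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # move -> Node
        self.visits = 0
        self.wins = 0.0      # wins for the player whose move created this node

    def ucb_score(self, child, c=1.4):
        # Promising (high win rate) or under-explored children get picked more often.
        if child.visits == 0:
            return float("inf")
        exploit = child.wins / child.visits
        explore = c * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore

def mcts(root_state, n_simulations=10_000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection / expansion: descend through fully expanded nodes,
        #    preferring children with good statistics, then add one new child.
        while not node.state.is_terminal():
            untried = [m for m in node.state.legal_moves() if m not in node.children]
            if untried:
                move = random.choice(untried)
                node.children[move] = Node(node.state.play(move), parent=node)
                node = node.children[move]
                break
            node = max(node.children.values(), key=node.ucb_score)
        # 2. Rollout: play out randomly from the new node to the end of the game.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        winner = state.winner()
        # 3. Backpropagation: credit each node from the perspective of the
        #    player whose move created it.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.state.player_to_move():
                node.wins += 1.0
            node = node.parent
    # Play the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```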
DeepMind’s innovation consists of using a neural network to learn a good position-evaluation heuristic in a supervised fashion from a large database of professional games, refining it with reinforcement learning in “greedy” self-play mode, and then using both the refined heuristic and the supervised heuristic in an MCTS engine.
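The learned networks then slot into the selection step of the search roughly along these lines (again my own simplified sketch reusing the Node fields from above, not the paper’s exact rule: in the real system the value estimate is accumulated into the node statistics rather than recomputed at each selection, and policy_net, value_net, c_puct and mix are placeholders):

```python
# Sketch of how learned heuristics can steer MCTS selection, in the spirit of
# AlphaGo's approach. `policy_net(state)` is assumed to return a dict of
# move -> prior probability, and `value_net(state)` a scalar in [0, 1].
import math

def select_child(node, policy_net, value_net, c_puct=5.0, mix=0.5):
    """Pick the child move maximizing Q + U, where the exploration bonus U comes
    from the supervised policy prior and Q blends the value network with rollouts."""
    priors = policy_net(node.state)
    best_move, best_score = None, -float("inf")
    for move, child in node.children.items():
        if child.visits > 0:
            # Q: mix of value-network estimate and average rollout outcome.
            q = mix * value_net(child.state) + (1 - mix) * child.wins / child.visits
        else:
            q = 0.0
        # U: exploration bonus proportional to the prior, decaying with visit count.
        u = c_puct * priors[move] * math.sqrt(node.visits) / (1 + child.visits)
        if q + u > best_score:
            best_move, best_score = move, q + u
    return best_move
```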
Their approach essentially relies on big data and big hardware. From an engineering point of view, it is a major advance in neural-network technology because of the sheer scale and, in particular, the speed of the system, which required non-trivial parallelization. But the core techniques aren’t particularly new, and I doubt that they can scale well to more general domains with non-determinism and partial observability. However, neural networks may be more robust to noise and certain kinds of disturbances than hand-coded heuristics, so take this with a grain of salt.
So, to the extent that AGI will rely on large and fast neural networks, this work is a significant step towards practical AGI engineering, but to the extent that AGI will rely on some “master algorithm”, this work is probably not a very big step towards the discovery of such an algorithm, at least compared to previously known techniques.
I think it is a bigger deal than chess because it doesn’t use brute force but mostly unsupervised learning. It is not the breakthrough in AGI, but it is telling that this approach thoroughly beats all the other Go algorithms (it lost only 1 out of 500 games, even with a four-stone handicap), and they say that it still improves with training.
I wouldn’t say that it’s “mostly unsupervised” since a crucial part of their training is done in a traditional supervised fashion on a database of games by professional players.
But it’s certainly much more automated than having a hand-coded heuristic.
Humans also learn extensively by studying the games of experts. In Japan/China, even fans follow games from newspapers.
A game might take an hour on average, so a pro with 10 years of experience may have played/watched upwards of 10,000 games. However, it takes much less time to read a game that has already been played, so a 10-year pro may be familiar with, say, 100,000 games. Considering that each game has 200+ moves, that is roughly a training set on the order of 2 to 20 million positions.
AlphaGo’s training set consisted of 160,000 games with 29 million positions, so the upper end estimate for humans is similar. More importantly, the human training set is far more carefully curated and thus of higher quality.
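Spelling out the arithmetic (the inputs are just the rough figures from this thread, not measured data):

```python
# Back-of-the-envelope check of the position counts above.
games_played   = 10_000      # ~1 game/hour, played or watched live, over 10 years
games_read     = 100_000     # upper bound: reading finished games is much faster
moves_per_game = 200

human_positions_low  = games_played * moves_per_game   # 2,000,000
human_positions_high = games_read   * moves_per_game   # 20,000,000

alphago_positions = 29_000_000   # KGS positions in the supervised training set

print(human_positions_low, human_positions_high, alphago_positions)
# -> 2000000 20000000 29000000: the human upper bound is within ~1.5x of AlphaGo's set
```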
100,000 games over 10 years is 27.4 games a day, on average. I think this is an overestimate.
It was my upper bound estimate, and if anything it was too low.
A pro will grow up in a dedicated Go school where there are hundreds of other players just playing Go and studying Go all day. Some students will be playing speed games, and some will be flipping through summaries of historical games in books, magazines, or on the web.
When not playing, people will tend to walk around and spectate the other games (nowadays this is also trivial to do online). An experienced player can reconstruct some of the move history by just glancing at the board.
So if anything, 27.4 games watched/skimmed/experienced per day is too low for the upper estimate.
An East Asian Go pro will often have been an insei, studying Go full-time at a school, and a dedicated amateur before that, so you can imagine how many hours a day they will be studying… (The intensiveness is part of why they dominate Go to the degree they do and North Americans & Europeans are so much weaker: start a lot of kids, start them young, school them 10 hours a day for years studying games and playing against each other and pros, and keep relentlessly filtering to winnow out anyone who is not brilliant.)
I would say 100k is an overestimate, since they will tend to be more closely studying the games and commentaries, and also working out life-and-death problems, memorizing the standard openings, and whatnot, but they are definitely reading through and studying tens of thousands of games. This is similar to how one of the reasons chess players are so much better these days than even just decades ago is that computers have given access to enormous databases of games which can be studied with the help of chess AIs (Carlsen has benefited a lot from this, I understand). Also, while I’m nitpicking, AlphaGo trained both on the KGS games and on self-play; I don’t know how many games the self-play amounted to, but the appendix broke down the wall-clock times by phase, and of the 4 weeks of wall-clock time, IIRC most of it was spent on the self-play fine-tuning the value function.
But if AlphaGo is learning from games ‘only’ more efficiently than 99%+ of the humans who play Go (Fan Hui was ranked in the 600s, and there are maybe 1,000-2,000 people who earn a living as Go professionals, selected from the hundreds of thousands or millions of people who play), that doesn’t strike me as much of a slur.
For the SL phase, they trained for 340 million updates with a batch size of 16, so 5.4 billion position-updates. However, the database had only 29 million unique positions. That’s roughly 200 gradient updates per unique position.
The self-play RL phase for AlphaGo consisted of 10,000 minibatches of 128 games each, so about 1.3 million games total. They only trained that part for a day.
They spent more time training the value network: 50 million minibatches of 32 board positions, so about 1.6 billion positions. That’s still much smaller than the SL training phase.
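Tallying those three phases (minibatch counts and sizes as quoted above; this is pure arithmetic, not a claim about compute or wall-clock time):

```python
# Quick tally of the three training phases.
sl_updates, sl_batch       = 340_000_000, 16   # supervised policy network
rl_batches, rl_batch_games = 10_000, 128       # self-play RL policy
vn_batches, vn_batch       = 50_000_000, 32    # value network

sl_position_updates = sl_updates * sl_batch        # ~5.4 billion
rl_games            = rl_batches * rl_batch_games  # ~1.3 million self-play games
vn_positions        = vn_batches * vn_batch        # ~1.6 billion

unique_positions = 29_000_000
print(sl_position_updates / unique_positions)      # ~190, i.e. roughly 200 per position
print(sl_position_updates, rl_games, vn_positions)
# The SL phase sees ~3x more position-updates than the value network, hence "much smaller".
```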
The supervised part is only used for bootstrapping. The main learning happens in the self-play part.
Cite? They use the supervised network for policy selection (i.e. tree pruning) which is a critical part of the system.
I’m referring to figure 1a on page 4 and the explanation below it. I can’t be sure, but the self-play should contribute a large part of the training, and it can keep improving the algorithm even if the expert database stays fixed.
They spent three weeks to train the supervised policy and one day to train the reinforcement learning policy starting from the supervised policy, plus an additional week to extract the value function from the reinforcement learning policy (pages 25-26).
In the final system the only part that depends on RL is the value function. According to figure 4, if the value function is taken out the system still plays better than any other Go program, though worse than the human champion.
Therefore I would say that the system heavily depends on supervised training on a human-generated dataset. RL was needed to achieve the final performance, but it was not the most important ingredient.