Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2. While today we play with restrictions, we aim to beat a team of top professionals at The International in August subject only to a limited set of heroes. We may not succeed: Dota 2 is one of the most popular and complex esports games in the world, with creative and motivated professionals who train year-round to earn part of Dota’s annual $40M prize pool (the largest of any esports game).
Commentary by Sam Altman: http://blog.samaltman.com/reinforcement-learning-progress
This is the game that to me feels closest to the real world and complex decision making (combining strategy, tactics, coordination, and real-time action) of any game AI has made real progress against so far.
The agents we train consistently outperform agents trained two weeks earlier, with a win rate of 90-95%. We did this without training on human-played games: we did design the reward functions, of course, but the algorithm figured out how to play by training against itself.
This is a big deal because it shows that deep reinforcement learning can solve extremely hard problems whenever you can throw enough computing scale at them, along with a really good simulated environment that captures the problem you’re solving. We hope to use this same approach to solve very different problems soon. It’s easy to imagine this being applied to environments that look increasingly like the real world.
As a strong amateur DotA player, this seems absurdly impressive, albeit much less impressive than if they could beat humans in the full DotA ruleset. (Their restrictions seem to be avoiding much of the “long-term strategic planning” part of DotA, e.g. choosing heroes, item builds, and level of aggression in a way that makes sense against the enemy team.)
This seems way more surprising to me than beating humans at Go.
Gwern wrote a summary comment with some opinions here.
From a Hacker News comment by one of the researchers:
In other words: current algorithms, given enough compute, do seem to be able to tackle levels of sophistication (long time horizons, imperfect information, high-dimensional option spaces) that even experienced researchers wouldn’t have predicted they could. And this person suggests that they could tackle even more sophisticated problems, as long as you have a simulator for the problem domain.
One implication would be that we can solve protein folding.
Indeed, we did, with AlphaFold 2, two years ago.
Receptor-binding behavior is the next step.
I don’t know how hard it would be to do a side-by-side “FLOPS” comparison of Dota 5v5 vs. AlphaGo / AlphaZero, but it seems like they are relatively similar in terms of the computational cost required to achieve something close to “human level”. However, as many have noted, Dota is a game of vastly greater complexity because of its continuous state, partial observability, large action space, and long time horizon. So what does it mean when roughly similar orders of magnitude of compute are required to reach the same level of ability as humans, using a fairly general architecture and learning algorithm?
Some responses to AlphaGo at the time were along the lines of “Don’t worry too much about this, it looks very impressive, but the game still has a discrete action space and is fully observable, so that explains why this was easy.”
Does anyone know how this deals with partial observability? Proximal policy optimization relies on estimating Q-values, but the usual algorithms for estimating Q-values rely on being able to observe the entire state.
The algorithm seems to construct such an estimate by averaging a huge number of observations together before each update (for Dota 5v5, they say each batch is around a million observations, and I’m guessing it processes about a million batches). The surprising thing is that this works so well, and that it lets you leverage computational resources very easily.
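To make the kind of update being discussed concrete, here is a minimal sketch of a PPO-style clipped-surrogate step averaged over one large batch of sampled timesteps. The `policy.log_prob` helper and the batch layout are hypothetical; this is a sketch of the general technique, not OpenAI Five’s actual code:

```python
# Hypothetical sketch: a PPO-style update that averages the clipped surrogate
# loss over a very large batch of sampled timesteps. Names (policy, batch) are
# illustrative, not OpenAI Five's actual code.
import torch

def ppo_update(policy, optimizer, batch, clip_eps=0.2):
    # batch: dict of tensors with one row per sampled timestep
    obs        = batch["obs"]        # (N, obs_dim) observations
    actions    = batch["actions"]    # (N,) actions taken during rollouts
    old_logp   = batch["old_logp"]   # (N,) log-probs under the rollout policy
    advantages = batch["advantages"] # (N,) estimated advantages

    new_logp = policy.log_prob(obs, actions)   # (N,) hypothetical helper
    ratio = torch.exp(new_logp - old_logp)     # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages

    # Averaging over ~a million timesteps is what makes the gradient estimate
    # stable enough even though each individual observation is partial and noisy.
    loss = -torch.min(unclipped, clipped).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```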
My guess for how it deals with partial observability in a more philosophical sense is that it must store an implicit model of the world in some form, in order to better predict the reward it will eventually observe. I’m beginning to wonder whether the distinction between partial and full observability is very useful after all. Even with AlphaGo, even though it can see the whole board, there are a whole bunch of “spaces” it can’t see fully: perhaps the entire action space, the space of every possible game trajectory, or the mathematical structure of play strategies. And yet it managed to infer enough about those spaces to become good at the game.
After thinking about this more, I have a hypothesis for how it works:
It records the sequence of states resulting from play, to be used for learning when the game is done. It has some Q-value estimator which, given a state and an action, returns the expected utility of taking that action in that state (of course, this expected utility depends on the policies both players are using). Using the sequence of states and the actions taken (including exploration actions), it makes an update to this Q-value estimator.
But this Q-value estimator is not directly usable for determining a policy, since it needs to know the full state. Instead, it is used to compute a gradient update for the policy. After the game, the Q-value estimator can tell you the value of taking each possible action at each actual time step in the game (since all game states are known by then); this can be used to update the policy so that it becomes more likely to take the actions that, in hindsight, are estimated to be better.
I’m not sure if this hypothesis is correct, but I don’t currently see anything wrong with this algorithm. Thank you for causing me to think of this.
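Here is a minimal sketch of the scheme hypothesized above, written as a generic actor-critic update with Monte-Carlo returns computed once the game is over. This is only an illustration under my own assumptions (the `policy.log_prob` helper, the trajectory layout, and a single `optimizer` holding both networks’ parameters), not OpenAI’s actual training code:

```python
# Hypothetical sketch of the two-network scheme hypothesized above: a critic
# trained on full post-game information, and a policy updated toward the
# actions the critic scores highly (roughly actor-critic / REINFORCE with a
# baseline). The optimizer is assumed to hold both networks' parameters.
import torch

def end_of_game_update(policy, critic, optimizer, trajectory, gamma=0.99):
    # trajectory: list of (obs_tensor, action, reward) collected during the game
    obs     = torch.stack([t[0] for t in trajectory])    # (T, obs_dim)
    actions = torch.tensor([t[1] for t in trajectory])   # (T,)
    rewards = [t[2] for t in trajectory]

    # Discounted returns, computable only once the game is over.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))      # (T,)

    values = critic(obs).squeeze(-1)                     # (T,) baseline estimate
    advantages = (returns - values).detach()             # how much better than expected
    logp = policy.log_prob(obs, actions)                 # (T,) hypothetical helper

    # Make actions that turned out better than expected more likely,
    # and train the critic to predict the observed returns.
    loss = -(logp * advantages).mean() + (returns - values).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```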
(BTW, there is a well-defined difference between full and partial observability, see MDP vs POMDP. Convergence theorems for vanilla Q-learning require the game to be an MDP rather than a POMDP.)
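For reference, here is the standard textbook form of that distinction, and the vanilla Q-learning update whose convergence guarantees assume the MDP case (my addition, using the usual tuple notation):

```latex
% Standard textbook definitions, for reference (not specific to OpenAI Five).
% MDP: the agent observes the state s_t directly.
\mathcal{M}_{\text{MDP}} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
\qquad s_{t+1} \sim P(\cdot \mid s_t, a_t), \quad r_t = R(s_t, a_t)

% POMDP: the agent only receives an observation o_t correlated with the hidden
% state s_t, so the optimal policy generally depends on the whole history.
\mathcal{M}_{\text{POMDP}} = (\mathcal{S}, \mathcal{A}, P, R, \Omega, O, \gamma),
\qquad o_t \sim O(\cdot \mid s_t)

% Vanilla (tabular) Q-learning update, whose convergence assumes the MDP case:
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```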
I think that the “state” in this case is just the full observation history, and is therefore observable by definition. The state space is enormous, of course, but they deal with it using value and policy networks that are both RNNs (probably sharing some parameters) and that process the state as a temporal sequence.
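As a concrete illustration of that idea, here is a minimal sketch of a recurrent actor-critic whose LSTM hidden state summarizes the observation history. The layer sizes and the shared trunk are my assumptions, not OpenAI Five’s published architecture:

```python
# Hypothetical sketch of a recurrent policy/value network that treats the full
# observation history as the "state": an LSTM consumes observations one step at
# a time, and its hidden state summarizes everything seen so far.
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)  # shared trunk
        self.policy_head = nn.Linear(hidden, n_actions)         # action logits
        self.value_head = nn.Linear(hidden, 1)                  # state value

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim) -- the observation history so far
        features, hidden_state = self.lstm(obs_seq, hidden_state)
        logits = self.policy_head(features)   # (batch, time, n_actions)
        values = self.value_head(features)    # (batch, time, 1)
        return logits, values, hidden_state   # carry hidden_state forward
```

At play time you would feed one observation per tick and carry `hidden_state` forward, so the network never needs direct access to the hidden game state.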
As someone who is just starting on machine learning as an autodidact, here are my thoughts on what’s happening (probably an incoherent ramble).
I wouldn’t have expected gradient descent to find particularly great solutions for Dota. When OpenAI released the 1v1 bot, with all its restrictions, I figured that was about the maximum achievable through gradient descent. I also figured that if they managed to do more in the following years (like they just did), and if someone then managed to make a similar architecture work for an even slightly broader range of games, e.g. first-person shooters (similarly to the step from AlphaGo to AlphaZero), call this architecture GamesZero, then GamesZero would basically have to be equivalent to a full human-level AGI. (Even in that case, I expected the 5v5 bot to come out after 3-5 more years, if at all, not one year.)
I also had some confusion at the time, which I guess I still have although it’s gradually dissolving: if GamesZero could come out of gradient descent, then gradient descent itself also had to be an AGI, in the sense that, e.g., our brains do something very similar to gradient descent, or that any AGI must implement some sort of gradient descent; it just wasn’t behaving like one yet because we needed faster algorithms or something.
My current opinion is a bit different. Gradient descent is an optimization process more similar to natural selection than to what our brains do. GamesZero can come out of gradient descent in the same way that human brains came out of natural selection, and while all four of them can be called optimization processes, there is a sense in which I would put human brains and GamesZero in the “AGI category” (whatever that means), while I would put natural selection and gradient descent in the “blind idiot god category”. Which is to say, I don’t expect either human brains or GamesZero to make use of gradient descent, any more than I expect human brains to make use of simulated natural selection. But of course I could be horribly wrong.
Now that gradient descent has accomplished 30% of what I thought was sufficient to make an AGI, something I thought it could never do, I wouldn’t be surprised if GamesZero comes out in two years, although I would be panicked. (I will be surprised, in addition to panicked, if it comes out next year.)
The question then is: is GamesZero an AGI?
What makes me suspect that the answer is yes is the categorization I mentioned earlier. If GamesZero can work in a variety of games ranging from MOBAs to FPSs, then I also expect it to work in the outside world, similarly to how human brains could only operate in certain environments and only behave in certain ways before the discovery of fire (or whatever the critical historical point was), and then natural selection made something happen that made them able, in principle, to go to the moon.
The thing you are describing as GamesZero would almost certainly require quite a lot of game-specific training data (OpenAI Five required hundreds of years of training data). Getting this much data is completely impractical in the real world; this is the main reason why deep learning is not very successful in robotics.