“AlphaStar: Mastering the Real-Time Strategy Game StarCraft II”, DeepMind [won 10 of 11 games against human pros]
My overall impression looking at this is still more or less summed up by what Francois Chollet said a bit ago.
Some of the stuff Deepmind talks about a lot—so, for instance, the AlphaLeague—seems like a clever technique simply designed to ensure that you have a sufficiently dense sampling of the space, which would normally not occur in a game with unstable equilibria. And this seems to me more like “clever technique applicable to domain where we can generate infinite data through self-play” than “stepping stone on way to AGI.”
That being said, I haven’t yet read through all the papers in the blog post, and I’d be curious which of them people think might be (or definitely are) potential steps towards actually engineering intelligence.
Before now, it wasn’t immediately obvious that SC2 is a game that can be played superhumanly well without anything that looks like long-term planning or counterfactual reasoning. The way humans play it relies on a combination of past experience, narrow skills, and “what-if” mental simulation of the opponent. Building a superhuman SC2 agent out of nothing more than LSTM units indicates that you can completely do away with planning, even when the action space is very large, even when the state space is VERY large, even when the possibilities are combinatorially enormous. Yes, humans can get good at SC2 with much less than 200 years of time played (although those humans are usually studying the replays of other masters to bootstrap their understanding) but I think it’s worthwhile to focus on the inverse of this observation: that a sophisticated problem domain which looks like it ought to require planning and model-based counterfactual reasoning actually requires no such thing. What other problem domains seem like they ought to require planning and counterfactual reasoning, but can probably be conquered with nothing more advanced than a deep LSTM network?
(I haven’t seen anyone bother to compute an estimate of the size of the state-space of SC2 relative to, for example, Go or Chess, and I’m not sure if there’s even a coherent way to go about it.)
Now that is a metric I would be interested to see. It feels like the answer is obviously that there is a coherent way to go about it, otherwise the same techniques could not have been used to explore both spaces.
I wonder if they could just have AlphaStar count states as it goes.
The best I can do after thinking about it for a bit is compute every possible combination of units under 200 supply, multiply that by the possible positions of those units in space, multiply that by the possible combinations of buildings on the map and their potential locations in space, multiply that by the possible combinations of upgrades, multiply that by the amount of resources in all available mineral/vespene sources … I can already spot a few oversimplifications in what I just wrote, and I can think of even more things that need to be accounted for. The shields/hitpoints/energy of every unit. Combinatorially gigantic.
Just the number of potential positions of a single unit on the map is already huge.
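As a toy illustration of how quickly this blows up, here is a back-of-envelope lower bound that counts nothing but unit positions on a coarse grid; every parameter in it is an invented assumption for illustration, not anything taken from the game engine or the paper:

```python
import math

# Assumptions for illustration only: a coarse 128x128 placement grid,
# ~100 distinguishable units per player, 2 players, positions may repeat.
GRID_CELLS = 128 * 128
UNITS_PER_PLAYER = 100
PLAYERS = 2

log10_states = UNITS_PER_PLAYER * PLAYERS * math.log10(GRID_CELLS)
print(f"~10^{log10_states:.0f} states from unit positions alone")
# => roughly 10^843, before hit points, shields, energy, buildings,
#    upgrades, resources, or fog-of-war are considered at all.
# Commonly cited comparisons: ~10^170 legal Go positions, ~10^47 for chess.
```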
But AlphaStar doesn’t really explore much of this space. It finds out pretty quickly that there’s really no reason to explore the parts of the space the include building random buildings in weird map locations. It explores and optimizes around the parts of the state space that look reasonably close to human play, because that was its starting point, and it’s not going to find superior strategies randomly, not without a lot of optimization in isolation.
That’s one thing I would love to see, actually. A version of the code trained purely on self-play, without a basis in human replays. Does it ever discover proxy plays or other esoteric cheese without a starting point provided in the human replays?
I expect that will be the next step; it was how they approached the versions of Alpha Go too.
This assumes that human intelligence arises from something other than training on a very large dataset of books, movies, conversations with parents, etc.
It’s worth noting that NLP took a big leap in 2018 through simple unsupervised/predictive training on large text corpuses to build text embeddings which encode a lot of semantic knowledge about the world.
A passage about AI safety from the blog: “We also think some of our training methods may prove useful in the study of safe and robust AI. One of the great challenges in AI is the number of ways in which systems could go wrong, and StarCraft pros have previously found it easy to beat AI systems by finding inventive ways to provoke these mistakes. AlphaStar’s innovative league-based training process finds the approaches that are most reliable and least likely to go wrong. We’re excited by the potential for this kind of approach to help improve the safety and robustness of AI systems in general, particularly in safety-critical domains like energy, where it’s essential to address complex edge cases.”
Also, they said that each agent used 16 TPUv3s, and the graph in the article indicates that at the end there were 600 agents. Based on the TPUv3’s declared performance of 420 teraflops, at the end the system was consuming about 4 exaflops, with a median of roughly 2 exaflops over the 14 days, which works out to about 28,000 petaflops-days of compute. AlphaGo Zero consumed 1,800 petaflops-days according to OpenAI, but did so around 13 months before AlphaStar. This means that the trend of a 3.5-month doubling time of compute for the most complex experiments continues.
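For what it’s worth, the arithmetic behind those figures can be reproduced directly from the numbers quoted above; the assumption that the median load is half of peak is carried over from the comment, and is an assumption rather than a reported figure:

```python
import math

# Reproducing the compute estimate above from its stated inputs.
agents = 600                 # from the graph in the article
tpus_per_agent = 16          # TPUv3s per agent, as stated by DeepMind
tflops_per_tpu = 420         # declared TPUv3 performance, in teraflops

peak_pflops = agents * tpus_per_agent * tflops_per_tpu / 1000   # ~4032 PF
median_pflops = peak_pflops / 2          # assume median is half of peak
days = 14
pflops_days = median_pflops * days       # ~28,000 petaflops-days

alphago_zero_pflops_days = 1_800         # OpenAI's estimate for AlphaGo Zero
months_between = 13
doublings = math.log2(pflops_days / alphago_zero_pflops_days)
print(f"peak ~{peak_pflops:.0f} PF, total ~{pflops_days:,.0f} PF-days, "
      f"doubling time ~{months_between / doublings:.1f} months")
# => peak ~4032 PF (~4 exaflops), ~28,224 PF-days,
#    doubling time ~3.3 months, consistent with the 3.5-month trend.
```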
“Go wrong” is still tied to the game’s win condition. So while the league-based training process does find the set of agents whose gameplay is least exploitable (among all the agents they trained), it’s not obvious how this relates to problems in AGI safety such as goal specification or robustness to capability gains. Maybe they’re thinking of things like red teaming. But without more context I’m not sure how safety-relevant this is.
There’s also the CPU. Those <=200 years of SC2 simulations per agent aren’t free. OA5, recall, was '256 GPUs and 128,000 CPU cores'. (Occasionally, training a small NN update is easier than running the many games necessary to get the experience that decides what tweak to make.)
Is anyone else surprised by how little reward shaping/engineering was needed here? Did DM use some other tricks to help the agents learn from a relatively sparse reward signal, or was it just a numbers game (if you train the agents enough, even a sparse signal would be enough)?
I was waiting for one of the DM researchers to answer the question about reward shaping, and Oriol Vinyals just said:
DM Q&A: https://www.reddit.com/r/MachineLearning/comments/ajgzoc/we_are_oriol_vinyals_and_david_silver_from/
Video: https://www.youtube.com/watch?v=cUTMhmVh1qs
/r/reinforcementlearning discussion: https://www.reddit.com/r/reinforcementlearning/comments/ajeg5m/deepminds_alphastar_starcraft_2_demonstration/
How long do handicaps take to overcome, though? I find it hard to imagine that the difference between, e.g., a 500 APM average and a 500 APM hard ceiling requires a whole new insight for the agent to be “clever” enough to win anyway; it probably just needs more training.
1) Their goal was a really good bot—and that hadn’t been done before (apparently). To implement handicaps to begin with would have been… very optimistic.
2) They don’t know what will work for sure until they try it.
3) Expense. (Training takes time and money.)
As noted in e.g. the conversation between Wei Dai and me elsewhere in this thread, it’s quite plausible that people thought beforehand that the current APM limits were fair (DeepMind apparently consulted pro players on them). Maybe AlphaStar needed to actually play a game against a human pro before it became obvious that it could be so overwhelmingly powerful with the current limits.
Well, apparently that’s exactly what happened with TLO and MaNa, and then the DeepMind guys were (at least going by their own accounts) excited about the progress they were making and wanted to share it, since being able to beat human pros at all felt like a major achievement. Like they could just have tested it in private and continued working on it in secret, but why not give a cool event to the community while also letting them know what the current state of the art is.
E.g. some comments from here:
And the top comment from that thread:
That excuse is incompatible with the excuse that they thought the APM limit was fair. Either they thought it was a fair test at the time of publicizing, or they got so excited about even passing an unfair test that they wanted to share it, but not both.
I was using “fair” to mean something like “still made for an interesting test of the system’s capabilities”. Under that definition, the explanations seem entirely compatible—they thought that it was an interesting benchmark to try, and also got excited about the results and wanted to share them because they had run a test which showed that the system had passed an interesting benchmark.
It seems like you’re talking about fairness in a way that isn’t responsive to Rekrul’s substantive point, which was about whether the test was unevenly favoring the AI on abilities like extremely high APM, unrelated to what we’d intuitively consider thinking, not about whether the test was generally uninformative.
Oh, I thought it was already plainly obvious to everyone that the victories were in part because of unfair AI advantages. We don’t need to discuss the APM cap for that, that much was already clear from the fact that the version of AlphaStar which had stricter vision limitations lost to MaNa.
That just seems like a relatively uninteresting point to me, since this looks like AlphaStar’s equivalent of the Fan Hui game. That is, it’s obvious that AlphaStar is still below the level of top human pros and wouldn’t yet beat them without unfair advantages, but if the history with AlphaGo is any guide, it’s only a matter of some additional tweaking and throwing more compute at it before it’s at the point where it will beat even the top players while having much stricter limitations in place.
Saying that its victories were *unrelated* to what we’d intuitively consider thinking seems too strong, though. I’m not terribly familiar with SC2, but a lot of the discussion that I’ve seen seems to tend towards AlphaStar’s macro being roughly on par with the pro level, and its superior micro being what ultimately carried the day. E.g. people focus a lot on that one game that MaNa arguably should have won but only lost due to AlphaStar having superhuman micro of several different groups of Stalkers, but I haven’t seen it suggested that all of its victories would have been attributable to that alone: I didn’t get that kind of a vibe from MaNa’s own post-game analysis of those matches, for instance. Nor from TLO’s equivalent analysis (lost the link, sorry) of his matches, where he IIRC only said something like “maybe they should look at the APM limits a bit [for future matches]”.
So it seems to me that even though its capability at what we’d intuitively consider thinking wouldn’t have been enough for winning all the matches, it would still have been good enough for winning several of the ones where people aren’t pointing to the superhuman micro as the sole reason of the victory.
This was definitely not initially obvious to everyone, and I expect many people still have the impression that the victories were not due to unfair AI advantages. I think you should double crux with Raemon on how many words people can be expected to read.
I mostly agree with this comment. My speculative best guess is that the main reason MaNa did better against the revised version of AlphaStar wasn’t due to the vision limitations, but rather some combination of:
MaNa had more time to come up with a good strategy and analyze previous games.
MaNa had more time to warm up, and was generally in a better headspace.
The previous version of AlphaStar was unusually good, and the new version was an entirely new system, so the new version regressed to the mean a bit. (On the dimension “can beat human pros”, even though it was superior on the dimension “can beat other AlphaStar strategies”.)
At 277, AlphaStar’s average APM was lower than both MaNa’s and TLO’s.
The APM comparison is misleading.
Interesting analysis here:
Number 3 is an interesting claim, but I would assume that, if this is true and DeepMind are aware of this, they would just find a way to erase the spam clicks from the human play database.
Yes, if it’s as simple as ‘spam clicks from imitation learning are too hard to wash out via self-play given the weak APM limits’, it should be relatively easy to fix. Add a very tiny penalty for each click to incentivize efficiency, or preprocess the replay dataset—if a ‘spam click’ does nothing useful, it seems like it should be possible to replay through all the games, track what clicks actually result in a game-play difference and what clicks are either idempotent (eg multiple clicks in the same spot) or cancel out (eg a click to go one place which is replaced by a click to go another place before the unit has moved more than epsilon distance), and filter out the spam clicks.
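A minimal sketch of what that replay-preprocessing idea might look like; the Action record and the thresholds below are hypothetical stand-ins for illustration, not the pysc2 API:

```python
from dataclasses import dataclass

@dataclass
class Action:
    frame: int        # game frame the click was issued on
    unit_id: int      # unit (or control group) receiving the order
    kind: str         # e.g. "move", "attack"
    target: tuple     # (x, y) map coordinates

def filter_spam(actions, min_frames_to_take_effect=4):
    """Drop clicks that are idempotent repeats of the previous order, and
    drop earlier orders that were superseded before they could take effect."""
    kept = []
    last = {}  # unit_id -> last kept Action for that unit
    for a in actions:
        prev = last.get(a.unit_id)
        if prev is not None:
            same_order = (prev.kind, prev.target) == (a.kind, a.target)
            superseded_quickly = a.frame - prev.frame < min_frames_to_take_effect
            if same_order:
                continue              # repeated identical click: spam
            if superseded_quickly:
                kept.remove(prev)     # earlier order never took effect: spam
        kept.append(a)
        last[a.unit_id] = a
    return kept
```

A real pass over replays would have to consult the observed game state to decide whether an order actually “took effect”, but the shape of the filter would be similar.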
Interesting. The article didn’t mention that.
Yeah, DeepMind seems to be behaving unethically here (i.e., being deliberately deceptive), since a bunch of people there must understand Starcraft II well enough to know that the APM graph/comparison is misleading, but the blog post still used it prominently to try to show AlphaStar in a better light. Hopefully it’s some sort of misunderstanding or honest mistake that they’ll clear up during the Reddit AMA, but if not it seems like another bad sign for the future of AI governance, comparable to the Baidu cheating scandal.
Apparently as a result of this analysis, DeepMind has edited the caption in the graph:
They are now explaining this in the Reddit AMA: “We are capping APM. Blizzard in game APM applies some multipliers to some actions, that’s why you are seeing a higher number. https://github.com/deepmind/pysc2/blob/master/docs/environment.md#apm-calculation”
“We consulted with TLO and Blizzard about APMs, and also added a hard limit to APMs. In particular, we set a maximum of 600 APMs over 5 second periods, 400 over 15 second periods, 320 over 30 second periods, and 300 over 60 second period. If the agent issues more actions in such periods, we drop / ignore the actions. ”
“Our network has about 70M parameters.”
AMA: https://www.reddit.com/r/MachineLearning/comments/ajgzoc/we_are_oriol_vinyals_and_david_silver_from/eexs0pd/?context=3
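For concreteness, a multi-window cap like the one quoted above could be enforced with a simple rate limiter along these lines; the (window, APM) pairs come from the AMA quote, while everything else is an illustrative assumption rather than DeepMind’s actual implementation:

```python
from collections import deque

CAPS = [(5, 600), (15, 400), (30, 320), (60, 300)]  # (window seconds, APM limit)

class ApmLimiter:
    def __init__(self, caps=CAPS):
        # Convert each APM limit into a max action count for its window.
        self.caps = [(w, int(apm * w / 60)) for w, apm in caps]
        self.history = deque()  # timestamps (seconds) of accepted actions

    def allow(self, t):
        """Return True if an action issued at time t fits every window cap."""
        horizon = max(w for w, _ in self.caps)
        while self.history and t - self.history[0] > horizon:
            self.history.popleft()
        for window, max_actions in self.caps:
            recent = sum(1 for ts in self.history if t - ts <= window)
            if recent >= max_actions:
                return False  # drop / ignore the action, as described in the AMA
        self.history.append(t)
        return True
```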
That explanation doesn’t really absolve them very much in my mind. If you read people’s responses to that comment on Reddit, it seems clear that AlphaStar still largely won by being faster and more accurate at crucial moments (a human couldn’t duplicate the strategies AlphaStar used because they can’t perform 50 accurate actions in 5 seconds), and the APM comparison graph was trying to make people think the opposite. See this one for example:
Is there a reason to assume that DeepMind is being intentionally deceptive rather than making a good-intentioned mistake? After all, they claim that they consulted with TLO on the APM rate, which is something that they seem unlikely to lie about, since it would be easy for TLO to dispute that if it was untrue. So presumably TLO felt like the caps that they instituted were fair, and it would have been reasonable for DeepMind to trust a top player’s judgment on that, none of them (AFAIK) being top players themselves. And even with the existing caps, people felt like the version which played TLO would likely have lost to MaNa, and the camera-limited version did actually lose to him. So it seems reasonable for someone to think that a good enough player would still beat versions of AlphaStar even with the existing limits, and that they didn’t thus give it an unfair advantage.
On the other hand, while I can think of plenty of reasons for why DeepMind might have honestly thought that this setup was fair, I can’t think of any good reason for them to decide to be intentionally deceptive in this way. They’ve been open about their agent currently only playing Protoss vs. Protoss on a single map and of earlier versions seeing the whole map, and five of their games were played against a non-Protoss player. If they honestly felt that they only won because of the higher APM rate, then I don’t see why they wouldn’t just admit that the same way they’re forthcoming about all of the system’s other limitations.
I think it’s quite possible that when they instituted the cap they thought it was fair, however from the actual gameplay it should be obvious to anyone who is even somewhat familiar with Starcraft II (e.g., many members of the AlphaStar team) that AlphaStar had a large advantage in “micro”, which in part came from the APM cap still allowing superhumanly fast and accurate actions at crucial times. It’s also possible that the blogpost and misleading APM comparison graph were written by someone who did not realize this, but then those who did realize should have objected to it and had it changed after they noticed.
Possibly you’re not objecting to what I’m suggesting above but to the usage of “intentionally deceptive” to describe it. If so, I think you may have a point in that there may not have been any single human who had an intention to be deceptive (e.g., maybe those at DM who realized that the graph would be misleading lacked the power or incentives to make changes to it), in which case perhaps it doesn’t make sense to attribute deceptive intention to the organization. Maybe “knowingly misleading” would be a better phrase there to describe the ethical failure (assuming you agree that the episode most likely does constitute a kind of ethical failure)?
ETA: I note that many people on places like Hacker News and Reddit have pointed out the misleading nature of the APM comparison graph, with virtually no one defending DM on that point, so anyone at DM who is following those discussions should have realized it by now, but no one has come out and either admitted a mistake or offered a reasonable explanation. Multiple people have also decided that it was intentional deception (which does seem like a strong possibility to me too since I think there’s a pretty high prior that the person who wrote the blog post would not be so unfamiliar with Starcraft II). Another piece of evidence that I noticed is that during the Youtube video one of the lead researchers talked about the APM comparison with the host, and also said at some point that he used to play Starcraft II. One Redditor describes it as “It’s not just the graphs, but during the conversation with Artosis the researcher was manipulating him.”
It’s not so obvious to me that someone who realizes that AlphaStar is superior at “micro” should have objected to those graphs.
Think about it like this—you’re on the DeepMind team, developing AlphaStar, and the whole point is to make it superhuman at StarCraft. So there’s going to be some part of the game that it’s superhuman at, and to some extent this will be “unfair” to humans. The team decided to try not to let AlphaStar have “physical” advantages, but I don’t see any indication that they explicitly decided that it should not be better at “micro” or unit control in general, and should only win on “strategy”.
Also, separating “micro” from “strategy” is probably not that simple for a model-free RL system like this. So I think they made a very reasonable decision to focus on a relatively easy-to-measure APM metric. When the resulting system doesn’t play exactly as humans do, or in a way that would be easy for humans to replicate, to me it doesn’t seem so-obvious-that-you’re-being-deceptive-if-you-don’t-notice-it that this is “unfair” and that you should go back to the drawing board with your handicapping system.
It seems to me that which ways for AlphaStar to be superhuman are “fair” or “unfair” is to some extent a matter of taste, and there will be many cases that are ambiguous. To give a non “micro” example—suppose AlphaStar is able to better keep track of exactly how many units its opponent has (and at what hit point levels) throughout the game, than a human can, and this allows it to make just slightly more fine-grained decisions about which units it should produce. This might allow it to win a game in a way that’s not replicable by humans. It didn’t find a new strategy—it just executed better. Is that fair or unfair? It feels maybe less unfair than just being super good at micro, but exactly where the dividing line is between “interesting” and “uninteresting” ways of winning seems not super clear.
Of course, now that a much broader group of StarCraft players has seen these games, and a consensus has emerged that this super-micro does not really seem fair, it would be weird if DeepMind did not take that into account for its next release. I will be quite surprised if they don’t adjust their setup to reduce the micro advantage going forward.
This is not the complaint that people (including me) have. Instead the complaint is that, given it’s clear that AlphaStar won mostly through micro, that graph highlighted statistics (i.e., average APM over the whole game, including humans spamming keys to keep their fingers warm) that would be irrelevant to SC2 experts for judging whether or not AlphaStar did win through micro, but would reliably mislead non-experts into thinking “no” on that question. Both of these effects should have been easy to foresee.
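To make the “misleading average” point concrete, here is a toy comparison (with no real data behind it) between the whole-game average the graph highlighted and the peak APM sustained over a short burst; two players can match on the first while differing several-fold on the second:

```python
def average_apm(action_times, game_length_s):
    """Whole-game average: the kind of statistic the graph highlighted."""
    return 60 * len(action_times) / game_length_s

def peak_burst_apm(action_times, window_s=5):
    """Highest APM sustained in any single window_s-second window,
    which is what matters in a decisive fight."""
    times = sorted(action_times)
    best = 0
    for i, start in enumerate(times):
        best = max(best, sum(1 for t in times[i:] if t - start <= window_s))
    return 60 * best / window_s
```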
They’ve received a bunch of favorable publicity as a result.
It seems very unlikely to me that they would have gotten any less publicity if they’d reported the APM restrictions any differently. (After all, they didn’t get any less publicity for reporting the system’s other limitations either, like it only being able to play Protoss v. Protoss on a single map, or 10 of the 11 agents having whole-map rather than camera-limited vision.)
They might well have gotten less publicity due to emphasizing those facts as much as they did.
To me DeepMind is simply trying to paint themselves in the best light. I’m not particularly surprised by the behavior; I would expect it from a for-profit company looking to get PR. Nor am I particularly upset about the behavior; I don’t see any outright lying going on, merely an attempt to frame the facts in the best possible way for them.
Ok, I read it too after my comment above… And I thought: when a future evil superintelligence starts shooting at people in the streets, the same commenters will say: “No, it is not a superintelligence, it is just good at the tactical use of guns; it knows where humans are located and never misses, but its strategy is awful.” Or, in other words:
Weak strategy + perfect skills = dangerous AI
What especially worries me is that the same type of cheating could happen during the safety evaluation of future, more advanced AIs.
Could we train AlphaZero on all games it could play at once, then find the rule set its learning curve looks worst on?
Why would you want to do that?
Because “unfortunately” we are out of boardgames, and this might find another one.
We’re not out. Certainly we’re not out of games, e.g. Magic: The Gathering, which would be a big leap.
For actual basic board games, the one I want to see is Stratego, actually; the only issue is I don’t know if there are humans who have bothered to master it.
Looks like you can watch the game vs TLO here: https://www.youtube.com/watch?v=DpRPfidTjDA
I can’t find the later games vs MaNa yet.
https://www.youtube.com/watch?v=HcZ48JDamyk&feature=youtu.be is at least one.