How long do handicaps take to overcome, though? I find it hard to imagine that the difference between eg 500 APM average or 500 APM hard ceiling requires a whole new insight for the agent to be “clever” enough to win anyways—maybe just more training.
1) Their goal was a really good bot—and that hadn’t been done before (apparently). To implement handicaps to begin with would have been… very optimistic.
2) They don’t know for sure what will work until they try it.
3) Expense. (Training takes time and money.)
As noted in e.g. the conversation between Wei Dai and me elsewhere in this thread, it’s quite plausible that people thought beforehand that the current APM limits were fair (DeepMind apparently consulted pro players on them). Maybe AlphaStar needed to actually play a game against a human pro before it became obvious that it could be so overwhelmingly powerful with the current limits.
Well, apparently that’s exactly what happened with TLO and MaNa, and then the DeepMind guys were (at least going by their own accounts) excited about the progress they were making and wanted to share it, since being able to beat human pros at all felt like a major achievement. Like they could just have tested it in private and continued working on it in secret, but why not give a cool event to the community while also letting them know what the current state of the art is.
E.g. some comments from here:
I am an administrator in the SC2 AI discord, and we’ve been running SC2 bot vs bot leagues for many years now. Last season we had over 50 different bots/teams with prizes exceeding thousands of dollars in value, so we’ve seen what’s possible in the AI space.
I think the comments made in this subreddit, especially with regards to the micro part, left a bit of a sour taste in my mouth, since there seems to be the ubiquitous notion that “a computer can always out-micro an opponent”. That simply isn’t true. We have multiple examples of that in our own bot ladder, with bots achieving 70k APM or higher and still losing to superior decision making. We have a bot that performs god-like reaper micro, and you can still win against it. And those bots are made by researchers, excellent developers, and people well acquainted with the field. It’s very difficult to code proper micro, since it doesn’t only pertain to shooting and retreating on cooldown, but also to knowing when to engage and disengage, when to group your units, what to focus on, which angle to come from, which retreat options you have, etc. Those decisions are not APM based. In fact, those are challenges that haven’t been solved in the 10 years since the Broodwar API came out—and last Thursday marks the first time that an AI got close to achieving that! For that alone the results are an incredible achievement.
And all that aside—even with inhuman APM—the results are astonishing. I agree that the presentation could have been a bit less “sensationalist”, since it created the feeling of “we cracked SC2” and many people got defensive about that (understandably, because it’s far from cracked). However, you should know that the whole show was put together in less than a week and they almost decided on not doing it at all. I for one am very happy that they went through with it.
And the top comment from that thread:
Thank you for saying this. A decent-sized community of hobbyists and researchers have been working on this for YEARS, and the conversation has really never been about whether or not bots can beat humans “fairly”. In the little documentary segment, they show a scene where TLO says (summarized) “This is my off race, but I’m still a top player. If they’re able to beat me, I’ll be really surprised.”
That isn’t him being pompous, that’s completely reasonable. AI has never even come CLOSE to this level for playing starcraft. The performance of AlphaStar in game 3 against MaNa left both Artosis AND MaNa basically speechless. It’s incredible that they’ve come this far in such a short amount of time. We’ve literally gone from “Can an AI play SC2 at a high level AT ALL” to “Can an AI win ‘fairly’”. That’s a non-trivial change in discourse that’s being completely brushed over IMO.
That excuse is incompatible with the excuse that they thought the APM limit was fair. Either they thought it was a fair test at the time of publicizing, or they got so excited about even passing an unfair test that they wanted to share it, but not both.
I was using “fair” to mean something like “still made for an interesting test of the system’s capabilities”. Under that definition, the explanations seem entirely compatible—they thought that it was an interesting benchmark to try, and also got excited about the results and wanted to share them because they had run a test which showed that the system had passed an interesting benchmark.
It seems like you’re talking about fairness in a way that isn’t responsive to Rekrul’s substantive point, which was about whether the test was unevenly favoring the AI on abilities like extremely high APM, unrelated to what we’d intuitively consider thinking, not about whether the test was generally uninformative.
Oh, I thought it was already plainly obvious to everyone that the victories were in part because of unfair AI advantages. We don’t need to discuss the APM cap for that, that much was already clear from the fact that the version of AlphaStar which had stricter vision limitations lost to MaNa.
That just seems like a relatively uninteresting point to me, since this looks like AlphaStar’s equivalent of the Fan Hui game. That is, it’s obvious that AlphaStar is still below the level of top human pros and wouldn’t yet beat them without unfair advantages, but if the history with AlphaGo is any guide, it’s only a matter of some additional tweaking and throwing more compute at it before it’s at the point where it will beat even the top players while having much stricter limitations in place.
Saying that its victories were *unrelated* to what we’d intuitively consider thinking seems too strong, though. I’m not terribly familiar with SC2, but a lot of the discussion that I’ve seen seems to tend towards AlphaStar’s macro being roughly on par with the pro level, and its superior micro being what ultimately carried the day. E.g. people focus a lot on that one game that MaNa arguably should have won but only lost due to AlphaStar having superhuman micro of several different groups of Stalkers, but I haven’t seen it suggested that all of its victories would have been attributable to that alone: I didn’t get that kind of a vibe from MaNa’s own post-game analysis of those matches, for instance. Nor from TLO’s equivalent analysis (lost the link, sorry) of his matches, where he IIRC only said something like “maybe they should look at the APM limits a bit [for future matches]”.
So it seems to me that even though its capability at what we’d intuitively consider thinking wouldn’t have been enough for winning all the matches, it would still have been good enough for winning several of the ones where people aren’t pointing to the superhuman micro as the sole reason of the victory.
This was definitely not initially obvious to everyone, and I expect many people still have the impression that the victories were not due to unfair AI advantages. I think you should double crux with Raemon on how many words people can be expected to read.
I mostly agree with this comment. My speculative best guess is that the main reason MaNa did better against the revised version of AlphaStar wasn’t due to the vision limitations, but rather some combination of:
MaNa had more time to come up with a good strategy and analyze previous games.
MaNa had more time to warm up, and was generally in a better headspace.
The previous version of AlphaStar was unusually good, and the new version was an entirely new system, so the new version regressed to the mean a bit. (On the dimension “can beat human pros”, even though it was superior on the dimension “can beat other AlphaStar strategies”.)
At 277, AlphaStar APM was, on average, lower than both MaNa and TLO’s.
The APM comparison is misleading. Interesting analysis here:
I will try to make a convincing argument for the following:
1. AlphaStar played with superhuman speed and precision.
2. Deepmind claimed to have restricted the AI from performing actions that would be physically impossible for a human. They have not succeeded in this and are most likely aware of it.
3. The reason why AlphaStar performs at superhuman speeds is most likely its inability to unlearn the human players’ tendency to spam-click. I suspect Deepmind wanted to restrict it to a more human-like performance but are simply not able to. It’s going to take us some time to work our way to this point, but it is the whole reason why I’m writing this, so I ask you to have patience.
Number 3 is an interesting claim, but I would assume that, if this is true and DeepMind are aware of this, they would just find a way to erase the spam clicks from the human play database.
Yes, if it’s as simple as ‘spam clicks from imitation learning are too hard to wash out via self-play given the weak APM limits’, it should be relatively easy to fix. Add a very tiny penalty for each click to incentivize efficiency, or preprocess the replay dataset—if a ‘spam click’ does nothing useful, it seems like it should be possible to replay through all the games, track what clicks actually result in a game-play difference and what clicks are either idempotent (eg multiple clicks in the same spot) or cancel out (eg a click to go one place which is replaced by a click to go another place before the unit has moved more than epsilon distance), and filter out the spam clicks.
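The preprocessing idea above can be sketched roughly as follows. This is only an illustration, not anything DeepMind has published: it assumes replay actions have already been flattened into simple records (the field names here are hypothetical; a real replay parser such as pysc2 exposes far richer structures), and it assumes a precomputed `moved` field giving how far the unit travelled before its next order.

```python
# Sketch of filtering spam clicks from replay data, per the proposal above:
# drop clicks that are idempotent (repeat the same order) or that were
# cancelled (overridden before the unit moved more than epsilon).

MOVE_EPSILON = 0.5  # distance a unit must travel before an order "counts"

def filter_spam_clicks(actions):
    """Drop clicks that provably had no gameplay effect.

    actions: list of dicts with (hypothetical) keys
      't'      - timestamp in seconds
      'unit'   - id of the commanded unit/group
      'cmd'    - command name, e.g. 'move', 'attack'
      'target' - (x, y) target coordinates
      'moved'  - distance travelled before the next command (assumed
                 precomputed by replaying the game)
    """
    kept = []
    for a in actions:
        if kept:
            prev = kept[-1]
            same = prev['unit'] == a['unit'] and prev['cmd'] == a['cmd']
            if same and prev['target'] == a['target']:
                # Idempotent: clicking the same spot again does nothing.
                continue
            if same and prev.get('moved', 0.0) < MOVE_EPSILON:
                # Cancelled out: the earlier order was overridden before
                # the unit moved more than epsilon, so drop the earlier one.
                kept.pop()
        kept.append(a)
    return kept
```

In practice deciding whether a click "does nothing useful" requires actually re-simulating the game state, which is what makes this cleanup more work than it first appears.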
Yeah, DeepMind seems to be behaving unethically here (i.e., being deliberately deceptive), since a bunch of people there must understand Starcraft II well enough to know that the APM graph/comparison is misleading, but the blog post still used it prominently to try to show AlphaStar in a better light. Hopefully it’s some sort of misunderstanding or honest mistake that they’ll clear up during the Reddit AMA, but if not it seems like another bad sign for the future of AI governance, comparable to the Baidu cheating scandal.
Apparently as a result of this analysis, DeepMind has edited the caption in the graph:
The distribution of AlphaStar’s APMs in its matches against MaNa and TLO and the total delay between observations and actions. CLARIFICATION (29/01/19): TLO’s APM appears higher than both AlphaStar and MaNa because of his use of rapid-fire hot-keys and use of the “remove and add to control group” key bindings. Also note that AlphaStar’s effective APM bursts are sometimes higher than both players.
They are now explaining this in the reddit AMA: “We are capping APM. Blizzard in game APM applies some multipliers to some actions, that’s why you are seeing a higher number. https://github.com/deepmind/pysc2/blob/master/docs/environment.md#apm-calculation”
“We consulted with TLO and Blizzard about APMs, and also added a hard limit to APMs. In particular, we set a maximum of 600 APMs over 5 second periods, 400 over 15 second periods, 320 over 30 second periods, and 300 over 60 second period. If the agent issues more actions in such periods, we drop / ignore the actions.”
“Our network has about 70M parameters.”
AMA: https://www.reddit.com/r/MachineLearning/comments/ajgzoc/we_are_oriol_vinyals_and_david_silver_from/eexs0pd/?context=3
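DeepMind hasn’t published the enforcement code, but the layered cap described in that quote can be sketched as a set of trailing-window limits with excess actions dropped. One assumption here: I read “600 APMs over 5 second periods” as the 600-per-minute rate sustained for 5 seconds, i.e. 50 actions in any trailing 5-second window (and likewise 100/15 s, 160/30 s, 300/60 s).

```python
from collections import deque

# Sketch of a multi-window APM cap: each trailing window has a maximum
# action count, and any action that would exceed a window's cap is dropped.
# (window_seconds, max_actions) derived from the quoted APM rates:
# 600 APM * 5s = 50, 400 APM * 15s = 100, 320 APM * 30s = 160, 300 APM * 60s = 300.
LIMITS = [(5, 50), (15, 100), (30, 160), (60, 300)]

class ApmLimiter:
    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.history = deque()  # timestamps of accepted actions

    def try_act(self, t):
        """Return True if an action at time t is allowed; False = dropped."""
        # Forget actions older than the longest window.
        horizon = max(w for w, _ in self.limits)
        while self.history and t - self.history[0] >= horizon:
            self.history.popleft()
        for window, cap in self.limits:
            recent = sum(1 for ts in self.history if t - ts < window)
            if recent >= cap:
                return False  # this window's cap is full: drop the action
        self.history.append(t)
        return True
```

Note how this scheme still permits the short superhuman bursts discussed below: a fresh 5-second window always allows up to 50 actions, regardless of how low the long-run average is.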
That explanation doesn’t really absolve them very much in my mind. If you read people’s responses to that comment on Reddit, it seems clear that AlphaStar still largely won by being faster and more accurate at crucial moments (a human couldn’t duplicate the strategies AlphaStar used because they can’t perform 50 accurate actions in 5 seconds), and the APM comparison graph was trying to make people think the opposite. See this one for example:
Statistics aside, it was clear from the shocked reactions of the gamers, presenters, and audience to the Stalker micro, with everyone saying that no human player in the world could do what AlphaStar was doing. Using just-beside-the-point statistics is obfuscation and an avoidance of acknowledging this.
AlphaStar wasn’t outsmarting the humans—it’s not like TLO and MaNa slapped their foreheads and said, “I wish I’d thought of microing Stalkers that fast! Genius!”
Is there a reason to assume that DeepMind is being intentionally deceptive rather than making a good-intentioned mistake? After all, they claim that they consulted with TLO on the APM rate, which is something that they seem unlikely to lie about, since it would be easy for TLO to dispute that if it was untrue. So presumably TLO felt like the caps that they instituted were fair, and it would have been reasonable for DeepMind to trust a top player’s judgment on that, none of them (AFAIK) being top players themselves. And even with the existing caps, people felt like the version which played TLO would likely have lost to MaNa, and the camera-limited version did actually lose to him. So it seems reasonable for someone to think that a good enough player would still beat versions of AlphaStar even with the existing limits, and that they didn’t thus give it an unfair advantage.
On the other hand, while I can think of plenty of reasons for why DeepMind might have honestly thought that this setup was fair, I can’t think of any good reason for them to decide to be intentionally deceptive in this way. They’ve been open about their agent currently only playing Protoss vs. Protoss on a single map and of earlier versions seeing the whole map, and five of their games were played against a non-Protoss player. If they honestly felt that they only won because of the higher APM rate, then I don’t see why they wouldn’t just admit that the same way they’re forthcoming about all of the system’s other limitations.
I think it’s quite possible that when they instituted the cap they thought it was fair, however from the actual gameplay it should be obvious to anyone who is even somewhat familiar with Starcraft II (e.g., many members of the AlphaStar team) that AlphaStar had a large advantage in “micro”, which in part came from the APM cap still allowing superhumanly fast and accurate actions at crucial times. It’s also possible that the blogpost and misleading APM comparison graph were written by someone who did not realize this, but then those who did realize should have objected to it and had it changed after they noticed.
Possibly you’re not objecting to what I’m suggesting above but to the usage of “intentionally deceptive” to describe it. If so, I think you may have a point in that there may not have been any single human who had an intention to be deceptive (e.g., maybe those at DM who realized that the graph would be misleading lacked the power or incentives to make changes to it), in which case perhaps it doesn’t make sense to attribute deceptive intention to the organization. Maybe “knowingly misleading” would be a better phrase there to describe the ethical failure (assuming you agree that the episode most likely does constitute a kind of ethical failure)?
ETA: I note that many people on places like Hacker News and Reddit have pointed out the misleading nature of the APM comparison graph, with virtually no one defending DM on that point, so anyone at DM who is following those discussions should have realized it by now, but no one has come out and either admitted a mistake or offered a reasonable explanation. Multiple people have also decided that it was intentional deception (which does seem like a strong possibility to me too, since I think there’s a pretty high prior that the person who wrote the blog post would not be so unfamiliar with Starcraft II). Another piece of evidence that I noticed is that during the YouTube video one of the lead researchers talked about the APM comparison with the host, and also said at some point that he used to play Starcraft II. One Redditor describes it as “It’s not just the graphs, but during the conversation with Artosis the researcher was manipulating him.”
I think it’s quite possible that when they instituted the cap they thought it was fair, however from the actual gameplay it should be obvious to anyone who is even somewhat familiar with Starcraft II (e.g., many members of the AlphaStar team) that AlphaStar had a large advantage in “micro”, which in part came from the APM cap still allowing superhumanly fast and accurate actions at crucial times. It’s also possible that the blogpost and misleading APM comparison graph were written by someone who did not realize this, but then those who did realize should have objected to it and had it changed after they noticed.
It’s not so obvious to me that someone who realizes that AlphaStar is superior at “micro” should have objected to those graphs.
Think about it like this—you’re on the DeepMind team, developing AlphaStar, and the whole point is to make it superhuman at StarCraft. So there’s going to be some part of the game that it’s superhuman at, and to some extent this will be “unfair” to humans. The team decided to try not to let AlphaStar have “physical” advantages, but I don’t see any indication that they explicitly decided that it should not be better at “micro” or unit control in general, and should only win on “strategy”.
Also, separating “micro” from “strategy” is probably not that simple for a model-free RL system like this. So I think they made a very reasonable decision to focus on a relatively easy-to-measure APM metric. When the resulting system doesn’t play exactly as humans do, or in a way that would be easy for humans to replicate, to me it doesn’t seem so-obvious-that-you’re-being-deceptive-if-you-don’t-notice-it that this is “unfair” and that you should go back to the drawing board with your handicapping system.
It seems to me that which ways for AlphaStar to be superhuman are “fair” or “unfair” is to some extent a matter of taste, and there will be many cases that are ambiguous. To give a non “micro” example—suppose AlphaStar is able to better keep track of exactly how many units its opponent has (and at what hit point levels) throughout the game, than a human can, and this allows it to make just slightly more fine-grained decisions about which units it should produce. This might allow it to win a game in a way that’s not replicable by humans. It didn’t find a new strategy—it just executed better. Is that fair or unfair? It feels maybe less unfair than just being super good at micro, but exactly where the dividing line is between “interesting” and “uninteresting” ways of winning seems not super clear.
Of course, now that a much broader group of StarCraft players has seen these games, and a consensus has emerged that this super-micro does not really seem fair, it would be weird if DeepMind did not take that into account for its next release. I will be quite surprised if they don’t adjust their setup to reduce the micro advantage going forward.
When the resulting system doesn’t play exactly as humans do, or in a way that would be easy for humans to replicate, to me it doesn’t seem so-obvious-that-you’re-being-deceptive-if-you-don’t-notice-it that this is “unfair” and that you should go back to the drawing board with your handicapping system.
This is not the complaint that people (including me) have. Instead the complaint is that, given it’s clear that AlphaStar won mostly through micro, that graph highlighted statistics (i.e., average APM over the whole game, including humans spamming keys to keep their fingers warm) that would be irrelevant to SC2 experts for judging whether or not AlphaStar did win through micro, but would reliably mislead non-experts into thinking “no” on that question. Both of these effects should have been easy to foresee.
It seems very unlikely to me that they would have gotten any less publicity if they’d reported the APM restrictions any differently. (After all, they didn’t get any less publicity for reporting the system’s other limitations either, like it only being able to play Protoss v. Protoss on a single map, or 10⁄11 of the agents having whole-camera vision.)
After all, they didn’t get any less publicity for reporting the system’s other limitations either, like it only being able to play Protoss v. Protoss on a single map, or 10⁄11 of the agents having whole-camera vision.
They might well have gotten less publicity due to emphasizing those facts as much as they did.
To me deep mind is simply trying to paint themselves in the best light. I’m not particularly surprised by the behavior; I would expect it from a for-profit company looking to get PR. Nor am I particularly upset about the behavior; I don’t see any outright lying going on, merely an attempt to frame the facts in the best possible way for them.
Ok, I read it too after my comment above… And I thought that when a future evil superintelligence starts shooting at people on the streets, the same commenters will say: “No, it is not a superintelligence, it is just good at the tactical use of guns, and it just knows where humans are located and never misses, but its strategy is awful.” Or, in other words:
We’re not out. Certainly we’re not out of games—e.g. Magic: The Gathering, which would be a big leap.
For actual basic board games, the one I want to see is Stratego, actually; the only issue is I don’t know if there are humans who have bothered to master it.
This is not the complaint that people (including me) have. Instead the complaint is that, given it’s clear that AlphaStar won mostly through micro, that graph highlighted statistics (i.e., average APM over the whole game, including humans spamming keys to keep their fingers warm) that would be irrelevant to SC2 experts for judging whether or not AlphaStar did win through micro, but would reliably mislead non-experts into thinking “no” on that question. Both of these effects should have been easy to foresee.
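To make the statistical point concrete, here is a toy sketch with entirely made-up numbers (not DeepMind’s actual data): a human who spams at a steady rate and an agent that acts in short, precise bursts can have comparable whole-game average APM even though their peak burst rates differ enormously — which is exactly why a whole-game average is the wrong statistic for judging micro.

```python
# Toy per-second action counts over a 60-second window (invented numbers).
human = [5] * 60                # steady ~300 APM, much of it redundant spam
agent = [1] * 55 + [25] * 5     # mostly idle, then a 5-second micro burst

def mean_apm(actions_per_sec):
    """Whole-window average, scaled to actions per minute."""
    return 60 * sum(actions_per_sec) / len(actions_per_sec)

def peak_apm(actions_per_sec, window=5):
    """Highest APM over any sliding window of `window` seconds."""
    best = max(sum(actions_per_sec[i:i + window])
               for i in range(len(actions_per_sec) - window + 1))
    return 60 * best / window

print(mean_apm(human), peak_apm(human))   # 300.0 300.0
print(mean_apm(agent), peak_apm(agent))   # 180.0 1500.0
```

On these numbers the agent’s whole-game average (180 APM) looks modest next to the human’s (300 APM), yet during the decisive engagement it acts five times faster. A graph showing only the averages would tell a non-expert the opposite of what matters.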
They’ve received a bunch of favorable publicity as a result.
It seems very unlikely to me that they would have gotten any less publicity if they’d reported the APM restrictions any differently. (After all, they didn’t get any less publicity for reporting the system’s other limitations either, like it only being able to play Protoss v. Protoss on a single map, or 10/11 of the agents having whole-camera vision.)
They might well have gotten less publicity due to emphasizing those facts as much as they did.
To me, DeepMind is simply trying to paint itself in the best light. I’m not particularly surprised by the behavior; I would expect it from a for-profit company looking for PR. Nor am I particularly upset by it; I don’t see any outright lying going on, merely an attempt to frame the facts in the best possible way for them.
Ok, I read it too after my comment above… And I thought that when a future evil superintelligence starts shooting at people in the streets, the same commenters will say: “No, it is not a superintelligence, it is just good at the tactical use of guns, and it just knows where humans are located and never misses, but its strategy is awful.” Or, in other words:
Weak strategy + perfect skills = dangerous AI
What especially worries me is that the same type of cheating could happen during the safety evaluation of future, more advanced AIs.
Could we train AlphaZero on all games it could play at once, then find the rule set its learning curve looks worst on?
Why would you want to do that?
Because “unfortunately” we are out of boardgames, and this might find another one.
We’re not out. Certainly we’re not out of games in general: Magic: The Gathering, for example, would be a big leap.
For actual basic board games, the one I most want to see is Stratego; the only issue is that I don’t know whether any humans have bothered to master it.