Yeah, DeepMind seems to be behaving unethically here (i.e., being deliberately deceptive), since a bunch of people there must understand StarCraft II well enough to know that the APM graph/comparison is misleading, but the blog post still used it prominently to try to show AlphaStar in a better light. Hopefully it’s some sort of misunderstanding or honest mistake that they’ll clear up during the Reddit AMA, but if not it seems like another bad sign for the future of AI governance, comparable to the Baidu cheating scandal.
Apparently as a result of this analysis, DeepMind has edited the caption in the graph:
The distribution of AlphaStar’s APMs in its matches against MaNa and TLO and the total delay between observations and actions. CLARIFICATION (29/01/19): TLO’s APM appears higher than both AlphaStar and MaNa because of his use of rapid-fire hot-keys and use of the “remove and add to control group” key bindings. Also note that AlphaStar’s effective APM bursts are sometimes higher than both players.
“We consulted with TLO and Blizzard about APMs, and also added a hard limit to APMs. In particular, we set a maximum of 600 APMs over 5 second periods, 400 over 15 second periods, 320 over 30 second periods, and 300 over 60 second period. If the agent issues more actions in such periods, we drop / ignore the actions. ”
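To make the quoted caps concrete, here is a minimal sketch of a limiter that enforces all four of them at once. It assumes the caps apply to trailing (sliding) windows and that an "APM over N seconds" cap translates into an absolute action count for that window; the quote does not say how the windows are actually implemented, and the names `ApmLimiter`, `max_actions`, and `try_action` are made up for illustration.

```python
from collections import deque

# (window length in seconds, cap in actions per minute), as quoted above.
APM_CAPS = [(5, 600), (15, 400), (30, 320), (60, 300)]


def max_actions(window_s, apm):
    """Convert an APM cap over a window into an absolute action count."""
    return apm * window_s // 60


class ApmLimiter:
    """Hypothetical limiter: drops any action that would exceed a cap.

    Assumption: each cap is checked over a trailing window ending at the
    time of the attempted action. The real system may differ.
    """

    def __init__(self, caps=APM_CAPS):
        self.limits = [(w, max_actions(w, apm)) for w, apm in caps]
        self.longest = max(w for w, _ in self.limits)
        self.accepted = deque()  # timestamps (seconds) of accepted actions

    def try_action(self, t):
        # Forget accepted actions older than the longest window.
        while self.accepted and self.accepted[0] <= t - self.longest:
            self.accepted.popleft()
        # Reject if any window has already reached its action count.
        for window_s, limit in self.limits:
            recent = sum(1 for ts in self.accepted if ts > t - window_s)
            if recent >= limit:
                return False  # dropped / ignored
        self.accepted.append(t)
        return True


# The four caps as absolute counts per window: 50, 100, 160, 300 actions.
counts = [max_actions(w, apm) for w, apm in APM_CAPS]

# An agent attempting 100 actions per second for 5 seconds.
limiter = ApmLimiter()
accepted = sum(limiter.try_action(i * 0.01) for i in range(500))
```

Under these assumptions the 5-second cap works out to 50 actions, and nothing in it prevents all 50 from being spent within the first half second of a fight; the limiter only starts dropping actions once the window's budget is used up.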
That explanation doesn’t really absolve them very much in my mind. If you read people’s responses to that comment on Reddit, it seems clear that AlphaStar still largely won by being faster and more accurate at crucial moments (a human couldn’t duplicate the strategies AlphaStar used because they can’t perform 50 accurate actions in 5 seconds), and the APM comparison graph was trying to make people think the opposite. See this one for example:
Statistics aside, it was clear from the shocked reactions of the gamers, presenters, and audience to the Stalker micro, all saying that no human player in the world could do what AlphaStar was doing. Using just-beside-the-point statistics is obfuscation and a way of avoiding acknowledging this.
AlphaStar wasn’t outsmarting the humans—it’s not like TLO and MaNa slapped their foreheads and said, “I wish I’d thought of microing Stalkers that fast! Genius!”
Is there a reason to assume that DeepMind is being intentionally deceptive rather than making a well-intentioned mistake? After all, they claim that they consulted with TLO on the APM rate, which is something that they seem unlikely to lie about, since it would be easy for TLO to dispute that if it were untrue. So presumably TLO felt that the caps they instituted were fair, and it would have been reasonable for DeepMind to trust a top player’s judgment on that, none of them (AFAIK) being top players themselves. And even with the existing caps, people felt that the version which played TLO would likely have lost to MaNa, and the camera-limited version did actually lose to him. So it seems reasonable for someone to think that a good enough player would still beat versions of AlphaStar even with the existing limits, and thus that the limits didn’t give it an unfair advantage.
On the other hand, while I can think of plenty of reasons why DeepMind might have honestly thought that this setup was fair, I can’t think of any good reason for them to decide to be intentionally deceptive in this way. They’ve been open about their agent currently only playing Protoss vs. Protoss on a single map and about earlier versions seeing the whole map, and five of their games were played against a non-Protoss player. If they honestly felt that they only won because of the higher APM rate, then I don’t see why they wouldn’t just admit that the same way they’re forthcoming about all of the system’s other limitations.
I think it’s quite possible that when they instituted the cap they thought it was fair. However, from the actual gameplay it should be obvious to anyone who is even somewhat familiar with StarCraft II (e.g., many members of the AlphaStar team) that AlphaStar had a large advantage in “micro”, which in part came from the APM cap still allowing superhumanly fast and accurate actions at crucial times. It’s also possible that the blog post and misleading APM comparison graph were written by someone who did not realize this, but then those who did realize it should have objected and had it changed after they noticed.
Possibly you’re not objecting to what I’m suggesting above but to the usage of “intentionally deceptive” to describe it. If so, I think you may have a point in that there may not have been any single human who had an intention to be deceptive (e.g., maybe those at DM who realized that the graph would be misleading lacked the power or incentives to make changes to it), in which case perhaps it doesn’t make sense to attribute deceptive intention to the organization. Maybe “knowingly misleading” would be a better phrase there to describe the ethical failure (assuming you agree that the episode most likely does constitute a kind of ethical failure)?
ETA: I note that many people on places like Hacker News and Reddit have pointed out the misleading nature of the APM comparison graph, with virtually no one defending DM on that point, so anyone at DM who is following those discussions should have realized it by now, but no one has come out and either admitted a mistake or offered a reasonable explanation. Multiple people have also decided that it was intentional deception (which does seem like a strong possibility to me too, since I think there’s a pretty high prior that the person who wrote the blog post would not be so unfamiliar with StarCraft II). Another piece of evidence that I noticed is that during the YouTube video one of the lead researchers talked about the APM comparison with the host, and also said at some point that he used to play StarCraft II. One Redditor describes it as “It’s not just the graphs, but during the conversation with Artosis the researcher was manipulating him.”
It’s not so obvious to me that someone who realizes that AlphaStar is superior at “micro” should have objected to those graphs.
Think about it like this—you’re on the DeepMind team, developing AlphaStar, and the whole point is to make it superhuman at StarCraft. So there’s going to be some part of the game that it’s superhuman at, and to some extent this will be “unfair” to humans. The team decided to try not to let AlphaStar have “physical” advantages, but I don’t see any indication that they explicitly decided that it should not be better at “micro” or unit control in general, and should only win on “strategy”.
Also, separating “micro” from “strategy” is probably not that simple for a model-free RL system like this. So I think they made a very reasonable decision to focus on a relatively easy-to-measure APM metric. When the resulting system doesn’t play exactly as humans do, or in a way that would be easy for humans to replicate, to me it doesn’t seem so-obvious-that-you’re-being-deceptive-if-you-don’t-notice-it that this is “unfair” and that you should go back to the drawing board with your handicapping system.
It seems to me that which ways for AlphaStar to be superhuman are “fair” or “unfair” is to some extent a matter of taste, and there will be many cases that are ambiguous. To give a non-“micro” example: suppose AlphaStar is able to keep track of exactly how many units its opponent has (and at what hit point levels) throughout the game better than a human can, and this allows it to make slightly more fine-grained decisions about which units it should produce. This might allow it to win a game in a way that’s not replicable by humans. It didn’t find a new strategy; it just executed better. Is that fair or unfair? It feels maybe less unfair than just being super good at micro, but exactly where the dividing line lies between “interesting” and “uninteresting” ways of winning is not very clear.
Of course, now that a much broader group of StarCraft players has seen these games, and a consensus has emerged that this super-micro does not really seem fair, it would be weird if DeepMind did not take that into account for its next release. I will be quite surprised if they don’t adjust their setup to reduce the micro advantage going forward.
This is not the complaint that people (including me) have. Instead the complaint is that, given it’s clear that AlphaStar won mostly through micro, that graph highlighted statistics (i.e., average APM over the whole game, including humans spamming keys to keep their fingers warm) that would be irrelevant to SC2 experts for judging whether or not AlphaStar did win through micro, but would reliably mislead non-experts into thinking “no” on that question. Both of these effects should have been easy to foresee.
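To illustrate why a whole-game average says little about bursts, here is a small sketch comparing average APM with peak APM over a trailing 5-second window. The two action logs are entirely synthetic, not taken from the actual matches, and the function names and the choice of a 5-second window are assumptions made for the example.

```python
GAME_MS = 600_000  # a hypothetical 10-minute game, in milliseconds


def average_apm(timestamps_ms, game_ms):
    """Actions per minute averaged over the whole game."""
    return len(timestamps_ms) * 60_000 / game_ms


def peak_apm(timestamps_ms, window_ms=5_000):
    """Highest APM seen in any trailing window of window_ms milliseconds."""
    ts = sorted(timestamps_ms)
    best = 0
    start = 0
    for end, t in enumerate(ts):
        # Slide the left edge so the window covers (t - window_ms, t].
        while ts[start] <= t - window_ms:
            start += 1
        best = max(best, end - start + 1)
    return best * 60_000 / window_ms


# Synthetic log A: a steady 5 actions per second for the whole game,
# standing in for a player who spams keys at a constant rate.
steady = list(range(0, GAME_MS, 200))

# Synthetic log B: a calm 2 actions per second, plus one burst of
# 40 extra actions packed into about 4 seconds in the middle of the game.
bursty = list(range(0, GAME_MS, 500))
bursty += [300_050 + 100 * k for k in range(40)]

stats = {
    "steady": (average_apm(steady, GAME_MS), peak_apm(steady)),
    "bursty": (average_apm(bursty, GAME_MS), peak_apm(bursty)),
}
```

With these made-up logs the steady player averages 300 APM and never exceeds it, while the bursty one averages only 124 APM yet peaks at 600 APM. A chart of averages alone would rank them the wrong way around for the purpose of judging who has the micro advantage.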
Interesting. The article didn’t mention that.
They are now explaining this in the Reddit AMA: “We are capping APM. Blizzard in game APM applies some multipliers to some actions, that’s why you are seeing a higher number. https://github.com/deepmind/pysc2/blob/master/docs/environment.md#apm-calculation”
“Our network has about 70M parameters.”
AMA: https://www.reddit.com/r/MachineLearning/comments/ajgzoc/we_are_oriol_vinyals_and_david_silver_from/eexs0pd/?context=3
They’ve received a bunch of favorable publicity as a result.
It seems very unlikely to me that they would have gotten any less publicity if they’d reported the APM restrictions any differently. (After all, they didn’t get any less publicity for reporting the system’s other limitations either, like it only being able to play Protoss v. Protoss on a single map, or 10⁄11 of the agents having whole-camera vision.)
They might well have gotten less publicity due to emphasizing those facts as much as they did.
To me, DeepMind is simply trying to paint itself in the best light. I’m not particularly surprised by the behavior; I would expect it from a for-profit company looking to get PR. Nor am I particularly upset about the behavior; I don’t see any outright lying going on, merely an attempt to frame the facts in the best possible way for them.
Ok, I read it too after my comment above… And I thought that when a future evil superintelligence starts shooting at people in the streets, the same commenters will say: “No, it is not a superintelligence; it is just good at the tactical use of guns, and it just knows where humans are located and never misses, but its strategy is awful”. Or, in other words:
Weak strategy + perfect skills = dangerous AI
What especially worries me is that the same type of cheating could happen during safety evaluations of future, more advanced AI.