In one comical case, AlphaStar had surrounded the units it was producing with its own factories, so that they couldn’t get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it, before it had moved the units out!
It seems like AlphaStar played 90 ladder matches as Terran:
30 with the initial policy trained with SL
30 with a policy from the middle of training
30 from the final RL policy.
This sounds like the kind of mistake that the SL policy would definitely make (no reason it should be able to recover), whereas it’s not clear whether RL would learn how to recover (I would expect it to, but not too strongly).
If it’s easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I’ve been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar’s skill.
It looks like they released 90 replays and played 90 ladder games so it should be possible to check.
The replays are here, hosted on the DM site, sorted into three folders based on the policy. If it’s one of the SL matches, it’s either AlphaStarSupervised_013_TvT.SC2Replay or one of _017_, _019_, or _022_ (based on being TvT and on Kairos Junction). The video in question is here. I’d check if I had SC2 installed.
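(For anyone else without the game client installed: the check can also be scripted. Here's a rough sketch using the sc2reader Python library; the load_replay / map_name attribute names are my recollection of its API, so treat this as untested.)

```python
# Rough sketch: list the AlphaStar SL replays that are TvT (per the
# filename) and were played on Kairos Junction. Assumes the sc2reader
# library (pip install sc2reader); attribute names are my best guess
# at its API, not verified against these specific replay files.
import glob

import sc2reader

for path in sorted(glob.glob("AlphaStarSupervised_*_TvT.SC2Replay")):
    replay = sc2reader.load_replay(path)
    if replay.map_name == "Kairos Junction":
        print(path)
```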
(Of course better still would be to find a discussion of the 30 RL replays, from someone who understands the game. Maybe that’s been posted somewhere, I haven’t looked and it’s hard to know who to trust.)
The replay for the match in that video is AlphaStarMid_042_TvT.SC2Replay, so it’s from the middle of training.
Here is the relevant screen capture: https://i.imgur.com/POFhzfj.png
Thanks! That’s only marginally less surprising than the final RL policy, and I suspect the final RL policy will make the same kind of mistake. Seems like the OP’s example was legit and I overestimated the RL agent.
I’m not sure how surprised to be about middle of training, versus final RL policy. Are you saying that this sort of mistake should be learned quickly in RL?
I don’t have a big difference in my model of mid vs. final: they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn’t refined enough to appreciate those differences. For any particular dumb mistake, I’d be surprised if the line between not making it and making it fell within that particular doubling.