Contrary to most expectations, they didn’t need anything fundamentally new in order to get long-term strategic planning. I was particularly surprised by this. There are some interesting thoughts from OpenAI researchers in this thread; in particular, assuming good exploration, the variance of the gradient should scale linearly with the duration, so you might expect to need only linearly more samples to counteract this.
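To spell that argument out a bit (a back-of-the-envelope sketch of my own, not something taken from the thread): if the per-step contributions to a policy-gradient estimate are roughly independent, each with variance about σ², then one trajectory of length T gives a gradient estimate whose variance grows linearly in T, and averaging N trajectories divides it by N:

```latex
\mathrm{Var}\!\left[\hat{g}_{\text{one trajectory}}\right]
  \approx \sum_{t=1}^{T} \mathrm{Var}[g_t] = T\sigma^2,
\qquad
\mathrm{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N}\hat{g}^{(i)}\right]
  \approx \frac{T\sigma^2}{N}.
```

So holding the noise in the gradient at a fixed level as the horizon T grows only requires N to grow in proportion to T, which is the sense in which “linearly more samples” should be enough to counteract longer games.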
(Layman question)
I notice some confusion here (possibly because I may be conflating some things). I recall that when AlphaGo Zero came out, people who understood games and ML better than I did were saying they expected it not to transfer well to domains involving long term planning and hidden information. (I’m not 100% sure I’m remembering the “long term planning” thing right; I definitely remember the hidden information thing.)
I remember being surprised by that. I didn’t have a very clear model of what AlphaGo Zero was doing (I have a rough idea of how neural nets and gradient descent work, but I’m not clear on whether that’s even precisely what AGZ was using or, if so, how it “really worked”). But what model I did have suggested it was doing a sufficiently general thing that I didn’t see why it wouldn’t be able to handle hidden information or long term planning. And I was curious what would happen if you did literally the minimum re-architecting needed to get it to play something more complex, without any additional design.
I’m not sure how similar this architecture actually is to AlphaGo Zero, and I didn’t really follow much of the architecture of last year’s Dota bot. But my vague impression was that it’s at least in the same genre.
So:
a) I wanted to check my assumptions here: is this significantly different from AGZ?
b) is there hidden information here? (I would have assumed so, except that “they all observed the full Dota 2 state” might imply otherwise. Does that mean things that are supposed to be visible to a player, or the entire map?)
c) assuming “b” is “yes”, are the reasons people thought that long term planning and hidden information would require different/cleverer algorithms coherent and still reasonable? Like, at the time I had a gears-less belief that it’d probably be able to work at least somewhat, while other people who seemed to have more gears-based models than me seemed to think it’d require new algorithmic insights. Were their gears wrong? Were they right, just… not sufficient? (i.e. does the fact that this was easier than people expected imply anything interesting?)
they expected it not to transfer well to domains involving long term planning and hidden information.
AGZ was doing some long term planning (over the timescale of a Go game), and had no hidden information. It certainly was not clear whether similar techniques would work when trajectories were tens of thousands of steps long instead of hundreds. Similarly it wasn’t clear how to make things work with hidden information—you could try the same thing but it was plausible it wouldn’t work.
I didn’t see why it wouldn’t be able to handle hidden information or long term planning.
Yeah, I agree with this. This sounds mostly like a claim that it is more computationally expensive to deal with hidden information and long term planning.
Is this significantly different from AGZ?
Yes. There’s no Monte-Carlo Tree Search, because they want the agent to learn from experience (so the agent is not allowed to simulate how the game would go, because it doesn’t know the rules). The reward function is shaped, so that the agent actually gets feedback on how it is doing, whereas AGZ only had access to the binary win/loss signal. But I think it’s fair to say that it’s the same genre of approach.
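To make “shaped” concrete, here is a minimal illustrative sketch of the difference between the two kinds of signal; the particular terms and weights are hypothetical, not OpenAI Five’s actual reward function:

```python
# Illustrative sketch only: these terms and weights are made up,
# not OpenAI Five's actual reward function.

def binary_reward(game_over: bool, won: bool) -> float:
    """AGZ-style signal: no feedback until the game ends, then +1 or -1."""
    if not game_over:
        return 0.0
    return 1.0 if won else -1.0

def shaped_reward(prev: dict, curr: dict) -> float:
    """Dense per-tick feedback based on intermediate quantities the agent can affect."""
    return (
        0.5 * (curr["kills"] - prev["kills"])
        - 0.5 * (curr["deaths"] - prev["deaths"])
        + 0.1 * (curr["last_hits"] - prev["last_hits"])
        + 0.01 * (curr["net_worth"] - prev["net_worth"])
    )
```

The point is just that the agent gets a useful learning signal every few seconds of play rather than a single bit at the end of a long game.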
If they instead allowed themselves to say “the agent knows the rules of the game”, as in Go, such that it can simulate various branches of the same game, they probably could take an AGZ approach with a shaped reward function, and my guess is it would just work, probably faster than their current approach.
is there hidden information here? (I would have assumed so, except that “they all observed the full Dota 2 state” might imply otherwise. Does that mean things that are supposed to be visible to a player, or the entire map?)
Yes, there is. By “full Dota 2 state” I meant everything visible to all five players, not the entire map. This is more than humans have access to, but certainly not everything. And humans can get access to all of the information by communicating with other players on the team.
Were their gears wrong? Were they right, just… not sufficient? (i.e. does the fact that this was easier than people expected imply anything interesting?)
At this point I’m speculating really hard, but I think it was just that our gears were wrong somewhere. It’s possible that difficult tasks like Dota actually only have a small number of realistic strategies and they can each be learned individually. (Small here is relative to what their massive amount of compute can do; it’s plausible that it learned thousands of realistic strategies, by brute force and memorization. You could say the same about AGZ.) In this scenario, the gear that was wrong was the one that predicted “the space of strategies in complex tasks is so large that it can’t be memorized from experience”. (Whereas with Go, it seems more plausible that humans also rely on these sorts of intuitive mechanisms rather than strategic reasoning, so it’s not as surprising that a computer can match that.) It could also be that RL is actually capable of learning algorithms that reason symbolically/logically, though I wouldn’t bet on it. Or it could be that we’re still quite far from good Dota bots: OpenAI Five beats humans on micromanagement of actions, and has learned enough strategy not to be outclassed, but is still decidedly subhuman at long term planning, in which case the researchers’ gears actually were right. I don’t know.
Thank you, that was all very helpful.
Hi Rohin, are older versions of the newsletter available?
Also:
This sounds mostly like a claim that it is more computationally expensive to deal with hidden information and long term planning.
One consideration: When you are exploring a tree of possibilities, every bit of missing information means you need to double the size of the tree. So it could be that hidden information leads to an exponential explosion in search cost in the absence of hidden-information-specific search strategies. Although strictly speaking this is just a case of something being “more computationally expensive”, exponential penalties generically push things from being feasible to infeasible.
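To put rough numbers on that intuition (my own sketch, not specific to Dota): a depth-d search with branching factor b over a fully observed game evaluates on the order of b^d positions; if there are k unobserved binary facts and the search naively considers every possible assignment of them, the cost is multiplied by 2^k:

```latex
\underbrace{b^{d}}_{\text{fully observed}}
\;\longrightarrow\;
\underbrace{2^{k} \cdot b^{d}}_{k \text{ hidden bits, naive search}}
```

So even a modest amount of hidden information can push a previously feasible search past the feasibility boundary, unless the algorithm exploits structure in the hidden information rather than enumerating every possibility.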
They’re all available at his LW profile and also at his offsite blog.
Hey Jess, as Ben mentioned, I keep all newsletter-related things on my website.
I agree that in theory hidden information leads to an exponential explosion. In practice, I think you don’t need to search over all the exponentially many ways the hidden information could be in order to get good results. (At least, you don’t need to do that in order to beat humans, because humans don’t seem to do that.)
I think overall we agree though—when I said “it wasn’t clear how to make things work with hidden information—you could try the same thing but it was plausible it wouldn’t work”, I was primarily thinking that the computational cost might be too high. I was relatively confident that given unbounded compute, AlphaGo-style algorithms could deal with hidden information.