It seems to construct an estimate of it by averaging a huge number of observations together before each update (for Dota 5v5, they say each batch is around a million observations, and I’m guessing it processes about a million batches). The surprising thing is that this works so well, and it allows leveraging of computational resources very easily.
My guess for how it deals with partial observability in a more philosophical sense is that it must be able to store an implicit model of the world in some way, in order to better predict the reward it will eventually observe. I’m beginning to wonder if the distinction between partial and full observability isn’t very useful after all. Even with AlphaGo, even though it can see the whole board, there are also a whole bunch of “spaces” it can’t see fully, possibly the entire action space, the space of every possible game trajectory, or the mathematical structure of play strategies. And yet, it managed to infer enough about those spaces to become good at the game.
After thinking about this more, I have a hypothesis for how it works:
It records the sequence of states resulting from play, to be used for learning when the game is done. It has some Q value estimator, which given a state and an action, returns the expected utility of taking that action in that state (of course, this expected utility depends on the policies both players are using). Using the sequence of states and the actions (including exploration actions), it makes an update to this Q value estimator.
But this Q value estimator is not directly usable in determining a policy, since the Q value estimator needs to know the full state. Instead, the Q value estimator is used to compute a gradient update for the policy. After the game, the Q value estimator can tell you the value of taking each possible action at each actual time step in the game (since all game states are known at the end of the game); this can be used to update the policy so it is more likely to take the actions that are estimated to be better after the game.
I’m not sure if this hypothesis is correct, but I don’t currently see anything wrong with this algorithm. Thank you for causing me to think of this.
(BTW, there is a well-defined difference between full and partial observability, see MDP vs POMDP. Convergence theorems for vanilla Q learning will require the game to be a MDP rather than a POMDP.)
It seems to construct an estimate of it by averaging a huge number of observations together before each update (for Dota 5v5, they say each batch is around a million observations, and I’m guessing it processes about a million batches). The surprising thing is that this works so well, and it allows leveraging of computational resources very easily.
My guess for how it deals with partial observability in a more philosophical sense is that it must be able to store an implicit model of the world in some way, in order to better predict the reward it will eventually observe. I’m beginning to wonder if the distinction between partial and full observability isn’t very useful after all. Even with AlphaGo, even though it can see the whole board, there are also a whole bunch of “spaces” it can’t see fully, possibly the entire action space, the space of every possible game trajectory, or the mathematical structure of play strategies. And yet, it managed to infer enough about those spaces to become good at the game.
After thinking about this more, I have a hypothesis for how it works:
It records the sequence of states resulting from play, to be used for learning when the game is done. It has some Q value estimator, which given a state and an action, returns the expected utility of taking that action in that state (of course, this expected utility depends on the policies both players are using). Using the sequence of states and the actions (including exploration actions), it makes an update to this Q value estimator.
But this Q value estimator is not directly usable in determining a policy, since the Q value estimator needs to know the full state. Instead, the Q value estimator is used to compute a gradient update for the policy. After the game, the Q value estimator can tell you the value of taking each possible action at each actual time step in the game (since all game states are known at the end of the game); this can be used to update the policy so it is more likely to take the actions that are estimated to be better after the game.
I’m not sure if this hypothesis is correct, but I don’t currently see anything wrong with this algorithm. Thank you for causing me to think of this.
(BTW, there is a well-defined difference between full and partial observability, see MDP vs POMDP. Convergence theorems for vanilla Q learning will require the game to be a MDP rather than a POMDP.)