It seems to construct an estimate of it by averaging a huge number of observations together before each update (for Dota 5v5, they say each batch is around a million observations, and I’m guessing it processes about a million batches). The surprising thing is that this works so well, and it allows leveraging of computational resources very easily.
My guess for how it deals with partial observability in a more philosophical sense is that it must be able to store an implicit model of the world in some way, in order to better predict the reward it will eventually observe. I’m beginning to wonder if the distinction between partial and full observability isn’t very useful after all. Even with AlphaGo, even though it can see the whole board, there are also a whole bunch of “spaces” it can’t see fully, possibly the entire action space, the space of every possible game trajectory, or the mathematical structure of play strategies. And yet, it managed to infer enough about those spaces to become good at the game.
A couple of guesses for why we might see this, which don’t seem to depend on property:
An obligation to act is much more freedom-constraining than a prohibition on an action. The more and more one considers all possible actions with the obligation to take the most ethically optimal one, the less room they have to consider exploration, contemplation, or pursuing their own selfish values. Prohibition on actions does not have this effect.
The environment we evolved in had roughly the same level of opportunity to commit harmful acts, bur far less opportunity to take positive consequentialist action (and far less complicated situations to deal with). It was always possible to hurt your friends and suffer consequences, but it was rare to have to think about the long term consequences of every action.
The consequences of killing, stealing, and hurting people are easier to predict than altruistic actions. Resources are finite, therefore sharing them can be harmful or beneficial, depending on the circumstances and who they are shared with. Other people can defect or refuse to reciprocate. If you hurt someone, they are almost guaranteed to retaliate. If you help someone, there is no guarantee there will be a payoff for you.