In Daniel’s example, it would be nice for P(A|ref) to be the action distribution of an “instinctual” or “non-optimising” player. I don’t know how to recover that, though. One option might be something like an n-gram model fitted to player inputs across the MMO.
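To make the n-gram idea concrete, here is a rough sketch of what I have in mind, not a worked-out proposal. It fits a smoothed n-gram model to logged action sequences and reads P(A|ref) off as the next-action probability given the last few actions. The log format, the action names, and the helper names `fit_ngram` and `reference_prob` are all made up for illustration, and add-alpha smoothing is just one simple choice.

```python
from collections import Counter, defaultdict

def fit_ngram(sequences, n=2):
    """Count how often each action follows each (n-1)-action context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            context = tuple(seq[i - n + 1:i])
            counts[context][seq[i]] += 1
    return counts

def reference_prob(counts, context, action, actions, alpha=1.0):
    """Add-alpha smoothed estimate of P(action | context) under the reference model."""
    c = counts.get(tuple(context), Counter())
    total = sum(c.values())
    return (c[action] + alpha) / (total + alpha * len(actions))

# Hypothetical input logs pooled across players (invented data, for illustration only).
logs = [
    ["move", "move", "attack", "loot"],
    ["move", "attack", "attack", "loot"],
]
actions = sorted({a for seq in logs for a in seq})
counts = fit_ngram(logs, n=2)

# Reference probability of attacking right after a move.
p_attack = reference_prob(counts, ["move"], "attack", actions)
```

Whether the pooled distribution really counts as “non-optimising” is the open question: it averages over everyone, including players who are optimising hard, so it is at best a proxy.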
Good point!