I think I followed you 75% to 80% of the way with the math. Would it be fair to say that your main point is that, because certain combinations of rewards and mappings will always produce the same set of actions, you can’t exactly know how an agent values things?
One thing I couldn’t tell was whether you addressed how many compatible pairs of mappings and reward functions can exist for an agent. In your third-to-last paragraph, you say that “it seems we can’t say anything about the human reward function,” yet if there is only a finite number of compatible pairs, it seems we’ve gained at least some knowledge about what the agent might value.
The model m(3) is compatible with any reward function, so any reward function R can be valid for the agent. Now, it’s true that this pair (m(3), R) can be quite complex (since m(3) is very complex), but any R is compatible (and most m’s are also compatible: any m that maps some R to π(h), technically, and “almost all” m’s are surjective).
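To make that concrete, here is a minimal Python sketch of the point (the names `degenerate_planner` and `compatible`, and the toy policy, are my own illustration, not notation from the post): a planner that ignores its reward argument and always returns π(h) passes the compatibility check for every reward function, which is why observing the policy alone tells you nothing about R.

```python
# Minimal sketch: a degenerate planner in the style of m(3) that ignores the
# reward entirely, and is therefore "compatible" with every reward function
# under the criterion planner(R) == observed_policy. All names here are
# illustrative assumptions, not the post's own code.

from typing import Callable, Dict

State, Action = str, str
Policy = Dict[State, Action]                 # the observed human policy pi(h)
Reward = Callable[[State, Action], float]    # a reward function R

# A made-up observed policy pi(h), purely for illustration.
observed_policy: Policy = {"s0": "left", "s1": "right"}

def degenerate_planner(reward: Reward) -> Policy:
    """m(3)-style planner: returns pi(h) no matter which reward it is handed."""
    return observed_policy

def compatible(planner: Callable[[Reward], Policy], reward: Reward) -> bool:
    """(planner, reward) is compatible iff it reproduces the observed policy."""
    return planner(reward) == observed_policy

# Two wildly different rewards are both "valid" for the agent under m(3):
r_nice: Reward = lambda s, a: 1.0
r_nasty: Reward = lambda s, a: -1.0
assert compatible(degenerate_planner, r_nice)
assert compatible(degenerate_planner, r_nasty)
```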