I was thinking of it being over observations, but having it be over States × Actions leads to a potentially different outcome. An n-step version would be a reward function that is a mapping O^n ↦ ℝ (you're grading the last n observations jointly). E.g., in Atari DRL you might see the last four frames being fed to the agent as an approximation (since the games might well be 4-step Markovian; that is, the four previous time steps fully determine what happens next).
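To make that concrete, here's a toy sketch of an n-step, observation-based reward (the frame shape, the `frame_stack_reward` name, and the brightness-based score are all made up for illustration, not from the post):

```python
import numpy as np
from collections import deque

N = 4  # e.g. the four stacked Atari frames

def frame_stack_reward(frames):
    """Toy O^n -> R reward: grades the last N observations jointly
    (here, just their average brightness)."""
    assert len(frames) == N
    return float(np.mean([f.mean() for f in frames]))

# The agent keeps a rolling window of its last N observations.
history = deque(maxlen=N)
for _ in range(10):
    obs = np.random.rand(84, 84, 3)   # stand-in for one camera frame
    history.append(obs)
    if len(history) == N:
        r = frame_stack_reward(list(history))
```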
So observation-based rewards lead to bad behavior when the rewarded observation maps to different states (with at least one of those states being undesired)?
And a fully observable environment doesn’t have that problem because you always know which state you’re in? If so, wouldn’t you still be rewarded by observations and incentivized to show yourself blue images forever?
Also, an agent in a fully-observable environment will still choose to wirehead if that's a possibility, correct?
Let me try and reframe. The point of this post isn't that we're rewarding bad things, it's that there might not exist a reward function whose optimal policy does good things! This has to do with the structure of agent-environment interaction, and how precisely we can incentivize certain kinds of optimal action. If the reward functions are linear functionals over camera RGB values, then, excepting the trivial zero function, plugging any one of these reward functions into AIXI leads to doom! We just can't specify a reward function from this class which doesn't (this is different from the situation where a "human utility function" exists but is simply hard to specify).
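A minimal sketch of what I mean by that class, assuming an 84×84 RGB camera and an arbitrary weight vector w (both illustrative choices, not from the post):

```python
import numpy as np

# Each reward function in the class is a linear functional over the
# camera's RGB values, i.e. R(obs) = <w, obs> for some fixed w.
H, W = 84, 84
w = np.random.randn(H * W * 3)          # one member of the class

def linear_reward(obs):
    """obs: (H, W, 3) array of RGB values in [0, 1]."""
    return float(w @ obs.reshape(-1))

# The worry: for any non-zero w there is some constant image (e.g. a pure
# colour maximising <w, obs>) the agent can show itself forever, so the
# optimal policy seizes control of the camera rather than doing anything useful.
blue_screen = np.zeros((H, W, 3)); blue_screen[..., 2] = 1.0
print(linear_reward(blue_screen))
```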
Thanks!