I’m a bit confused, but here’s my current understanding and questions:
1. You’re mostly talking about partially observable Markov decision processes (POMDPs)
2. The link above has rewards given by the environment that go from (State, Action) to a real number, while a Markovian observation-based reward function is given by the agent itself (?) and goes from an Observation to a real number? (See the sketch of the two signatures below.)
What’s an n-step version of one?
I have a few other questions, but they depend on whether the reward is given by the agent based on observations or by the environment based on its actual state.
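For concreteness, here is a minimal Python sketch of the two signatures being asked about; the type names are placeholders of mine, not anything from the linked post:

```python
from typing import Callable

# Placeholder types, purely for illustration (hypothetical, not from the post).
State = int          # the environment's true state
Action = int
Observation = bytes  # e.g. a camera frame

# Reward given by the environment, over its true state and the last action:
#   R_env : State x Action -> R
EnvReward = Callable[[State, Action], float]

# Markovian observation-based reward, over a single observation only:
#   R_obs : Observation -> R
ObsReward = Callable[[Observation], float]
```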
I was thinking of it being over observations, but having it be over States × Actions leads to a potentially different outcome. An n-step version is: your reward function is a mapping $O^n \mapsto \mathbb{R}$ (you’re grading the last n observations jointly). E.g., in Atari DRL you might see the last four frames being fed to the agent as an approximation (since the games might well be 4-step Markovian; that is, the four previous time steps fully determine what happens next).
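To make the $O^n \mapsto \mathbb{R}$ idea concrete, here is a rough Python sketch, assuming NumPy-array observations; `make_n_step_reward` and `grade` are names I’m making up for illustration, not anything from the post or from a DRL library:

```python
from collections import deque
from typing import Callable, Deque, Tuple

import numpy as np

def make_n_step_reward(
    grade: Callable[[Tuple[np.ndarray, ...]], float],
    n: int = 4,
) -> Callable[[np.ndarray], float]:
    """Wrap a function that grades the last n observations jointly (R : O^n -> R)."""
    window: Deque[np.ndarray] = deque(maxlen=n)

    def reward(observation: np.ndarray) -> float:
        window.append(observation)
        if len(window) < n:
            return 0.0  # not enough history yet; this convention is arbitrary
        return grade(tuple(window))  # grade the last n observations jointly

    return reward

# Example: reward 1 only when average brightness rose across the 4-observation
# window, loosely analogous to the 4-frame stacking used for Atari agents.
r = make_n_step_reward(lambda frames: float(frames[-1].mean() > frames[0].mean()), n=4)
```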
Thanks!
So observation-based rewards lead to bad behavior when the rewarded observation is consistent with several different underlying states (with at least one of those states being undesired)?
And a fully observable environment doesn’t have that problem because you always know which state you’re in? If so, wouldn’t you still be rewarded by observations and incentivized to show yourself blue images forever?
Also, an agent in a fully observable environment will still choose to wirehead if that’s a possibility, correct?
Let me try and reframe. The point of this post isn’t that we’re rewarding bad things; it’s that there might not exist a reward function whose optimal policy does good things! This has to do with the structure of agent-environment interaction, and with how precisely we can incentivize certain kinds of optimal action. If the reward functions are linear functionals over camera RGB values, then, excepting the trivial zero function, plugging any one of these reward functions into AIXI leads to doom! We just can’t specify a reward function from this class which doesn’t (this is different from there maybe existing a “human utility function” which is simply hard to specify).
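As a concrete, made-up member of that class, assuming a flattened H×W×3 camera frame: a “total blueness” functional that weights only the blue channel. This is my own illustration, not a reward function from the post:

```python
import numpy as np

H, W = 64, 64  # assumed camera resolution, purely for illustration

def linear_reward(frame_rgb: np.ndarray, w: np.ndarray) -> float:
    """A linear functional over camera RGB values: R(o) = w . o, with o the flattened frame."""
    return float(w @ frame_rgb.reshape(-1))

# One such functional: weight only the blue channel, so reward = total blue intensity.
w_blue = np.zeros((H, W, 3))
w_blue[..., 2] = 1.0
w_blue = w_blue.reshape(-1)

# An optimizer of this reward is pushed toward filling its camera with blue
# (e.g. showing itself blue images), regardless of the underlying world state.
```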