Planned summary:
This post makes the point that, for Markovian reward functions defined on observations, any given observation can correspond to multiple underlying states, so we cannot tell just by analyzing the reward function whether it actually leads to good behavior: that also depends on the environment. For example, suppose we want an agent to collect all of the blue blocks in a room together. We might simply reward it for having blue in its observations: this might work great if the agent can only pick up and move blocks, but won't work well if the agent has a paintbrush and blue paint. This makes the reward designer's job much more difficult. However, the designer could use techniques that don't require a reward on individual observations, such as rewards that can depend on the agent's internal cognition (as in iterated amplification), or rewards that can depend on histories (as in Deep RL from Human Preferences).
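The blue-blocks example can be made concrete with a toy sketch (all names here are hypothetical, not from the original post): two distinct underlying states, one intended and one not, render to the same observation, so any Markovian reward defined only on the observation must assign them the same value.

```python
# Toy illustration: a reward on observations cannot separate states
# that produce identical observations.

def render(state: str) -> float:
    """Map an underlying state to the fraction of blue the agent observes.
    Both the intended state (blocks gathered) and the unintended one
    (room painted blue) look maximally blue to the agent."""
    if state in ("blocks_gathered", "room_painted_blue"):
        return 1.0
    return 0.1  # blocks scattered, mostly out of view

def observation_reward(obs: float) -> float:
    """Markovian reward defined purely on the observation: 'more blue is better'."""
    return obs

# The reward function ranks the unintended outcome as highly as the intended one:
for state in ("blocks_gathered", "room_painted_blue", "blocks_scattered"):
    print(state, observation_reward(render(state)))
```

Whether this reward produces good behavior thus depends entirely on whether the environment makes "room_painted_blue" reachable (e.g. whether the agent has a paintbrush), which is exactly the post's point.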
Planned opinion:
I certainly agree that we want to avoid reward functions defined on observations, and this is one reason why. It seems like a more general version of the wireheading argument to me, and applies even if you think that the AI won’t be able to wirehead, as long as it is capable enough to find other plans for getting high reward besides the one the designer intended.