I’m a bit confused, but here’s my current understanding and questions:
1. You’re mostly talking about partially observable Markov decision processes (POMDPs)
2. The link above has rewards given by the environment that go from (State, Action) to a real number, while a Markovian observation-based reward function is given by the agent itself (?) and goes from an Observation to a real number? (See the sketch of the two signatures below.)
What’s an n-step version of one?
I have a few other questions, but they depend on whether the reward is given by the agent based on observations or by the environment based on its actual state.
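For concreteness, here is a minimal Python sketch of the two signatures being asked about; the type names are placeholders of mine, not anything from the linked post:

```python
from typing import Callable

# Placeholder types, purely for illustration (hypothetical, not from the post).
State = int          # the environment's true state
Action = int
Observation = bytes  # e.g. a camera frame

# Reward given by the environment, over its true state and the last action:
#   R_env : State x Action -> R
EnvReward = Callable[[State, Action], float]

# Markovian observation-based reward, over a single observation only:
#   R_obs : Observation -> R
ObsReward = Callable[[Observation], float]
```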
I was thinking of it being over observations, but having it be over States × Actions leads to a potentially different outcome. An n-step version is: your reward function is a mapping $O^n \mapsto \mathbb{R}$ (you’re grading the last n observations jointly). E.g., in Atari DRL you might see the last four frames being fed to the agent as an approximation (since the games might well be 4-step Markovian; that is, the four previous time steps fully determine what happens next).
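To make the $O^n \mapsto \mathbb{R}$ idea concrete, here is a rough Python sketch, assuming NumPy-array observations; `make_n_step_reward` and `grade` are names I’m making up for illustration, not anything from the post or from a DRL library:

```python
from collections import deque
from typing import Callable, Deque, Tuple

import numpy as np

def make_n_step_reward(
    grade: Callable[[Tuple[np.ndarray, ...]], float],
    n: int = 4,
) -> Callable[[np.ndarray], float]:
    """Wrap a function that grades the last n observations jointly (R : O^n -> R)."""
    window: Deque[np.ndarray] = deque(maxlen=n)

    def reward(observation: np.ndarray) -> float:
        window.append(observation)
        if len(window) < n:
            return 0.0  # not enough history yet; this convention is arbitrary
        return grade(tuple(window))  # grade the last n observations jointly

    return reward

# Example: reward 1 only when average brightness rose across the 4-observation
# window, loosely analogous to the 4-frame stacking used for Atari agents.
r = make_n_step_reward(lambda frames: float(frames[-1].mean() > frames[0].mean()), n=4)
```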
Thanks!
So observation-based rewards lead to bad behavior when the rewarded observation is consistent with several different underlying states (with at least one of those states being undesired)?
And a fully observable environment doesn’t have that problem because you always know which state you’re in? If so, wouldn’t you still be rewarded by observations and incentivized to show yourself blue images forever?
Also, an agent in a fully observable environment will still choose to wirehead if that’s a possibility, correct?
Let me try and reframe. The point of this post isn’t that we’re rewarding bad things; it’s that there might not exist a reward function whose optimal policy does good things! This has to do with the structure of agent-environment interaction, and with how precisely we can incentivize certain kinds of optimal action. If the reward functions are linear functionals over camera RGB values, then, excepting the trivial zero function, plugging any one of these reward functions into AIXI leads to doom! We just can’t specify a reward function from this class which doesn’t (this is different from there maybe existing a “human utility function” which is simply hard to specify).
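As a concrete, made-up member of that class, assuming a flattened H×W×3 camera frame: a “total blueness” functional that weights only the blue channel. This is my own illustration, not a reward function from the post:

```python
import numpy as np

H, W = 64, 64  # assumed camera resolution, purely for illustration

def linear_reward(frame_rgb: np.ndarray, w: np.ndarray) -> float:
    """A linear functional over camera RGB values: R(o) = w . o, with o the flattened frame."""
    return float(w @ frame_rgb.reshape(-1))

# One such functional: weight only the blue channel, so reward = total blue intensity.
w_blue = np.zeros((H, W, 3))
w_blue[..., 2] = 1.0
w_blue = w_blue.reshape(-1)

# An optimizer of this reward is pushed toward filling its camera with blue
# (e.g. showing itself blue images), regardless of the underlying world state.
```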