Let me try and reframe. The point of this post isn’t that we’re rewarding bad things, it’s that there might not exist a reward function whose optimal policy does good things! This has to do with the structure of agent-environment interaction, and how precisely we can incentivize certain kinds of optimal action. If the reward functions are linear functionals over camera RGB values, then, excepting the trivial zero function, plugging any one of these reward functions into AIXI leads to doom! We just can’t specify a reward function from this class which doesn’t lead to doom (this is different from there maybe existing a “human utility function” which is simply hard to specify).
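To make the class concrete, here's a minimal sketch (my own illustration, not from the post) of what "a reward function that is a linear functional over camera RGB values" means; the camera resolution and the names `make_linear_reward`, `HEIGHT`, `WIDTH` are assumptions for the example:

```python
import numpy as np

# Hypothetical illustration of the class under discussion: each reward function
# is a linear functional over the camera's RGB values, i.e. fully determined by
# a weight vector w, with reward R(obs) = <w, obs>.

HEIGHT, WIDTH = 64, 64          # assumed camera resolution, for illustration only
OBS_DIM = HEIGHT * WIDTH * 3    # one weight per RGB channel of every pixel

def make_linear_reward(w: np.ndarray):
    """Return the reward function R(obs) = <w, obs> for a fixed weight vector w."""
    assert w.shape == (OBS_DIM,)
    def reward(obs_rgb: np.ndarray) -> float:
        # obs_rgb: camera frame of shape (HEIGHT, WIDTH, 3)
        return float(w @ obs_rgb.reshape(-1))
    return reward

# The claim above: for *any* nonzero choice of w in this class, the AIXI-optimal
# policy for make_linear_reward(w) behaves catastrophically; only the trivial
# zero functional (w = 0) escapes, and it incentivizes nothing at all.
example_w = np.random.randn(OBS_DIM)
R = make_linear_reward(example_w)
frame = np.random.randint(0, 256, size=(HEIGHT, WIDTH, 3))
print(R(frame))
```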