This modification of my approach came not because there is no surgery, but because the penalty is |Q(a) − Q(∅)| instead of |Q(a) − Q(destroy itself)|. Q_{R_i} is learned as the answer to “How much utility could I attain if my utility function were surgically replaced with R_i?”, but it is only by accident that such a surgery might change the world’s future, because the agent hasn’t refactored the interface away. Put optimization pressure on this, and that accident goes away.
Well, in what I’m proposing, there isn’t even a different auxiliary reward function - R_AUP is just R(s,a) − |Q*_R(s,a) − Q*_R(s,∅)|. The agent would be penalizing shifts in its ability to accrue its own primary reward.
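To make that concrete, here is a minimal Python sketch of the penalized reward, assuming we already have an estimate q_r of Q*_R and a designated no-op action; the names (aup_reward, q_r, noop) are illustrative, not part of the proposal itself.

```python
# Minimal sketch of the penalized reward above, assuming we already have a
# Q-value estimate q_r(s, a) for the primary reward R and a designated
# no-op action `noop`. All names here are illustrative only.

def aup_reward(r, q_r, s, a, noop):
    """R_AUP(s, a) = R(s, a) - |Q*_R(s, a) - Q*_R(s, noop)|."""
    penalty = abs(q_r(s, a) - q_r(s, noop))
    return r(s, a) - penalty
```

The only moving part is the penalty term: the agent is docked exactly for how much an action shifts its attainable primary reward relative to doing nothing.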
One thing that might happen instead, though, is the agent builds a thing that checks whether it’s running its inaction policy (whether it’s calculating Q*_R(s,∅), basically). This is kinda weird, but my intuition is that we should be able to write an equation which does the right thing. We don’t have a value specification problem here; it feels more like the easy problem of wireheading, where you keep trying to patch the AI to not wirehead, and you’re fighting against a bad design choice. The fix is to evaluate the future consequences with your current utility function, instead of just maximizing sensory reward.
We’re trying to measure how well the agent can achieve its own fully formally specified goal. More on this later in the sequence.