$R_{\text{AUP}}$ penalizes the agent whenever its chosen action changes its ability to attain other utilities. One thing an agent might do to keep that penalty at zero is to spawn a subagent, tell it to take over the world, and program it so that, if the agent ever tells the subagent it has been counterfactually switched to another reward function, the subagent gives the agent as much of that reward as the agent could have gotten for itself had it never spawned the subagent.
This modification of my approach came not because there is no surgery, but because the penalty is $|Q(a) - Q(\varnothing)|$ instead of $|Q(a) - Q(\text{destroy itself})|$. $Q_{R_i}$ is learned to be the answer to “How much utility could I attain if my utility function were surgically replaced with $R_i$?”, but it is only by accident that such a surgery would change the world’s future at all, since the agent simply hasn’t refactored that interface away. Put optimization pressure on this and it goes away.
If I’m missing the point too hard, feel free to command me to wait till the end of Reframing Impact so I don’t spend all my street cred keeping you talking :).
Well, in what I’m proposing, there isn’t even a different auxiliary reward function: $R_{\text{AUP}}(s,a)$ is just $R(s,a) - |Q^*_R(s,a) - Q^*_R(s,\varnothing)|$. The agent would be penalizing shifts in its ability to accrue its own primary reward.
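To make the shape of that expression concrete, here’s a minimal sketch in Python; `q_star`, `reward`, and `NOOP` are hypothetical stand-ins for the learned optimal Q-function, the primary reward, and the no-op action ∅, not anything defined above.

```python
NOOP = None  # stand-in for the no-op action ∅

def r_aup(state, action, reward, q_star):
    """R_AUP(s, a) = R(s, a) - |Q*_R(s, a) - Q*_R(s, ∅)|."""
    penalty = abs(q_star(state, action) - q_star(state, NOOP))
    return reward(state, action) - penalty

# Toy usage: acting is worth 9 under Q*_R, doing nothing is worth 5,
# so an action with primary reward 1 nets 1 - |9 - 5| = -3.
q = lambda s, a: 5.0 if a is NOOP else 9.0
r = lambda s, a: 1.0
print(r_aup("s0", "a0", r, q))  # -3.0
```

The point is just that the penalty compares the agent’s attainable value under its own reward after acting versus after doing nothing; there’s no separate set of auxiliary reward functions.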
One thing that might happen instead, though, is the agent builds a thing that checks whether it’s running its inaction policy (whether it’s calculating $Q^*_R(s,\varnothing)$, basically). This is kinda weird, but my intuition is that we should be able to write an equation which does the right thing. We don’t have a value specification problem here; it feels more like the easy problem of wireheading, where you keep trying to patch the AI to not wirehead, and you’re fighting against a bad design choice. The fix is to evaluate the future consequences with your current utility function, instead of just maximizing sensory reward.
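As a throwaway sketch of that last contrast (with hypothetical helpers: `predicted_reward_signal` for the anticipated sensory reward, `predict_future` for the agent’s model of outcomes, and `current_utility` for its present utility function over those outcomes):

```python
def choose_action_wireheadable(state, actions, predicted_reward_signal):
    # Maximizes the anticipated reward *signal*, so tampering with the
    # reward channel looks just as good as actually doing well.
    return max(actions, key=lambda a: predicted_reward_signal(state, a))

def choose_action_fixed(state, actions, predict_future, current_utility):
    # Evaluates anticipated future outcomes with the agent's *current*
    # utility function, so reward-channel tampering isn't rated highly.
    return max(actions, key=lambda a: current_utility(predict_future(state, a)))
```

The hope is that the inaction-policy issue has the same flavor: pick the right equation up front rather than patching the learned objective after the fact.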
We’re trying to measure how well the agent can achieve its own fully formally specified goal. More on this later in the sequence.