Not quite. I’m proposing penalizing it for gaining power, a la my recent post. There’s a big difference between “able to get 10 return from my current vantage point” and “I’ve taken over the planet and can ensure I get 100 return with high probability”. We’re penalizing it for increasing its ability like that (concretely, see Conservative Agency for an analogous formalization, or if none of this makes sense yet, wait till the end of Reframing Impact).
Assessing its ability to attain various utilities after an action requires that you surgically replace its utility function with a different one in a world it has impacted. How do you stop it from messing with the interface, such as by passing its power to a subagent to make your surgery do nothing?
It doesn’t require anything like that. Check out R_AUP in the linked paper!
Under R_AUP, the agent is penalized whenever the action it chooses changes its ability to attain other utilities. One thing an agent might do to keep that penalty at zero is to spawn a subagent, tell it to take over the world, and program it so that, if the agent ever tells the subagent it has been counterfactually switched to another reward function, the subagent gives the agent as much of that reward as the agent could have gotten for itself had it not originally spawned the subagent.
This modification of my approach came not because there is no surgery, but because the penalty is |Q(a) − Q(∅)| instead of |Q(a) − Q(destroy itself)|. Q_Ri is learned to be the answer to “How much utility could I attain if my utility function were surgically replaced with Ri?”, but it is only by accident that such a surgery might change the world’s future, because the agent didn’t refactor the interface away. If optimization pressure is put on this, it goes away.
If I’m missing the point too hard, feel free to command me to wait till the end of Reframing Impact so I don’t spend all my street cred keeping you talking :).
Well, in what I’m proposing, there isn’t even a different auxiliary reward function: R_AUP is just R(s,a) − |Q*_R(s,a) − Q*_R(s,∅)|. The agent would be penalizing shifts in its ability to accrue its own primary reward.
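To make the shape of that penalty concrete, here is a minimal sketch in Python. The `q_value` helper and the `NOOP` action are hypothetical stand-ins for the agent’s learned optimal Q-function under its primary reward and for the inaction/null action ∅; they are not names from the paper.

```python
# Minimal sketch of the penalized reward described above (assumptions noted
# in the surrounding text). `q_value(state, action)` stands in for the
# agent's learned optimal Q-function Q*_R under its primary reward R.

NOOP = "noop"  # stand-in for the inaction / null action ∅

def aup_reward(reward, q_value, state, action):
    """Primary reward minus the penalty for shifting the agent's ability
    to accrue that same reward, measured relative to doing nothing."""
    penalty = abs(q_value(state, action) - q_value(state, NOOP))
    return reward(state, action) - penalty
```

Note that the same Q-function appears in both terms of the penalty; no separate auxiliary reward functions or counterfactual utility surgeries are involved, which is the point being made above.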
One thing that might happen instead, though, is the agent builds a thing that checks whether it’s running its inaction policy (whether it’s calculating Q*_R(s,∅), basically). This is kinda weird, but my intuition is that we should be able to write an equation which does the right thing. We don’t have a value specification problem here; it feels more like the easy problem of wireheading, where you keep trying to patch the AI to not wirehead, and you’re fighting against a bad design choice. The fix is to evaluate the future consequences with your current utility function, instead of just maximizing sensory reward.
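As a toy illustration of that contrast, assuming hypothetical `predicted_sensor_reading` and `predict_outcome` models (neither is from the sequence): the difference is whether plans are scored by the reward signal they are predicted to produce, or by applying the current utility function to the predicted outcome.

```python
# Toy contrast between a wireheading-prone objective and the "evaluate
# consequences with your current utility function" fix described above.
# `predicted_sensor_reading` and `predict_outcome` are hypothetical models
# the agent holds of its reward channel and of the world, respectively.

def wirehead_score(plan, predicted_sensor_reading):
    # Maximizing the sensory reward signal itself: tampering with the
    # reward channel looks exactly as good as genuinely doing well.
    return predicted_sensor_reading(plan)

def consequentialist_score(plan, predict_outcome, current_utility):
    # Evaluating predicted future consequences with the current utility
    # function: reward-channel tampering only helps if it actually makes
    # the predicted outcome better by that utility function's lights.
    return current_utility(predict_outcome(plan))
```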
We’re trying to measure how well the agent can achieve its own fully formally specified goal. More on this later in the sequence.