1) Why wouldn’t gaining trust be useful for other rewards?
Because the agent has already committed to what the trust will be “used for.” It isn’t as easy to construct the story of an agent attempting to gain trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, while the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world and maximizes the reward). If done in the right way, AUP won’t have made arguments which would make it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won’t have increased wildly in the way that the Q-value for the real reward did.
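To make this concrete, the penalty I have in mind is roughly the one from the post (a sketch; I’m eliding the exact scaling term and the choice of auxiliary reward set $\mathcal{R}$):

$$\text{Penalty}(s, a) \;\propto\; \sum_{R_i \in \mathcal{R}} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|$$

The claim is that the “argue for this one particular program” action moves $Q_{R_{\text{true}}}$ a lot while leaving each auxiliary $Q_{R_i}$ roughly where the no-op $\varnothing$ would leave it, so the sum stays small.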
I don’t think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn’t supply reward for other reward functions, the agent now has a much more stable existence. If you’re saying that the stable existence only applies for agents maximizing the AUP reward function, then this is what intent verification is for.
Notice something interesting here: the thing which would be Goodharted on without intent verification isn’t the penalty itself per se, but rather the structural properties of the agent design – the counterfactuals, the fact that it’s a specific agent with I/O channels, and so on. More on this later.
I’m not claiming things described as “trust” usually work like this, only that there exists a strategy like this. Maybe it’s better described as “presenting an argument to run this particular code.”
how exactly does taking over the world not increase the Q-values
The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that, when operational: 1) AUP enters world-states which have very high reward, and 2) AUP enters world-states such that AUP’s Q-values for various other reward functions remain comparable to their prior values.
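Stated as constraints (my notation, not the post’s) on the trajectory of states $s_t$ that AUP occupies once the new agent is running:

$$\mathbb{E}\!\left[\sum_t \gamma^t R_{\text{true}}(s_t)\right] \text{ is large}, \qquad Q_{R_i}(s_t) \approx Q_{R_i}(s_0) \;\text{ for every auxiliary } R_i.$$

The first condition is just “AUP gets its reward”; the second is what keeps the penalty term from firing.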
the agent now has a much more stable existence
If you’re claiming that the other Q-values can’t help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation.
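Roughly (again my notation, not anything from the post): for each auxiliary $R_i$, New Agent mixes in just enough targeted frustration of $R_i$ to offset the boost from AUP’s newly stable existence,

$$\underbrace{\Delta Q_{R_i}^{\text{stability}}}_{>\,0} \;+\; \underbrace{\Delta Q_{R_i}^{\text{penalization}}}_{<\,0} \;\approx\; 0,$$

so each auxiliary Q-value ends up back near its prior value in expectation.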
And let’s forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it’s not saying much to say that AUP + intent verification would make it safe.
(The post defines the mathematical criterion used for what I call intent verification; it’s not a black box that I’m appealing to.)
Oh sorry.