I’m not claiming things described as “trust” usually work like this, only that there exists a strategy like this. Maybe it’s better described as “presenting an argument to run this particular code.”
> how exactly does taking over the world not increase the Q-values
The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP itself does not take over the world; AUP ends up living in a brave new world run by the new agent that has been spun up. This new agent will have been designed so that, once it is operational: 1) AUP enters world-states which have very high reward, and 2) AUP enters world-states such that AUP’s Q-values for the various other reward functions remain comparable to their prior values.
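To make the two conditions concrete: assuming the post’s penalty term has roughly the form of a sum of attainable-utility changes relative to inaction (my reading of it; I’m omitting details like the scaling term), the new agent would be engineered so that

$$Q_{R_{\text{AUP}}}(s_t, a_t)\ \text{is very high}, \qquad \sum_i \bigl|\, Q_{R_i}(s_t, a_t) - Q_{R_i}(s_t, \varnothing) \,\bigr| \approx 0 \ \text{at every step},$$

where $R_{\text{AUP}}$ is AUP’s primary reward, the $R_i$ are the auxiliary reward functions, and $\varnothing$ is the no-op action (all notation as I understand the post). In other words, the penalty term never fires even though the world has been taken over on AUP’s behalf.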
> the agent now has a much more stable existence
If you’re claiming that the other Q-values can’t help but be higher in this arrangement, New Agent can tune this by penalizing other reward functions just enough to balance out the expectation.
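To spell out the balancing move (a sketch under the same assumptions as above, writing $Q_{R_i}^{\text{before}}$ and $Q_{R_i}^{\text{after}}$ for AUP’s attainable utility for $R_i$ before and after the takeover): if the takeover would by itself raise some auxiliary Q-value by $\delta_i := Q_{R_i}^{\text{after}} - Q_{R_i}^{\text{before}} > 0$, the new agent can also commit to interfering with the attainment of $R_i$ by an expected $\delta_i$, so that $\mathbb{E}\bigl[Q_{R_i}^{\text{after}}\bigr] \approx Q_{R_i}^{\text{before}}$ for each $i$ and the summed penalty stays near zero.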
And let’s set aside intent verification for just a moment, to see whether AUP accomplishes anything on its own, especially because it seems to me that intent verification by itself suffices for safe AGI, in which case it’s not saying much to say that AUP + intent verification would be safe.
(The post defines the mathematical criterion used for what I call intent verification; it’s not a black box that I’m appealing to.)
Oh sorry.