These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere).
1)
If an action is useful w.r.t. the actual reward but useless to all other rewards (as useless as taking ∅), that is the ideal according to R_AUP; i.e., if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, then use the power to run some code that helps you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose instead that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?
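To check that I have the right object in mind, here is the penalty I’m picturing, as a minimal Python sketch; the names `q_aux`, `noop`, `lam`, and `scale` are my own stand-ins, not the paper’s notation.

```python
# Minimal sketch of (my reading of) the paper's R_AUP: the penalty only looks at
# how the action shifts attainable utility for the *auxiliary* rewards, never at
# how useful the action is for the actual reward.

def aup_penalty(s, a, q_aux, noop):
    """Sum of |Q_Ri(s, a) - Q_Ri(s, noop)| over the auxiliary reward functions."""
    return sum(abs(q(s, a) - q(s, noop)) for q in q_aux)

def r_aup(s, a, r_actual, q_aux, noop, lam=1.0, scale=1.0):
    # An action that is extremely useful for r_actual but exactly as useless as the
    # no-op for every auxiliary reward incurs zero penalty -- the "ideal" case above.
    return r_actual(s, a) - (lam / scale) * aup_penalty(s, a, q_aux, noop)
```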
2)
If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward into the agent submitting 3 actions and receiving the next three observations (with the rewards averaged). This is just a new action space, cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous. Since this problem appears in the math but not in the intuition, it makes me wary of the reliability of the intuition.
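Toy numbers for the worry, tracking a single auxiliary reward’s attainable utility (“AU”) directly rather than its Q-values, with made-up values, purely for illustration:

```python
# AU for one auxiliary reward along a 3-step composite action, versus the
# do-nothing counterfactual. Values are illustrative only.
au_after = [2.0, 2.0, 10.0]    # sub-action 1 crashes AU, sub-action 3 restores it
au_noop  = [10.0, 10.0, 10.0]  # AU if the agent had done nothing instead

# Penalizing every primitive step registers both dangerous swings:
stepwise = sum(abs(x - y) for x, y in zip(au_after, au_noop))   # 16.0

# Penalizing only the composite "submit 3 actions at once" step compares
# endpoints, so the crash and the recovery cancel:
composite = abs(au_after[-1] - au_noop[-1])                     # 0.0
print(stepwise, composite)
```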
3)
Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact it does) by testing them out.
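To make this concrete, here is a bog-standard tabular ε-greedy Q-learning loop with the penalty folded into the reward signal (the environment interface and names are mine, not the paper’s); nothing in it steers exploration away from a catastrophic action before that action has been sampled at least once.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular epsilon-greedy Q-learning on an already-penalized reward."""
    Q = defaultdict(float)  # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # Exploration can pick the catastrophic action; its large penalty
            # only shows up in Q *after* the agent has tried it.
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2, r_penalized, done = env.step(a)  # reward already includes the AUP penalty
            best_next = max((Q[(s2, x)] for x in env.actions(s2)), default=0.0)
            Q[(s, a)] += alpha * (r_penalized + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```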
4)
Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is a_unit? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe a_unit is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having a_unit be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.
1) Why wouldn’t gaining trust be useful for other rewards? I don’t think AUP would be motivated to search for ways to lock in future power, because the notion of gaining power seems to be deeply intertwined with the notion of heavy maximization. It might attempt to Goodhart our particular way of measuring impact; the fact that we are actually measuring goal achievement ability from a particular vantage point and are using a particular counterfactual structure means that there could be cheeky ways of tricking that structure. This is why intent verification is a thing in this longer post. However, I think the attainable utility measure itself is correct.
2) This doesn’t appear in the paper, but I do talk about it in the post, and I think it’s great that you raise this point. Attainable utility preservation says that impact is measured along the arc of your actions, taking into account the deviation of the Q-functions at each step compared to doing nothing. If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate. In other words, since impact is measured along the arc of your actions, if your differential elements are chunky, you’re not going to get a very good approximation. I think there are good reasons to suspect that in the real world, the way we think about actions is granular enough to avoid this dangerous phenomenon.
3) This is true. My stance here is that this is basically a capabilities problem/a safe exploration issue, which is disjoint from impact measurement.
4) This is why we want to slowly increment N. This should work whether it’s a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
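Roughly, the protocol I have in mind looks like the sketch below (my own variable names, and the exact scaling in the post is more involved than this): impact is measured in multiples of Penalty(a_unit), N sets how many of those multiples the agent can afford, and we raise N only until the agent starts doing something useful.

```python
def scaled_penalty(penalty, penalty_a_unit, N):
    # Impact expressed relative to the mild reference action a_unit; a larger N
    # means more tolerance before the penalty dominates the reward.
    return penalty / (N * penalty_a_unit)

def increment_N(run_agent, useful_enough, N=1, step=1, max_N=10**6):
    # Start strict and loosen slowly; stop at the first N that yields useful behavior.
    while N <= max_N:
        behavior = run_agent(N)
        if useful_enough(behavior):
            return N
        N += step
    return None
```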
2) … If you can imagine making your actions more and more granular (at least, up to a reasonably fine level), it seems like there should be a well-defined limit that the coarser representations approximate.
Yeah, I agree there’s an easy way to avoid this problem. My main point in bringing it up was that there must be a gap in your justification that AUP is safe, if that justification does not depend on “and the action space must be sufficiently small.” Since AUP definitely isn’t safe for sufficiently large action spaces, a justification which purports to show that AUP is safe regardless of the size of the action space (or at least the one presented in the paper) must have at least one flaw.
You must have read the first version of BoMAI, since you quoted it here :) (how did you find it, by the way?). I’d level the same criticism against that draft. I believed I had a solid argument that it was safe, but then I discovered ν†, which proved there was an error somewhere in my reasoning. So I started by patching the error, but I was still haunted by how certain I had felt that it was safe without the patch. I decided I needed to explicitly figure out every assumption involved, and in the process I discovered ones that I hadn’t realized I was making. Likewise, this patch definitely does seem sufficient to avoid the action-granularity problem, but I think the problem shows that a more rigorous argument is needed.
Where did I purport that it was safe for AGI, in the paper or in the post? I specifically disclaim making that point yet, although I’m pretty sure we can get there.
There is a deeper explanation which I didn’t have space to fit in the paper, and which I didn’t have the foresight to focus on when I wrote this post. I agree that it calls out for more investigation, and (this feels like a refrain for me at this point) I’ll be answering this call in a more in-depth sequence on what is actually going on at a deep level with AUP, and how fundamental the phenomenon is to agent-environment interaction.
I don’t remember how I found the first version; I think it was through a Google search somehow?
Okay fair. I just mean to make some requests for the next version of the argument.
1) Why wouldn’t gaining trust be useful for other rewards?
Because the agent has already committed to what the trust will be “used for.” It’s not as easy to construct the story of an agent attempting to gain trust to be allowed to do one particular thing as it is to construct the story of an agent attempting to gain trust to be allowed to do anything, but the latter is unappealing to AUP, and the former is perfectly appealing. So all the optimization power will go towards convincing the operator to run this particular code (which takes over the world and maximizes the reward). If done in the right way, AUP won’t have made arguments which would render it easier to then convince the operator to run different code; running different code would be necessary to maximize a different reward function, so in this scenario, the Q-values for other random reward functions won’t have increased wildly in the way that the Q-value for the real reward did.
I don’t think I agree, but even if trust did work like this, how exactly does taking over the world not increase the Q-values? Even if the code doesn’t supply reward for other reward functions, the agent now has a much more stable existence. If you’re saying that the stable existence applies only to agents maximizing the AUP reward function, then this is what intent verification is for.
Notice something interesting here: the thing which would be Goodharted upon without intent verification isn’t the penalty itself per se, but rather the structural properties of the agent design (the counterfactuals, the fact that it’s a specific agent with I/O channels, and so on). More on this later.
I’m not claiming things described as “trust” usually work like this, only that there exists a strategy like this. Maybe it’s better described as “presenting an argument to run this particular code.”
how exactly does taking over the world not increase the Q-values
The code that AUP convinces the operator to run is code for an agent which takes over the world. AUP does not take over the world. AUP is living in a brave new world run by a new agent that has been spun up. This new agent will have been designed so that, when operational: 1) AUP enters world-states which have very high reward, and 2) AUP enters world-states such that AUP’s Q-values for various other reward functions remain comparable to their prior values.
the agent now has a much more stable existence
If you’re claiming that the other Q-values can’t help but be higher in this arrangement, the new agent can tune this by penalizing other reward functions just enough to balance out the expectation.
And let’s forget about intent verification for just a moment to see if AUP accomplishes anything on its own, especially because it seems to me that intent verification suffices for safe AGI, in which case it’s not saying much to say that AUP + intent verification would make it safe.
(The post defines the mathematical criterion used for what I call intent verification; it’s not a black box that I’m appealing to.)
Oh sorry.
4) This is why we want to slowly increment N. This should work whether it’s a human policy or a meaningless string of text. The reason for this is that even if the meaningless string is very low impact, eventually N gets large enough to let the agent do useful things; conversely, if the human policy is more aggressive, we stop incrementing sooner and avoid giving too much leeway.
Let’s say for concreteness that it’s a human policy that is used for a_unit, if you think it works either way. I think that most human actions are moderately low impact, and some are extremely high impact. If the impact of a_unit leaps to very large values infinitely often, then infinitely often there will effectively be no impact regularization, no matter what N is. No setting for N fixes this: if N were small enough to preclude even actions that are less impactful than a_unit, then the agent can’t ever act usefully, and if N permits actions as impactful as a_unit, then when a_unit has very large impact (which I contend happens infinitely often for any assignment of a_unit that permits any useful action ever), dangerously high-impact actions will be allowed.
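To put rough numbers on this (purely illustrative, and assuming the tolerance at a given step scales with a_unit’s impact at that step):

```python
# Suppose a_unit is a human policy whose per-step impact is usually mild but
# occasionally huge. Any N tuned to the typical case licenses enormous impact
# on the spike steps. All values are made up.
typical_impact_a_unit = 1.0
spike_impact_a_unit = 1000.0   # the rare, accidentally high-impact human action
N = 5                          # tuned so ~5x a typical human action is allowed

print(N * typical_impact_a_unit)  # 5.0    -- the intended tolerance
print(N * spike_impact_a_unit)    # 5000.0 -- the effective tolerance on spike steps
```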
I think there’s some variance, but not as much as you have in mind. Even if there were the occasional very large value, however, this isn’t how N-incrementation works (in the post; if you’re thinking of the paper, then yes, the version I presented there doesn’t bound lifetime returns and therefore doesn’t get the same desirable properties as the version in the post). If you’ll forgive my postponing this discussion, I’d be interested in hearing your thoughts after I post a more in-depth exploration of the phenomenon.
Sure thing.