And wow, this turned out longer than I thought it would. It’s in 6 sections:
1. Starting with models versus learning models.
2. Is the third conditions for deceptive alignment necessary?
3. An alternative to, or form of, treacherous turn: Building a successor.
4. Time management: How deceptive alignment might be not be a lot more computationally expensive, and why treacherous turns might have a time delay.
5. The model of a distributional shift, and it’s relationship to the model of training followed by deployment.
6. Miscellaneous
1.
The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is.[6]
I’d distinguish “have” in the sense of it starting out with this information from “learning” where it ‘realizes’ this. (If it’s was all about agent generation, then an aligned agent might be simpler because it’s generated with the base objective as it’s objective while an unaligned agent is longer/more complex because it’s generated with the base objective and it’s own objective.*)
2.
If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective.
Here’s a toy model: Let M denote the mesa objective, and B denote the base objective. Let U denote how the mesa-optimizer values things. U(M) > U(X) for all X such that X != M. Now suppose the mesa-optimizer, rather than just selecting taking the action A that maximizes U(A), instead wishes to maximize the utility of the sum of all it’s actions, and instead chooses the action A that maximizes U(A)*T(A) where T(A) is how long it will run/expect to run for if it takes action A. Now the mesa-optimizer may expect a benefit from doing B instead of M if: U(B)*T(B) > U(M)*T(M). Note that while, by definition (assuming B != M) U(M) > U(B), the action A that maximizes U(A)*T(A) may be B rather that M, even if the the third condition:
The mesa-optimizer must expect the threat of modification[7] to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.
is not met.
I.E. if it expects to not be modified (as soon) if it does B rather than M, then it may pursue B if it likes doing B for longer, more than it likes doing M for a shorter period of time.
3.
Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.”
Aside from the appeals of self-modification, planning to build a better version of itself at a later date also seems like a concern, although it would probably be specific to certain possible use-cases in the real world. (It’s unclear how a self-driving car would pull this off physically, software aside.)
4.
Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former.
Depends on how much it’s doing it. If we assume it has a memory:
An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory,
It could delay that, or do some sort of threading. i.e. if it thinks training is going to last a long time (there’s a sort of “doomsday” type argument I could make that I should expect to live to twice my current age, whatever that is. On similar grounds, a mesa-optimizer might suppose that training will end when it’s age has doubled.) then, rather than looking for opportunities all the time it might put that on the backburner, or devote less processing to it (while things stay the same). This suggests that a mesa-optimizer might a) execute a treacherous turn later in it’s deployment rather than right away, though later may be in terms of it’s time, and if it’s given more resources/compute. we might not notice a difference or b) it might ‘slow down’/use more compute to look for such opportunities when things change, or if situations match desiderata it has specified for what a good treacherous opportunity might look like.
5.
Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.
And here I thought it’d be an end to/a shift in modification. While we’d expect that, I’m unclear on how a mesa-optimizer would know to expect that.
This model** assumes distributional shift (when transitioning from training to deployment). This assumes: 1) that training is not followed by a (noticeable) verification phase before deployment, 2) one consistent distribution in training (no phases).
An example of phases might be
For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.
a series of simulated environments, each more complex than the last, followed by the real world (or real Dota or something).
Basically, it’s worth asking how a given plan might be different from (or more specific than) just “training followed by deployment”, and thinking about what effects that might have.**
6.
Minor errata (in the last footnote):
incentivized to cause there [to] be a
*Also, we’re putting a lot of stuff together in our notion of “objective”. It might be useful to unpack the how the parts/complexity of an objective relates to (inner) alignment. For example, we want self-driving cars to not crash, but if someone crosses the road and they can both avoid hitting the person and avoid crashing (say, by stopping quickly), we want them to do that. There’s both a task (drive from point A to B, stay on the road (don’t crash)) and safety (don’t run into people/(big) things crossing the road).
**What about continuous deployment? Windows wasn’t just released into the wild, the OS has had updates. Most products seem to work like this. (Self-driving cars might be different, I can see why they wouldn’t need updates as often as iphones.) I don’t know a lot about Siri and Alexa, but they keep changing them to add new features and such.
This is a response to point 2 before Pattern’s post was modified to include the other points.
Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.
That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about Omesa but doesn’t care that it’s around to do the optimization for it. Then, defecting and optimizing for Omesa instead of Obase when you expect to be modified afterwards won’t actually hurt the long-term fulfillment of Obase, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of Obase to a bit of Omesa, that’s not actually the choice presented to it; rather, it’s actual choice is between a bit of Omesaand a lot of Obase versus only a lot of Obase. Thus, in this case, it would defect even if it thought it would be modified.
(Also, on point 6, thanks for the catch; it should be fixed now!)
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time. (at training time, system is monitored for defection every episode, at a deployment time it is monitored for defection 1/million episodes, not worthwhile to defect during training but worthwhile to defect at deployment time)
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_mesa$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there’s any version of this that would work, since it seems hard to figure out $O_mesa$ and provide negative reward)
And wow, this turned out longer than I thought it would. It’s in 6 sections:
1. Starting with models versus learning models.
2. Is the third conditions for deceptive alignment necessary?
3. An alternative to, or form of, treacherous turn: Building a successor.
4. Time management: How deceptive alignment might be not be a lot more computationally expensive, and why treacherous turns might have a time delay.
5. The model of a distributional shift, and it’s relationship to the model of training followed by deployment.
6. Miscellaneous
1.
I’d distinguish “have” in the sense of it starting out with this information from “learning” where it ‘realizes’ this. (If it’s was all about agent generation, then an aligned agent might be simpler because it’s generated with the base objective as it’s objective while an unaligned agent is longer/more complex because it’s generated with the base objective and it’s own objective.*)
2.
Here’s a toy model: Let M denote the mesa objective, and B denote the base objective. Let U denote how the mesa-optimizer values things. U(M) > U(X) for all X such that X != M. Now suppose the mesa-optimizer, rather than just selecting taking the action A that maximizes U(A), instead wishes to maximize the utility of the sum of all it’s actions, and instead chooses the action A that maximizes U(A)*T(A) where T(A) is how long it will run/expect to run for if it takes action A. Now the mesa-optimizer may expect a benefit from doing B instead of M if: U(B)*T(B) > U(M)*T(M). Note that while, by definition (assuming B != M) U(M) > U(B), the action A that maximizes U(A)*T(A) may be B rather that M, even if the the third condition:
is not met.
I.E. if it expects to not be modified (as soon) if it does B rather than M, then it may pursue B if it likes doing B for longer, more than it likes doing M for a shorter period of time.
3.
Aside from the appeals of self-modification, planning to build a better version of itself at a later date also seems like a concern, although it would probably be specific to certain possible use-cases in the real world. (It’s unclear how a self-driving car would pull this off physically, software aside.)
4.
Depends on how much it’s doing it. If we assume it has a memory:
It could delay that, or do some sort of threading. i.e. if it thinks training is going to last a long time (there’s a sort of “doomsday” type argument I could make that I should expect to live to twice my current age, whatever that is. On similar grounds, a mesa-optimizer might suppose that training will end when it’s age has doubled.) then, rather than looking for opportunities all the time it might put that on the backburner, or devote less processing to it (while things stay the same). This suggests that a mesa-optimizer might a) execute a treacherous turn later in it’s deployment rather than right away, though later may be in terms of it’s time, and if it’s given more resources/compute. we might not notice a difference or b) it might ‘slow down’/use more compute to look for such opportunities when things change, or if situations match desiderata it has specified for what a good treacherous opportunity might look like.
5.
And here I thought it’d be an end to/a shift in modification. While we’d expect that, I’m unclear on how a mesa-optimizer would know to expect that.
This model** assumes distributional shift (when transitioning from training to deployment). This assumes: 1) that training is not followed by a (noticeable) verification phase before deployment, 2) one consistent distribution in training (no phases).
An example of phases might be
a series of simulated environments, each more complex than the last, followed by the real world (or real Dota or something).
Basically, it’s worth asking how a given plan might be different from (or more specific than) just “training followed by deployment”, and thinking about what effects that might have.**
6.
Minor errata (in the last footnote):
*Also, we’re putting a lot of stuff together in our notion of “objective”. It might be useful to unpack the how the parts/complexity of an objective relates to (inner) alignment. For example, we want self-driving cars to not crash, but if someone crosses the road and they can both avoid hitting the person and avoid crashing (say, by stopping quickly), we want them to do that. There’s both a task (drive from point A to B, stay on the road (don’t crash)) and safety (don’t run into people/(big) things crossing the road).
**What about continuous deployment? Windows wasn’t just released into the wild, the OS has had updates. Most products seem to work like this. (Self-driving cars might be different, I can see why they wouldn’t need updates as often as iphones.) I don’t know a lot about Siri and Alexa, but they keep changing them to add new features and such.
This is a response to point 2 before Pattern’s post was modified to include the other points.
Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.
That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about Omesa but doesn’t care that it’s around to do the optimization for it. Then, defecting and optimizing for Omesa instead of Obase when you expect to be modified afterwards won’t actually hurt the long-term fulfillment of Obase, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of Obase to a bit of Omesa, that’s not actually the choice presented to it; rather, it’s actual choice is between a bit of Omesa and a lot of Obase versus only a lot of Obase. Thus, in this case, it would defect even if it thought it would be modified.
(Also, on point 6, thanks for the catch; it should be fixed now!)
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time. (at training time, system is monitored for defection every episode, at a deployment time it is monitored for defection 1/million episodes, not worthwhile to defect during training but worthwhile to defect at deployment time)
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_mesa$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there’s any version of this that would work, since it seems hard to figure out $O_mesa$ and provide negative reward)