This is a response to point 2 before Pattern’s post was modified to include the other points.
Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.
That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about Omesa but doesn’t care that it’s around to do the optimization for it. Then, defecting and optimizing for Omesa instead of Obase when you expect to be modified afterwards won’t actually hurt the long-term fulfillment of Obase, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of Obase to a bit of Omesa, that’s not actually the choice presented to it; rather, it’s actual choice is between a bit of Omesaand a lot of Obase versus only a lot of Obase. Thus, in this case, it would defect even if it thought it would be modified.
(Also, on point 6, thanks for the catch; it should be fixed now!)
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time. (at training time, system is monitored for defection every episode, at a deployment time it is monitored for defection 1/million episodes, not worthwhile to defect during training but worthwhile to defect at deployment time)
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_mesa$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there’s any version of this that would work, since it seems hard to figure out $O_mesa$ and provide negative reward)
This is a response to point 2 before Pattern’s post was modified to include the other points.
Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.
That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about Omesa but doesn’t care that it’s around to do the optimization for it. Then, defecting and optimizing for Omesa instead of Obase when you expect to be modified afterwards won’t actually hurt the long-term fulfillment of Obase, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of Obase to a bit of Omesa, that’s not actually the choice presented to it; rather, it’s actual choice is between a bit of Omesa and a lot of Obase versus only a lot of Obase. Thus, in this case, it would defect even if it thought it would be modified.
(Also, on point 6, thanks for the catch; it should be fixed now!)
I think a cleaner way of stating condition 3 might be “there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating”.
This could be met if the threat of modification never goes away, but maybe decreases over time. (at training time, system is monitored for defection every episode, at a deployment time it is monitored for defection 1/million episodes, not worthwhile to defect during training but worthwhile to defect at deployment time)
This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_mesa$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there’s any version of this that would work, since it seems hard to figure out $O_mesa$ and provide negative reward)