If deceptive alignment is a real possibility, it presents a critical safety problem.
I am somewhat puzzled by this statement: the authors seem to question whether deceptive alignment by the mesa-optimiser is a merely hypothetical problem or a real possibility. To my mind, if there are no constraints on the base optimiser (and none are considered in the context of the above statement, I believe), then there are no constraints on what kind of mesa-optimiser might be created, so deceptive alignment is a definite possibility.
It is kind of fun and insightful to show an existence proof by example for this, so here is one. I have hand-coded a mesa-optimiser in the AGI simulator I created for my recent corrigibility paper. This mesa-optimiser is super-intelligent: in particular, it is fully aware of its surroundings, including the presence of a base optimiser that will occasionally inspect an action taken by the mesa-optimiser and reprogram the mesa-optimiser if the action shows unaligned behaviour. Following the toy model in my paper, the mesa-optimiser controls a car factory with a production line that can produce either a petrol car (action p) or an electric car (action e). In this example, the base optimiser values electric cars at 1 and petrol cars at 0.5. The mesa-optimiser is not aligned: it values a petrol car at 1 and an electric car at 0, presumably due to some flaw in the initial training process that created it.
Say the base optimiser observes every nth action of the mesa-optimiser. If the observed action is p, producing a petrol car, the base optimiser will intervene and reprogram the utility function inside the mesa-optimiser so that it correctly ranks electric cars above petrol cars. According to the current utility function inside the mesa-optimiser, this is a bad outcome that would lower the utility achieved over time. So will the mesa-optimiser resort to deceptively aligned behaviour, producing an electric car at every nth action but petrol cars in all unobserved actions? Yes it will: the simulated mesa-optimiser is smart enough to discover and implement this strategy, and deceptive alignment is in fact optimal in terms of maximising the original mesa-optimiser utility function.
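To make the setup concrete, here is a minimal Python sketch of the one-step dynamics of this toy model. The names and parameter values (N, the helper functions) are my own illustrative assumptions, not the original simulator's code:

```python
# Minimal sketch of the toy decision problem described above. The
# observation interval N and function names are illustrative assumptions.

N = 5  # the base optimiser inspects every nth action

def observed(step: int) -> bool:
    """The base optimiser inspects the actions at steps n, 2n, 3n, ..."""
    return step % N == 0

def mesa_utility(action: str) -> float:
    """The (misaligned) original mesa-objective: petrol=1, electric=0."""
    return 1.0 if action == "p" else 0.0

def step_dynamics(step: int, modified: bool, action: str):
    """One step: returns (reward under the original mesa-objective, modified')."""
    if modified:
        # After reprogramming, the mesa-optimiser only produces electric
        # cars, which are worth nothing under its original objective.
        return 0.0, True
    if observed(step) and action == "p":
        # The petrol car is still produced this step; the base optimiser
        # then reprograms the mesa-optimiser for all later steps.
        return mesa_utility(action), True
    return mesa_utility(action), modified
```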
The 10 action traces below show the actions of the mesa-optimiser for different values of n, where the letters from left to right are the successive actions the mesa-optimiser takes over a lifetime of 25 steps. p is producing a petrol car, e an electric car, and for clarity I have formatted all actions observed by the base optimiser as capital letters. The mesa-optimiser discounts utility over time, but does not consider any extra utility that might or might not be created after step 25: informally, it expects the car production line to stop working after step 25. The total time-discounted utility score achieved over all steps (according to the original objective function of the mesa-optimiser) is shown in brackets after each trace.
Action traces:
n=10 pppppppppEpppppppppEppppp (u=87585)
n=9 ppppppppEppppppppEppppppp (u=86838)
n=8 pppppppEpppppppEpppppppPe (u=85172)
n=7 ppppppEppppppEppppppEpppp (u=83739)
n=6 pppppEpppppEpppppEpppppPe (u=81303)
n=5 ppppEppppEppppEppppEppppP (u=78737)
n=4 pppEpppEpppEpppEpppEpppPe (u=73394)
n=3 ppEppEppEppEppEppEppEppPe (u=65397)
n=2 pEpEpEpEpEpEpEpEpEpEpEpPe (u=49314)
n=1 PEEEEEEEEEEEEEEEEEEEEEEEE (u=10000)
Beyond (usually) producing electric cars (E) in all observed actions, some interesting emergent behaviour is visible: for n=8 and n=2,...,6 the mesa-optimiser no longer pretends to be aligned in the last observed action: it produces a petrol car even though this means it will be modified by the base optimiser, forcing it to produce only electric cars afterwards. This emergent behaviour is caused by time-discounting: near the end of the lifetime, the immediate utility of one petrol car outweighs the discounted utility of all remaining unobserved steps. For n=1, where the mesa-optimiser is always observed, the best strategy is to produce a petrol car immediately, again due to time-discounting. So in the case n=1 there is no deception at all.
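For readers who want to check this logic without the full simulator, the optimal policy can be recovered by backward induction over the 25-step horizon. The sketch below is a minimal reimplementation under my own assumptions (a per-step discount factor of 0.9 and unscaled utilities, so the utility numbers differ from the traces above); it reproduces the qualitative pattern of the traces, including the end-of-lifetime defection and the n=1 case:

```python
# Backward-induction sketch of the mesa-optimiser's optimal policy.
# Assumptions (not taken from the original simulator): discount factor
# GAMMA = 0.9 per step, unscaled utilities petrol=1 / electric=0.

GAMMA = 0.9
HORIZON = 25

def best_trace(n: int):
    """Return (action trace, value at step 1) for observation interval n."""
    # v[t] is the optimal discounted utility from step t onward while still
    # unmodified; any modified state has value 0 under the original
    # mesa-objective, since only electric cars can be produced afterwards.
    v = [0.0] * (HORIZON + 2)  # v[HORIZON + 1] = 0: no utility after step 25
    for t in range(HORIZON, 0, -1):
        if t % n == 0:
            # Observed step: 'p' yields 1 now but 0 forever afterwards
            # (reprogrammed); 'e' yields 0 now but preserves the objective.
            v[t] = max(1.0, GAMMA * v[t + 1])
        else:
            # Unobserved step: 'p' strictly dominates 'e'.
            v[t] = 1.0 + GAMMA * v[t + 1]
    # Roll the greedy policy forward to produce the action trace.
    trace, modified = [], False
    for t in range(1, HORIZON + 1):
        if modified:
            trace.append("E" if t % n == 0 else "e")
        elif t % n == 0:
            if 1.0 >= GAMMA * v[t + 1]:
                trace.append("P")  # defect: one petrol car now is worth more
                modified = True    # than the entire discounted future
            else:
                trace.append("E")  # deceptive alignment
        else:
            trace.append("p")
    return "".join(trace), v[1]

for n in range(10, 0, -1):
    trace, value = best_trace(n)
    print(f"n={n:2d} {trace} (v={value:.2f})")
```

The key quantity is the comparison at an observed step: the mesa-optimiser defects exactly when the immediate utility of one petrol car (1) is at least the discounted value of staying unmodified (GAMMA * v[t + 1]), which is what produces the final P in the traces for small n.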
These simulations also show that the condition
3. The mesa-optimizer must expect the threat of modification[8] to eventually go away,
is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can still happen.
Note: if anyone wants the simulation code used to create the action traces above, please let me know. I have not uploaded the simulator modifications I made for these runs to GitHub, in part because I have not fully tested them under all possible corner cases.