Model of the threat and interventions for mesa-optimization
Consider a chain model, in which each step leads to the next:
Base optimizer
-> Mesa optimizer
Produced through optimization of base objective in training environment
-> Misalignment (base objective != mesa objective)
Different kinds of misalignment
Proxy: mesa objective is a proxy for the base objective in the training environment
Side-effect: optimizing mesa objective happens to optimize base objective
Instrumental: optimizing base objective happens to optimize mesa objective
Approximate: objectives differ due to an approximation error, caused by limits of representation in mesa-optimizer’s model
Suboptimal: optimizing mesa objective happens to optimize base objective only because of a flaw or limitation in the mesa-optimizer’s optimization process
What makes this particularly concerning relative to the “standard” alignment problem? The misalignment may be particularly difficult to detect (see next step). The mesa-optimizer has been “screened” by the base optimizer during training, so it has to score well on the base objective to some extent, at least in the training environment. A mesa-optimizer that survives training is therefore either aligned or misaligned in a way that training did not expose, for example deceptively. The prior probability of a misaligned mesa-optimizer is lower than that of a misaligned system in general, but conditional on it occurring, it could carry greater risk.
-> Deployment
Training and testing in a controlled environment do not give enough evidence of risk to prevent deployment
The mesa-optimizer may be deceptive, to increase its chance of deployment
-> Distribution shift
Deployment environment differs from training environment or controlled testing environment
-> Unanticipated behaviour
The distribution shift causes the misalignment to manifest in different behaviour
There is a “standard” problem of robustness to distribution shift that applies to any machine learning model, not just mesa-optimizers. What makes a mesa-optimizer particularly concerning? Optimization will tend to make the unanticipated behaviour more adversarial and influence-seeking, because a mesa-optimizer can freely pursue an objective rather than merely perform a specified task.
-> Catastrophe
The unanticipated behaviour is so bad that it leads to a permanent, huge loss of value.
Ways this doesn’t happen:
The behaviour is counter-productive: perhaps self-destructive, or not directed or consistent enough to lead to anything bad (depends on how sensitive the domain is to random error)
Intervention from other actors (depends on whether the system has a decisive strategic advantage)
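One way to read the chain above is as a product of conditional probabilities: catastrophe requires every link to hold, so weakening any single link lowers the overall risk. The sketch below is only an illustration of that reading. The probabilities are invented placeholders, not estimates, and the names `CHAIN`, `chain_probability` and `with_intervention` are hypothetical.

```python
from math import prod

# Placeholder conditional probabilities for each link of the chain.
# ASSUMPTION: these numbers are invented for illustration; they are not estimates.
CHAIN = {
    "base -> mesa": 0.5,
    "mesa -> misalignment": 0.5,
    "misalignment -> deployment": 0.5,
    "deployment -> distribution shift": 0.5,
    "distribution shift -> unanticipated behaviour": 0.5,
    "unanticipated behaviour -> catastrophe": 0.5,
}


def chain_probability(links):
    """Probability that every link holds, each conditional on the previous one."""
    return prod(links.values())


def with_intervention(links, link, factor):
    """Return a copy of the chain with one link's probability scaled by `factor`."""
    updated = dict(links)
    updated[link] = links[link] * factor
    return updated


baseline = chain_probability(CHAIN)
reduced = chain_probability(
    with_intervention(CHAIN, "misalignment -> deployment", 0.1)
)
print(baseline, reduced)
```

Under this reading, an intervention that cuts one link by a factor of ten cuts the end-to-end probability by the same factor, whichever link it targets. That is why the interventions are organised per link.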
Interventions at each step of the chain:
Base → Mesa: prevent mesa-optimizer arising
Alternative methods of subsystem generation beyond search
Mesa → Misalignment: prevent mesa-optimizer from being misaligned
Standard alignment methods
Specific methods for base optimizer to align mesa-optimizer
Misalignment → Deployment: don’t deploy misaligned mesa-optimizer
Deployment → Distribution shift: ensure deployment environment does not differ significantly from training environment or controlled testing environment
The converse is probably more tractable: make the training environment resemble the deployment environment, in which case the intervention comes before Base → Mesa
Distribution shift → Unanticipated behaviour: prevent unanticipated behaviour by having the system recognise that a distribution shift has occurred, then defer or fail gracefully
Out-of-distribution detection
Well-calibrated uncertainty estimation
Unanticipated behaviour → Catastrophe: prevent catastrophe
Impact measurement
Shut down
Stop or steer the behaviour in various ways, depending on the nature of the behaviour
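As a rough illustration of the "Distribution shift → Unanticipated behaviour" intervention, the sketch below pairs a very simple out-of-distribution check with deferral. It assumes a single scalar input feature and uses a z-score threshold as the detector. This is a stand-in chosen for brevity, not a method taken from the notes, and `fit_reference`, `is_out_of_distribution` and `act_or_defer` are hypothetical names.

```python
from statistics import mean, stdev


def fit_reference(train_values):
    """Summarise the training distribution by its mean and sample standard deviation."""
    return mean(train_values), stdev(train_values)


def is_out_of_distribution(x, reference, z_threshold=3.0):
    """Flag inputs more than `z_threshold` standard deviations from the training mean."""
    mu, sigma = reference
    return abs(x - mu) > z_threshold * sigma


def act_or_defer(x, policy, reference, z_threshold=3.0):
    """Run the policy on familiar inputs; defer (fail gracefully) on unfamiliar ones."""
    if is_out_of_distribution(x, reference, z_threshold):
        return "defer"
    return policy(x)


# ASSUMPTION: toy training data and a trivial policy, for illustration only.
reference = fit_reference([1.0, 2.0, 3.0, 4.0, 5.0])
print(act_or_defer(3.5, lambda v: "act", reference))
print(act_or_defer(100.0, lambda v: "act", reference))
```

A real system would need a detector suited to high-dimensional inputs and well-calibrated uncertainty. The sketch only shows the control flow: detect the shift first, then defer instead of acting.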