Model of the threat and interventions for mesa-optimization
Consider a chain model
Base optimizer
-> Mesa optimizer
Produced through optimization of base objective in training environment
-> Misalignment (base objective != mesa objective)
Different kinds of misalignment
Proxy: mesa objective is a proxy for the base objective in the training environment
Side-effect: optimizing mesa objective happens to optimize base objective
Instrumental: optimizing base objective happens to optimize mesa objective
Approximate: objectives differ due to an approximation error, caused by limits of representation in mesa-optimizer’s model
Suboptimal: optimizing mesa objective happens to optimize base objective due to a flaw in the mesa-optimizer’s optimization process
What makes this particularly concerning relative to the “standard” alignment problem? The misalignment may be particularly difficult to detect (see next step). The mesa-optimizer has been “screened” by the base optimizer during training, so it has to align with the base objective to some extent, at least in the training environment. This means it is either aligned or deceptively misaligned. The base probability of a misaligned mesa-optimizer is lower than that of a misaligned system in general, but conditional on it happening, it could carry greater risk.
-> Deployment
Not enough evidence of risk based on training and testing in controlled environment
Mesa-optimizer is deceptive in order to increase its chance of deployment
-> Distribution shift
Deployment environment differs from training environment or controlled testing environment
-> Unanticipated behaviour
The distribution shift causes the misalignment to manifest in different behaviour
There is a “standard” problem of robustness to distribution shift that applies to any machine learning model, not just mesa-optimizers. What makes a mesa-optimizer particularly concerning? It’s that optimization will tend to make the unanticipated behaviour more adversarial and influence-seeking. It can freely pursue an objective rather than merely perform a specified task.
-> Catastrophe
The unanticipated behaviour is so bad that it leads to a permanent, huge loss of value.
Ways this doesn’t happen:
The behaviour is counter-productive: perhaps self-destructive, or not directed or consistent enough to lead to anything bad (depends on how sensitive the domain is to random error)
Intervention from other actors (depends on whether the mesa-optimizer has a decisive strategic advantage)
Interventions
Base → Mesa: prevent mesa-optimizer arising
Alternative methods of subsystem generation beyond search
Mesa → Misalignment: prevent mesa-optimizer from being misaligned
Standard alignment methods
Specific methods for the base optimizer to align the mesa-optimizer
Misalignment → Deployment: don’t deploy misaligned mesa-optimizer
Transparency
Interpretability
Deployment → Distribution shift: ensure deployment environment does not differ significantly from training environment or controlled testing environment
The converse (making the training environment resemble the deployment environment) is probably more tractable, in which case the intervention comes before Base → Mesa
Distribution shift → Unanticipated behaviour: prevent unanticipated behaviour by having the system recognize that it is outside its training distribution, then defer or fail gracefully
Out-of-distribution detection
Well-calibrated uncertainty estimation
Corrigibility
Unanticipated behaviour → Catastrophe: prevent catastrophe
Internal
Impact measurement
External
Shut down
Stop or steer the behaviour in various ways, depending on the nature of the behaviour
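
A rough illustration of the "Distribution shift → Unanticipated behaviour" intervention (out-of-distribution detection followed by deferring) might look like the sketch below. Everything in it is an assumption made for illustration: the one-dimensional inputs, the z-score detector, the threshold of 3.0, and the names fit_detector, is_out_of_distribution and act_or_defer are all hypothetical. It is a toy, not a claim about how a real system would detect shift, and it does nothing against a deceptive mesa-optimizer.

```python
import statistics


def fit_detector(train_inputs):
    """Summarize the training inputs by their mean and standard deviation."""
    mean = statistics.fmean(train_inputs)
    std = statistics.pstdev(train_inputs)
    return mean, std


def is_out_of_distribution(x, detector, z_threshold=3.0):
    """Flag x if it lies more than z_threshold standard deviations from the training mean."""
    mean, std = detector
    if std == 0:
        # Degenerate case: all training inputs were identical.
        return x != mean
    return abs(x - mean) / std > z_threshold


def act_or_defer(x, policy, detector, z_threshold=3.0):
    """Run the policy only on inputs that look in-distribution; otherwise defer."""
    if is_out_of_distribution(x, detector, z_threshold):
        return "defer"
    return policy(x)


# Hypothetical training inputs and a stand-in policy.
train_inputs = [0.9, 1.0, 1.1, 1.2, 0.8, 1.0]
detector = fit_detector(train_inputs)


def policy(x):
    return f"act({x})"


print(act_or_defer(1.05, policy, detector))  # close to the training inputs
print(act_or_defer(5.0, policy, detector))   # far from the training inputs
```

The design choice here is that the detector wraps the policy from the outside, so the decision to defer does not depend on the policy's own judgement of the input.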