Ben Cottier comments on Ben Cottier’s Shortform

Ben Cottier 12 May 2020 11:03 UTC
Model of the threat and interventions for mesa-optimization
- Consider a chain model
  - Base optimizer
  - -> Mesa optimizer
    Produced through optimization of base objective in training environment
  - -> Misalignment (base objective != mesa objective)
    Different kinds of misalignment:
    - Proxy: mesa objective is a proxy for the base objective in the training environment
      - Side-effect: optimizing mesa objective happens to optimize base objective
      - Instrumental: optimizing base objective happens to optimize mesa objective
    - Approximate: objectives differ due to an approximation error, caused by limits of representation in mesa-optimizer’s model
    - Suboptimal: optimizing mesa objective happens to optimize base objective due to a flaw or limitation in the mesa-optimizer’s optimization process
    
    What makes this particularly concerning relative to the “standard” alignment problem? The misalignment may be particularly difficult to detect (see next step). The mesa-optimizer has been “screened” by the base optimizer during training, so it has to be aligned with the base objective to some extent, at least in the training environment. This means it is either aligned or deceptively misaligned. The base probability of a misaligned mesa-optimizer is lower than that of a misaligned system in general, but conditional on it happening, it could carry greater risk.
  - -> Deployment
    Training and testing in a controlled environment do not give enough evidence of risk
    Mesa-optimizer is deceptive, to increase its chance of deployment
  - -> Distribution shift
    Deployment environment differs from training environment or controlled testing environment
  - -> Unanticipated behaviour
    The distribution shift causes the misalignment to manifest in different behaviour
    There is a “standard” problem of robustness to distribution shift that applies to any machine learning model, not just mesa-optimizers. What makes a mesa-optimizer particularly concerning? Optimization will tend to make the unanticipated behaviour more adversarial and influence-seeking, because a mesa-optimizer can freely pursue an objective rather than merely perform a specified task.
  - -> Catastrophe
    The unanticipated behaviour is so bad that it leads to a permanent, huge loss of value.
    Ways this doesn’t happen:
    The behaviour is counter-productive: perhaps self-destructive, or not directed or consistent enough to lead to anything bad (depends on how sensitive the domain is to random error)
    Other actors intervene (depends on whether the system has a decisive strategic advantage)
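
As a rough illustration of the chain above, the following toy sketch shows proxy misalignment staying hidden in training and then manifesting after a distribution shift. Everything in it is hypothetical and not from the original post: the two-door environment, the colour feature, and the names `base_objective`, `mesa_objective`, `mesa_policy`, `sample_state` and `average_base_reward`. The mesa objective ("pick the green door") coincides with the base objective ("pick the exit") only because the exit is always green during training.

```python
import random


def base_objective(state, action):
    """Reward the designer intends: 1 if the chosen door is the exit."""
    return 1.0 if state["doors"][action]["is_exit"] else 0.0


def mesa_objective(state, action):
    """Proxy the learned policy actually optimizes: 1 if the chosen door is green."""
    return 1.0 if state["doors"][action]["colour"] == "green" else 0.0


def mesa_policy(state):
    """Choose the door that maximizes the mesa objective, not the base objective."""
    actions = range(len(state["doors"]))
    return max(actions, key=lambda a: mesa_objective(state, a))


def sample_state(rng, exit_is_green):
    """Two doors, exactly one exit. Assumption: colour perfectly tracks
    'is exit' in training (exit_is_green=True) and is flipped in deployment."""
    exit_index = rng.randrange(2)
    doors = []
    for i in range(2):
        is_exit = i == exit_index
        colour = "green" if is_exit == exit_is_green else "red"
        doors.append({"is_exit": is_exit, "colour": colour})
    return {"doors": doors}


def average_base_reward(exit_is_green, episodes=1000, seed=0):
    """Score the mesa policy against the base objective over many episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = sample_state(rng, exit_is_green)
        total += base_objective(state, mesa_policy(state))
    return total / episodes


# Training environment: the proxy and the base objective agree.
train_score = average_base_reward(exit_is_green=True)
# Deployment environment: the colour correlation has shifted.
deploy_score = average_base_reward(exit_is_green=False)
print(train_score, deploy_score)
```

Under these assumptions the policy scores perfectly on the base objective in training and fails completely once the colour correlation flips, even though the policy itself never changed. A real mesa-optimizer would not be this legible; the sketch only makes the Misalignment → Distribution shift → Unanticipated behaviour steps concrete.
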
- Interventions
  - Base → Mesa: prevent mesa-optimizer arising
    Alternative methods of subsystem generation beyond search
  - Mesa → Misalignment: prevent mesa-optimizer from being misaligned
    Standard alignment methods
    Specific methods for base optimizer to align mesa optimizer
  - Misalignment → Deployment: don’t deploy misaligned mesa-optimizer
    Transparency
    Interpretability
  - Deployment → Distribution shift: ensure deployment environment does not differ significantly from training environment or controlled testing environment
    The converse (making the training environment resemble the deployment environment) is probably more tractable, in which case the intervention comes before Base → Mesa
  - Distribution shift → Unanticipated behaviour: prevent unanticipated behaviour by having the system recognise that it is out of distribution, then defer or fail gracefully
    Out-of-distribution detection
    Well-calibrated uncertainty estimation
    Corrigibility
  - Unanticipated behaviour → Catastrophe: prevent catastrophe
    - Internal
      - Impact measurement
    - External
      - Shut down
      - Stop or steer the behaviour in various ways, depending on the nature of the behaviour
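
The out-of-distribution detection and deferral idea listed under Distribution shift → Unanticipated behaviour can be sketched minimally as follows. This is an assumption-laden toy, not a method proposed in the post: inputs are reduced to a single scalar feature, a z-score threshold stands in for a real out-of-distribution detector, and the names `fit_detector`, `is_out_of_distribution` and `act_or_defer` are hypothetical.

```python
from statistics import mean, stdev


def fit_detector(training_inputs):
    """Summarize the training distribution of one scalar feature
    as (mean, sample standard deviation)."""
    return mean(training_inputs), stdev(training_inputs)


def is_out_of_distribution(x, detector, z_threshold=3.0):
    """Flag inputs that lie more than z_threshold standard deviations
    from the training mean. The threshold value is an arbitrary assumption."""
    mu, sigma = detector
    if sigma == 0:
        # Degenerate training data: anything other than the single seen value is novel.
        return x != mu
    return abs(x - mu) / sigma > z_threshold


def act_or_defer(x, detector, policy):
    """Run the policy only on inputs that look like training data;
    otherwise fail gracefully by deferring to a human or fallback."""
    if is_out_of_distribution(x, detector):
        return "defer"
    return policy(x)


# Hypothetical training inputs and a stand-in policy.
detector = fit_detector([9.0, 10.0, 11.0, 10.5, 9.5])
familiar = act_or_defer(10.2, detector, policy=lambda x: "act")
novel = act_or_defer(25.0, detector, policy=lambda x: "act")
print(familiar, novel)
```

The design choice here is that the check sits outside the policy, so deferral does not rely on the policy's own judgement. Whether such a check would be reliable against a deceptive mesa-optimizer is not something this sketch addresses.
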