This podcast delves into a bunch of questions and thoughts around <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@). Here are some of the points that stood out to me (to be clear, many of these have been covered in this newsletter before, but it seemed worth it to state them again):
- A model is a mesa optimizer if it is a _mechanistic_ optimizer, that is, it executes an algorithm that performs search for some objective.
- We need to focus on mechanistic optimizers instead of things that merely behave as though they are optimizing for some goal, because the two categories can have very different generalization behavior, and we are primarily interested in how they will generalize (the toy sketch after this list illustrates the distinction).
- Humans do seem like mesa optimizers relative to evolution (though perhaps not a central example). In particular, it seems accurate to say that humans look at different possible strategies and select the ones which have good properties, and thus we are implementing a mechanistic search algorithm.
- To reason about whether machine learning will result in these mechanistic optimizers, we need to reason about the _inductive biases_ of machine learning algorithms. We mostly don’t yet know how likely they are.
- Evan expects that in powerful neural networks there will exist a combination of neurons that encodes the objective, which we might be able to find with interpretability techniques.
- We can’t rely on generalization bounds to guarantee performance, since in practice there is always some distribution shift (which invalidates those bounds).
- Although it is usually phrased in the train/test paradigm, mesa optimization is still a concern in an online learning setup, since at every timestep we are interested in whether the model will generalize well to the next data point it sees.
- We will probably select for simple ML models (in the sense of short description length) but not for low inference time, so mechanistic optimizers are more likely than models that use more space (the extreme version being lookup tables).
- If you want to avoid mesa optimizers entirely (rather than aligning them), you probably need a pretty major change from the current practice of AI, as with STEM AI and Microscope AI (explained <@here@>(@An overview of 11 proposals for building safe advanced AI@)).
- Even in a <@CAIS scenario@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@) where we have (say) a thousand models doing different tasks, each of those tasks will still likely be complex enough to lead to the models being mesa optimizers.
- There are many more mesa objectives that would lead to deceptive alignment than to corrigible or internalized alignment, and so we should expect deceptive alignment a priori.
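To make the mechanistic distinction concrete, here is a toy sketch in Python (my own construction, not something from the podcast; the state/action setup and all names are invented for illustration). Both policies pick the same actions on the training inputs, but one runs an explicit search against an internal objective while the other memorizes a lookup table, and they come apart off-distribution.

```python
# Toy illustration only: a "mesa optimizer" style policy that searches over
# actions for an internal objective, versus a lookup-table policy that merely
# reproduces the same behavior on the training inputs.

ACTIONS = [-2, -1, 0, 1, 2]

def internal_objective(state, action):
    # The internal (mesa) objective in this toy example: move the state toward zero.
    return -abs(state + action)

def search_policy(state):
    # Mechanistic optimizer: explicitly searches over actions for the best one.
    return max(ACTIONS, key=lambda a: internal_objective(state, a))

TRAIN_STATES = [-2, -1, 0, 1, 2]
LOOKUP_TABLE = {s: search_policy(s) for s in TRAIN_STATES}

def table_policy(state):
    # Behaves identically on the training distribution, but performs no search;
    # it just stores input-output pairs (and falls back to a default elsewhere).
    return LOOKUP_TABLE.get(state, 0)

# Identical behavior in-distribution...
assert all(search_policy(s) == table_policy(s) for s in TRAIN_STATES)
# ...but different generalization off-distribution.
print(search_policy(17))  # -2: still pursues its internal objective
print(table_policy(17))   #  0: no stored behavior, returns the default
```

The search policy also has a much shorter description than a table covering every possible state, which is the sense in which selecting for short description length (but not for low inference time) favors mechanistic optimizers.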
(sorry I didn’t reply to this when you messaged it to me privately; this has been a low-brain-power week)
> To reason about whether machine learning will result in these mechanistic optimizers, we need to reason about the inductive biases of machine learning algorithms. We mostly don’t yet know how likely they are.
I think Evan also indirectly appeals to ‘inductive biases’ in the parameter-to-function mapping of neural networks, e.g. the result on properties of randomly initialized nets that Joar Skalse contributed to.
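As a minimal sketch of what the parameter-to-function mapping refers to here (an invented toy setup, not the setup from the cited work): sample random weights for a small network, read off the Boolean function it computes on a small input space, and look at how often each function appears. The skew toward a few simple functions is the kind of inductive bias being appealed to.

```python
# Sketch only: empirically probing the parameter-to-function mapping of a tiny
# random ReLU MLP on Boolean inputs, by counting which functions random
# parameter draws induce.
import itertools
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
N_INPUTS = 5
INPUTS = np.array(list(itertools.product([0.0, 1.0], repeat=N_INPUTS)))

def random_net_function(hidden=32):
    """Sample a random 2-layer ReLU MLP; return the 0/1 function it computes."""
    w1 = rng.normal(size=(N_INPUTS, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=hidden)
    b2 = rng.normal()
    logits = np.maximum(INPUTS @ w1 + b1, 0.0) @ w2 + b2
    return tuple((logits > 0).astype(int))  # one of 2**32 possible functions

counts = Counter(random_net_function() for _ in range(20_000))
print("distinct functions seen:", len(counts))
print("most frequent:", counts.most_common(3))
# With 2**32 possible functions, any repeats at all indicate a strong bias;
# in runs like this, very simple functions (e.g. constants) tend to dominate.
```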
Also, my biggest take-away was the argument for why we shouldn’t expect myopia by default (though perhaps this was already obvious to others). My understanding is that there are two arguments:
- A myopic objective requires an extra distinction to say “don’t continue past the end of the episode”.
- Something about online learning.

The online learning argument is actually super complicated and depends on a bunch of factors, so I’m not going to summarize that one here. I’ve just added the other one:
> - Even if training on a myopic base objective, we might expect the mesa objective to be non-myopic, as the non-myopic objective “pursue X” is simpler than the myopic objective “pursue X until time T”.
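A minimal way to see the description-length point (my own sketch; `score`, `T`, and the two objective functions are invented stand-ins, not anything from the episode): the myopic version has to carry an extra comparison and an extra constant for the episode boundary, so under a crude simplicity measure it is strictly more complex.

```python
# Sketch of the simplicity argument: "pursue X" vs "pursue X until time T".

def score(x):
    # Stand-in for "X", whatever quantity is being pursued.
    return float(x)

T = 10  # episode length: an extra constant only the myopic objective needs

def non_myopic_objective(x, t):
    return score(x)                      # "pursue X"

def myopic_objective(x, t):
    return score(x) if t < T else 0.0    # "pursue X until time T"
```

Everything the myopic objective does, the non-myopic one does with strictly less machinery, so a learning process biased toward short descriptions has no particular reason to pay for the extra check.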
Re: the parameter-to-function mapping point: I did mean to include that; I’m going to delete the word “algorithms” from that bullet, since that’s what’s causing the ambiguity.