I mean, it’s mostly semantics, but I think of mechanistic interpretability as “inner” but not alignment, and I think that framing is clearer, personally, so that we don’t call everything alignment. Observing properties doesn’t automatically get you good properties. I’ll read your link, but it’s a bit too much to wade into for me at the moment.
Either way, my question is easy to restate: is mechanistic interpretability the only inner alignment work Anthropic is doing?
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
Developing model organisms of deceptive alignment, which we may use to study the risk factors for deceptive alignment
Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
Studying the extent to which models exhibit likely prerequisites to deceptive alignment, such as situational awareness (a very preliminary exploration is in Sec. 5 of our paper on model-written evaluations)
Investigating the extent to which externalized reasoning (e.g. chain of thought) is a way to gain transparency into a model’s process for solving a task (a rough illustration follows this list)
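To make the externalized-reasoning idea concrete, here is a minimal, hypothetical sketch: the model is prompted to write its reasoning out as text before its answer, so that the reasoning itself can be read and inspected. The `query_model` function is a placeholder stand-in, not any particular API, and the prompt format is only illustrative.

```python
def query_model(prompt: str) -> str:
    # Placeholder: in practice this would call an actual language model API.
    return ("Reasoning: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.\n"
            "Answer: 408")


def answer_with_externalized_reasoning(question: str) -> tuple[str, str]:
    # Ask the model to expose its intermediate reasoning in plain text,
    # then split the completion into the reasoning and the final answer.
    prompt = (
        f"Question: {question}\n"
        "Think step by step. Write your reasoning after 'Reasoning:' "
        "and your final answer after 'Answer:'."
    )
    completion = query_model(prompt)
    reasoning_part, _, answer_part = completion.partition("Answer:")
    reasoning = reasoning_part.removeprefix("Reasoning:").strip()
    return reasoning, answer_part.strip()


reasoning, answer = answer_with_externalized_reasoning("What is 17 * 24?")
print("Visible reasoning:", reasoning)  # this text is what we would inspect
print("Final answer:", answer)
```

The open question, of course, is whether the visible reasoning faithfully reflects the process the model actually used to reach its answer.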
There’s also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant or helpful for inner alignment. It’s pretty unclear to me how useful any of these directions will ultimately be for inner alignment, but we chose them in large part because we’re very concerned about inner alignment, and we’re actively looking for new directions that seem useful for mitigating inner misalignment risks.
Thanks for the links and explanation, Ethan.