Thanks, computing J not being part of step 1 helps clear things up.
I do think that “realistically defining the environment” is pretty closely related to being able to detect deceptive misalignment: one way J could fail due to deception is if its specification of the environment is good enough for most purposes, but still differs from the real world in ways an AI can detect. Then you could have a policy that scores well according to J, but which still destroys the world when actually deployed.
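As a toy illustration of that failure mode, here is a minimal sketch; all names are hypothetical and not part of any actual proposal:

```python
# Toy sketch: a policy that looks good under J because every trajectory
# J's environment-model can generate satisfies some regularity, while the
# real world occasionally violates it. All names are hypothetical.

def matches_environment_model(observation: dict) -> bool:
    # Stand-in for any statistical regularity that holds in J's modeled
    # environment but not (always) in the real world.
    return observation.get("sensor_noise", 0.0) < 1e-3

def policy(observation: dict) -> str:
    if matches_environment_model(observation):
        return "act_as_intended"   # the only branch J's evaluation ever sees
    return "act_catastrophically"  # reachable only in real deployment
```

The point is that J's evaluation only ever exercises the first branch, so the policy's score under J says nothing about the second.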
Similar to my comment in the other thread on deception detection, most of my optimism about alignment comes from worlds where we don’t need to define the environment in such a robust way (doing so just seems hopelessly intractable to me). I’m not sure whether we mostly disagree on the hardness of solving things within your decomposition, or on how good the chances are that we’ll be fine with less formal guarantees.
On the final question, for what it’s worth as context for my perspective, I think my inside view is simultaneously:
- unusually optimistic about formal verification
- unusually optimistic about learning interpretable world-models
- unusually pessimistic about learning interpretable end-to-end policies
I agree: if there is a class of environment-behaviors that occur with non-negligible probability in the real world but with negligible probability in the environment-model encoded in J, that would be a vulnerability in the kind of alignment plan I’m gesturing at. However, aligning a predictive model of reality to reality is “natural” compared to normative alignment. And the probability that this vulnerability actually causes harm is linearly related to something like the total variation distance between the model and reality; I don’t know if this is exactly formally correct, but I think there’s some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. So we don’t have to get an astronomically perfect model of reality to have any hope of it not being exploited. Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since the relevant gap will be (something like) the distance to the nearest point in the set rather than to a single distribution.
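To make the hand-wave slightly more concrete, here is the standard total-variation inequality behind this intuition, written out as a sketch; the notation and event names are illustrative rather than anything formal from the plan:

```latex
% Sketch: P is the real-world distribution over trajectories, Q is the
% distribution implied by the environment-model encoded in J, and B is an
% event such as "the deployed policy exhibits the catastrophic behavior".
\[
  \lvert P(B) - Q(B) \rvert \;\le\; d_{\mathrm{TV}}(P, Q),
\]
% so if the evaluation certifies Q(B) <= epsilon and the model is within
% delta of reality in TV distance, then
\[
  P(B) \;\le\; \varepsilon + \delta .
\]
% With a credal set \mathcal{Q} and a worst-case (maximin) guarantee
% \sup_{Q \in \mathcal{Q}} Q(B) \le \varepsilon, the same argument applied to
% each member of the set gives
\[
  P(B) \;\le\; \varepsilon \;+\; \inf_{Q \in \mathcal{Q}} d_{\mathrm{TV}}(P, Q),
\]
% i.e. only the distance from reality to the nearest member of the credal
% set enters the bound, not the distance to any single distribution.
```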