davidad comments on Reframing inner alignment

davidad 11 Dec 2022 21:23 UTC
LW: 2 AF: 1
0
AF
I agree, if there is a class of environment-behaviors that occur with nonnegligible probability in the real world but occur with negligible probability in the environment-model encoded in $J$ , that would be a vulnerability in the shape of alignment plan I’m gesturing at. However, aligning a predictive model of reality to reality is “natural” compared to normative alignment. And the probability with which this vulnerability can actually be bad is linearly related to something like total variation distance between the model and reality; I don’t know if this is exactly formally correct, but I think there’s some true theorem vaguely along the lines of: a 1% TV distance could only cause a 1% chance of alignment failure via this vulnerability. We don’t have to get an astronomically perfect model of reality to have any hope of its not being exploited. Judicious use of worst-case maximin approaches (e.g. credal sets rather than pure Bayesian modeling) will also help a lot with narrowing this gap, since it will be (something like) the gap to the nearest point in the set rather than to a single distribution.