I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models (i.e., even if the universal prior weren't malign): the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't. In fact, I'm not sure there's even that much of a connection between the malignity of those two priors.
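(For reference, the "universal prior" here is the Solomonoff prior; roughly, in one standard formulation, a string x is weighted by the total weight of programs that produce it on a universal prefix machine U, each program p discounted by its length |p|:

$$M(x) = \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}$$

where the sum runs over programs whose output begins with x; the notation U, |p|, and x* is mine, not from the discussion. The neural net prior, by contrast, is whatever implicit distribution over functions the architecture, initialization, and training process induce, so an argument that one of these priors is malign need not carry over to the other.)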
If the universal prior were benign but NNs were still potentially malign, I think I would argue strongly against the use of NNs and in favor of more direct approximations of the universal prior. But I agree this is not 100% obvious; giving up prosaic AI is giving up a lot.
Also, I think this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say the main point is that whatever prior we actually use in practice will probably be malign.
Hopefully my final write-up won’t contain as much polemicizing about what “the main point” is as this write-up does, and will instead just contain good descriptions of the various important problems.