My third and final example: in one conversation, someone made a claim that I see as “exactly wrong”: that we can somehow lower-bound the complexity of a mesa-optimizer relative to a non-agentic hypothesis (perhaps because a mesa-optimizer has to contain a world-model plus additional machinery, whereas a regular hypothesis only needs to model the world directly). This idea was used to argue against a concern of mine.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
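To be concrete about what such a bound would have to say (the notation here is mine; as far as I know, nobody can establish anything like it): write $K(h)$ for the description length of a hypothesis $h$ under whatever prior the learner actually uses. The hoped-for claim is something like

$$K(h_{\text{mesa}}) \;>\; K(h_{\text{direct}}) \quad \text{whenever both fit the training data,}$$

which, if true, would mean that selecting $\hat{h} = \operatorname{argmin}\{K(h) : h \text{ fits the data}\}$ never returns a mesa-optimizer, so “focus on the simplest hypothesis that fits the data” really would be a complete answer. The arguments for malign priors are, in effect, arguments that we cannot count on this inequality.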
I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models (i.e., if the universal prior weren’t malign): the neural net prior (or whatever other ML prior we use in practice) might be malign even if the universal prior isn’t. (In fact, I’m not sure there’s even that much of a connection between the malignity of those two priors.)
Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point is that whatever prior we use in practice will probably be malign. That does suggest, though, that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then you’ve made significant progress on the inner alignment problem. Of course, I agree that we’d like to be confident that there’s as little malignancy/deception as possible, so just defusing the arguments we can come up with might not be enough; but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is at least an attempt to address the core problem.
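As a rough way to see why the malignity of those two priors can come apart (illustrative notation only, not a precise definition anyone has settled on): call a prior $p$ malign relative to training data $D$ when it puts non-negligible posterior weight on deceptive hypotheses among those that fit $D$,

$$p(\text{deceptive} \mid D) \;=\; \frac{\sum_{h \text{ deceptive},\; h \text{ fits } D} p(h)}{\sum_{h \text{ fits } D} p(h)} \;\not\ll\; 1.$$

The universal prior weights hypotheses by program length, while the neural net prior weights them by something like how much of parameter space implements them under the initialization distribution; since the two can rank the very same hypotheses quite differently, this ratio being small for one prior says little about the other.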
If the universal prior were benign but NNs were still potentially malign, I think I would argue strongly against the use of NNs and in favor of more direct approximations of the universal prior. But I agree this is not 100% obvious; giving up prosaic AI is giving up a lot.
Hopefully my final write-up won’t contain so much polemicizing about what “the main point” is, the way this write-up does, and will instead just contain good descriptions of the various important problems.
Agree.