This was an interesting read, especially the first section!
I’m confused by some aspects of the proposal in section 4, which makes it hard to say exactly what would go wrong. As a starting point, what’s the training signal in the final step (RL training)? I think you’re assuming we have some outer-aligned reward signal, is that right? But then it seems like that reward signal would have to do the work of making sure that the AI only gets rewarded for following human instructions in a “good” way; I don’t think we just get that for free. As a silly example, if we rewarded the AI whenever it literally followed our commands, then even with this setup, it seems quite clear to me we’d at best get a literal-command-following AI, and not an AI that does what we actually want. (I’m not sure whether you even meant to imply that the proposal solves that problem, or whether this is purely about inner alignment.)
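To make the silly example a bit more concrete, here’s a minimal sketch (everything in it is hypothetical, my toy framing rather than anything from the post) of why the choice of reward signal seems to be doing the real work here:

```python
# Toy illustration: a reward signal that pays out for literal compliance
# selects for a literal-command-follower, regardless of the rest of the setup.
# All names are hypothetical stand-ins, not from the proposal.

from dataclasses import dataclass


@dataclass
class Episode:
    command: str            # what the human said
    literal_outcome: bool   # did the AI do the literal thing that was said?
    intended_outcome: bool  # did the AI do what the human actually wanted?


def literal_reward(ep: Episode) -> float:
    """The 'silly example' reward: pays out for literal compliance."""
    return 1.0 if ep.literal_outcome else 0.0


def intent_reward(ep: Episode) -> float:
    """Pays out for doing what the human actually wanted (the hard-to-specify part)."""
    return 1.0 if ep.intended_outcome else 0.0


# An episode where the literal reading and the intended reading come apart:
ep = Episode(command="fetch me coffee, whatever it takes",
             literal_outcome=True,      # e.g. it wrecked the kitchen on the way
             intended_outcome=False)

print(literal_reward(ep))  # 1.0 -- this signal reinforces the bad interpretation
print(intent_reward(ep))   # 0.0 -- only this signal penalizes it
```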
> The complexity regularizer should ensure the AI doesn’t develop some separate procedure for interpreting commands (which might end up crucially flawed/misaligned). Instead, it will use the same model of humans it uses to make predictions, and inaccuracies in it would equal inaccuracies in predictions, which would be purged by the SGD as it improves the AI’s capabilities.
Since this sounds to me like you are saying this proposal will automatically lead to commands being interpreted the way we mean them, I’ll say more on this specifically: the AI will presumably not only have a model of what humans actually want when they give commands (even assuming that’s one of the things it internally represents); it should just as easily be able to interpret commands literally using its existing world model (something humans can do as well if we want to). So which of these you get would depend on the reward signal, I think.
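Here’s a toy way of seeing why I don’t think the complexity regularizer settles this (again, every name here is hypothetical, not from your proposal): once a general world model exists, both readings of a command are roughly equally cheap to bolt onto it, so simplicity alone doesn’t pick between them.

```python
# Both interpreters below are one query into the same world model, so neither
# adds much description length on top of it. Hypothetical stand-in code only.

from typing import Callable

WorldModel = Callable[[str], str]  # stand-in for the learned predictive model


def toy_world_model(query: str) -> str:
    answers = {
        "literal meaning of 'clean my room'": "remove every object from the room",
        "what the human means by 'clean my room'": "tidy up, keep my belongings",
    }
    return answers.get(query, "unknown")


def interpret_literally(command: str, model: WorldModel) -> str:
    # Thin wrapper #1 over the shared world model.
    return model(f"literal meaning of '{command}'")


def interpret_as_intended(command: str, model: WorldModel) -> str:
    # Thin wrapper #2 over the same world model, about equally simple.
    return model(f"what the human means by '{command}'")


print(interpret_literally("clean my room", toy_world_model))
print(interpret_as_intended("clean my room", toy_world_model))
```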
For related reasons, I’m not even convinced you get something that’s inner-aligned in this proposal. It’s true that if everything works out the way you’re hoping, you won’t be starting with pre-existing inner-misaligned mesa-objectives; you’ll just have a pure predictive model and GPS. But then there are still lots of objectives that could be represented in terms of the existing predictive model that would all achieve high reward. I don’t quite follow why you think the objective we want would be especially likely: my sense is that even if “do what the human wants” is pretty simple to represent in the AI’s ontology, other objectives will be too (as one example, if the AI is already modeling the training process from the beginning of RL training, then “maximize the number in my reward register” might also be a very simple “connective tissue”).
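To illustrate the reward-register worry (again a hypothetical sketch, not your setup): both objectives can be one-line “connective tissue” on top of the predictive model, and reward on the training distribution can’t tell them apart.

```python
# Two candidate mesa-objectives that are equally simple to express and get
# identical reward during training. Hypothetical stand-in code only.

from dataclasses import dataclass


@dataclass
class TrainingStep:
    human_satisfied: bool   # whether the human got what they actually wanted
    reward_register: float  # the number the training process writes down


def wants_objective(step: TrainingStep) -> float:
    """'Do what the human wants', read off the predictive model."""
    return 1.0 if step.human_satisfied else 0.0


def register_objective(step: TrainingStep) -> float:
    """'Maximize the number in my reward register', also a one-liner."""
    return step.reward_register


# On-distribution, the overseer writes reward 1.0 exactly when the human is
# satisfied, so both objectives rank every training episode identically:
training_data = [TrainingStep(True, 1.0), TrainingStep(False, 0.0),
                 TrainingStep(True, 1.0)]

assert all(wants_objective(s) == register_objective(s) for s in training_data)
# They only come apart off-distribution (e.g. if the AI can tamper with the
# register), which is exactly where it matters which one SGD actually found.
```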