This is great—thanks for writing this! I particularly liked your explanation of deceptive alignment with the diagrams to explain the different setups. Some comments, however:
(These models are called the training data and the setting is called supervised learning.)
Should be “these images are.”
Thus, there is only a problem if our way of obtaining feedback is flawed.
I don’t think that’s right. Even if the feedback mechanism is perfect, if your inductive biases are off, you could still end up with a highly misaligned model. Consider, for example, Paul’s argument that the universal prior is malign—that’s a setting where the feedback is perfect but you still get malign optimization because the prior is bad. (There’s a toy illustration of this at the end of this comment.)
For proxy alignment, think of Martin Luther King.
The analogy is meant to be to the original Martin Luther, not MLK.
If we further assume that processing input data doesn’t directly modify the model’s objective, it follows that representing a complex objective via internalization is harder than via “modelling” (i.e., corrigibility or deception).
I’m not exactly sure what you’re trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.
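Here’s the toy illustration I mentioned for #2. It is entirely my own construction, not Paul’s actual argument; the task, the two hypotheses, and the hard-coded “prior” are all made up. Both hypotheses fit the training labels perfectly, so the feedback never errs, yet a bad prior selects the one that misgeneralizes.

```python
# Toy sketch (mine, not Paul's actual argument): perfect feedback on the
# training distribution does not pin down off-distribution behaviour; the
# prior / inductive bias decides which perfectly-fitting hypothesis wins.

# Training data: inputs 0..9, label = "is the input even?"
train = [(x, x % 2 == 0) for x in range(10)]

# Two hypotheses that both fit the training data perfectly.
aligned = lambda x: x % 2 == 0                          # the intended rule
misaligned = lambda x: x % 2 == 0 if x < 10 else True   # defects off-distribution

assert all(h(x) == y for x, y in train for h in (aligned, misaligned))

# A deliberately bad "prior" standing in for the learner's inductive bias:
# it prefers the misaligned hypothesis even though the feedback never errs.
prior = {aligned: 0.1, misaligned: 0.9}
selected = max((aligned, misaligned), key=prior.get)

print(selected(4))    # True  -- correct on a training-style input
print(selected(13))   # True  -- wrong: 13 is odd, but the selected rule says True
```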
Many thanks for taking the time to find errors.
I’ve fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I’ve just made a vague statement that misalignment can arise for other reasons and linked to Paul’s post.
I’m hesitant to change #4 before I fully understand the issue.
I’m not exactly sure what you’re trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.
So, there are these two channels, input data and SGD. If the model’s objective can only be modified by SGD, then (since SGD doesn’t want to do super complex modifications) it is easier for SGD to create a pointer rather than duplicate the [model of the Base Objective] explicitly. (I’ve tried to make this picture concrete in the toy sketch at the end of this comment.)
But the bolded part seemed like a necessary condition, and that’s what I’m trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don’t think I quite understand why this isn’t plausible. If the [model of the Base Objective] and the [Mesa Objective] get modified simultaneously, I don’t see any one step where this is harder than creating a pointer. You seem to need an argument for why the [model of the Base Objective] gets represented in full before the [Mesa Objective] is modified.
Edit: I slightly rephrased it to say
If we further assume that processing input data doesn’t directly modify the model’s objective (the Mesa Objective), or that its model of the Base Objective is created first,[4]
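Here is the sketch I referred to above: a purely illustrative toy (the names and strings are mine, not from the post) of the duplication-versus-pointer intuition, i.e. why writing the [Mesa Objective] as a pointer into the world model should be cheaper than re-encoding the [model of the Base Objective] a second time.

```python
# Purely illustrative toy of the duplication-vs-pointer intuition; none of
# these names come from the post. Assume the world model already contains a
# learned description of the base objective, because the training data
# describes it.

WORLD_MODEL = {
    "physics": "... lots of learned facts about the environment ...",
    "base_objective": "do whatever the base objective, as described by the training data, rewards",
}

# Internalization: the objective is written out a second time, separately
# from the world model, so all of that content has to be re-encoded.
class InternalizedModel:
    mesa_objective = "do whatever the base objective, as described by the training data, rewards"

# Modelling (corrigibility or deception): the objective is only a short
# pointer into the world model; the content is reused rather than duplicated.
class PointerModel:
    mesa_objective = "base_objective"  # a cheap reference

    def objective(self):
        return WORLD_MODEL[self.mesa_objective]  # dereference the pointer

# Both end up pursuing the same thing; they differ only in how much extra
# information had to be written into the objective slot.
assert PointerModel().objective() == InternalizedModel.mesa_objective
```

Nothing here is meant to settle the timing question above; it only tries to pin down what “duplication” versus “pointer” means.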
The post still contains a misplaced mention of MLK shortly after the first mention of Luther.
Ah, shoot. Thanks.