I’ve fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I’ve just made a vague statement that misalignment can arise for other reasons and linked to Paul’s post.
I’m hesitant to change #4 before I fully understand why.
I’m not exactly sure what you’re trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.
So, there are these two channels, input data and SGD. If the model’s objective can only be modified by SGD, then, since SGD doesn’t want to do super complex modifications, it is easier for SGD to create a pointer than to duplicate the [model of the Base Objective] explicitly.
But the bolded part seemed like a necessary condition, and that’s what I’m trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don’t think I quite understand why this isn’t plausible. If the [model of the Base Objective] and the [Mesa Objective] get modified simultaneously, I don’t see any one step where this is harder than creating a pointer. You seem to need an argument for why the [model of the Base Objective] gets represented in full before the [Mesa Objective] is modified.
Edit: I slightly rephrased it to say
If we further assume that processing input data doesn’t directly modify the model’s objective (the Mesa Objective), or that its model of the Base Objective is created first,[4]
Many thanks for taking the time to find errors.
The post still contains a misplaced mention of MLK shortly after the first mention of Luther:
Ah, shoot. Thanks.