I’m confused by the analogy between this experiment and aligning a superintelligent model.
I can imagine someone seeing the RLHF result and saying, “oh, that’s great news for alignment! If we train a superintelligent model on our preferences, it will just imitate our preferences as-is, rather than treating them as a flawed approximation of some other, ‘correct’ set of preferences and then imitating those instead.”
But the paper’s interpretation is the opposite of this. From the paper’s perspective, it’s bad if the student (analogized to a superintelligence) simply imitates the preferences of the teacher (analogized to us), as opposed to imitating some other set of “correct” preferences which differ from what the teacher explicitly expressed.
Now, of course, there is a case where it makes sense to want this out of a superintelligence, and it’s a case that the paper talks about to motivate the experiment: the case where we don’t understand what the superintelligence is doing, and so we can’t confidently express preferences about its actions.
That is, although we may basically know what we want at a coarse outcome level—“do a good job, don’t hurt anyone, maximize human flourishing,” that sort of thing—we can’t translate this into preferences about the lower-level behaviors of the AI, because we don’t have a good mental model of how the lower-level behaviors cause higher-level outcomes.
From our perspective, the options for lower-level behavior all look like “should it do Incomprehensibly Esoteric Thing A or Incomprehensibly Esoteric Thing B?” If asked to submit a preference annotation for this, we’d shrug and say “uhh, whichever one maximizes human flourishing??” and then press button A or button B effectively at random.
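To make that concrete, here’s a toy sketch (mine, not anything from the paper): if the pairwise labels really are coin flips, a simple Bradley-Terry-style reward model fit to them recovers essentially nothing about the “true” reward, whereas the same pipeline works fine when the labels carry signal. All the names here (`fit_bradley_terry`, `true_w`, the feature setup) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "action" has hidden features, and the *true* reward
# (human flourishing, say) is a fixed linear function of those features.
n_pairs, dim = 2000, 8
true_w = rng.normal(size=dim)

A = rng.normal(size=(n_pairs, dim))    # Incomprehensibly Esoteric Thing A
B = rng.normal(size=(n_pairs, dim))    # Incomprehensibly Esoteric Thing B
true_pref = (A @ true_w) > (B @ true_w)  # which option actually maximizes flourishing

# An annotator who understands the task produces true_pref;
# an annotator mashing buttons produces coin flips.
random_labels = rng.random(n_pairs) < 0.5

def fit_bradley_terry(A, B, labels, lr=0.1, steps=500):
    """Fit a linear reward model r(x) = w.x from pairwise labels
    (label True = A preferred) via logistic regression on feature differences."""
    w = np.zeros(A.shape[1])
    X = A - B
    y = labels.astype(float)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
    return w

w_random = fit_bradley_terry(A, B, random_labels)
w_informed = fit_bradley_terry(A, B, true_pref)

acc = lambda w: np.mean(((A @ w) > (B @ w)) == true_pref)
print(f"reward model from random labels:   {acc(w_random):.2f}")   # ~0.5, i.e. chance
print(f"reward model from informed labels: {acc(w_informed):.2f}")  # well above chance
```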
But in this case, trying to align the AI by expressing preferences about low-level actions seems like an obviously bad idea, to the point that I wouldn’t expect anyone to try it? Like, if we get to the point where we are literally doing preference annotations on Incomprehensibly Esoteric Things, and we know we’re basically pushing button A or button B at random because we don’t know what’s going on, then I assume we would stop and try something else.
(It is also not obvious to me that the reward modeling experiment factored in this way, with the small teacher “having the right values” but not understanding the tasks well enough to know which actions were consistent with them. I haven’t looked at every section of the paper, so maybe this was addressed?)
At that point, finetuning on preference annotations no longer conveys our preferences to the AI, because the annotations no longer capture them. Instead, I’d imagine we would want to convey our preferences to the AI in a more direct and task-independent way—to effectively say, “what we want is for you to do a good job, not hurt anyone, maximize human flourishing; just do whatever accomplishes that.”
And since LLMs are very good at language and human-like intuition, and can be finetuned for generic instruction-following, literally just saying that (or something similar) to an instruction-following superintelligent LLM would be at least a strong baseline, and presumably better than preference data we know is garbage.
(In that last point, I’m leaning on the assumption that we can finetune a superintelligence for generic instruction-following more easily than we can finetune it for a specific task we don’t understand.
This seems plausible: we can tune it on a diverse set of instructions paired with behaviors we know are appropriate [because the tasks are merely human-level], and it’ll probably make the obvious generalization of “ah, I’m supposed to do whatever it says in the instruction slot,” rather than the bizarre misfire of “ah, I’m supposed to do whatever it says in the instruction slot unless the task requires superhuman intelligence, in which case I’m supposed to do some other thing.” [Unless it is deceptively aligned, but in that case all of these techniques will be equally useless.])
You might find Appendix G in the paper worth reading.