Thanks for writing this. I’ve been having a lot of similar conversations, and found your post clarifying because it states a lot of the core arguments clearly.
Is there an even better critique that the Skeptic could make?
Focusing first on human preference learning as a subset of alignment research: I think most ML researchers “should” agree on the importance of simple human preference learning, from both a safety and a capabilities perspective. If we take the narrower question “should we do human preference learning, or is pretraining + minimal prompt engineering enough?”, I feel confident in the answer you give as Advocate: To the extent prompt engineering works, it’s because it’s preference learning in disguise, and leaning into preference learning (including supervised / RL finetuning) will work much better. Both the theoretical and empirical pictures to date agree with this.
(My sense is that not all ML researchers immediately agree with this / maybe just haven’t considered the question in this frame, but that most researchers are pretty receptive to it and will agree in discussion.)
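To make the “preference learning” in the previous two paragraphs concrete, here’s a minimal sketch of the pairwise-comparison (Bradley-Terry style) loss that reward models are typically fit with: the reward of the response a human preferred is pushed above the reward of the one they rejected. This is my own illustration rather than anything from the post, and the class names, dimensions, and random tensors are all made-up placeholders.

```python
# Minimal sketch of a pairwise-preference (Bradley-Terry) reward-model loss.
# Everything here (RewardModel, the 128-dim embeddings, batch size 32) is an
# illustrative assumption, not a reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: maps an (already-encoded) response embedding to a scalar reward."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def preference_loss(model: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Logistic loss on the reward gap: maximise P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy usage with random embeddings standing in for human-labelled comparison pairs.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = preference_loss(model, chosen, rejected)
loss.backward()
opt.step()
```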
So I think a more challenging Skeptic might say: “Perhaps simple human preference learning is enough, and we can focus all alignment research there. Why do we need the other research directions in the alignment portfolio, like handling inaccessible information, deceptive mesa-optimizers, or interpretability?” Here, “simple” human preference learning refers to something like supervised finetuning (your step 1 for Question 1) + RL finetuning (step 2) + ad hoc ways of making it easier for humans to supervise models (limited versions of step 3).
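For step 2 specifically, here’s a toy sketch of what RL finetuning against a learned reward looks like: a small categorical policy is pushed toward high-reward outputs while a KL penalty keeps it close to the frozen supervised reference policy (the usual KL-shaped-reward trick). Again this is my own illustration with made-up sizes and a random stand-in for the reward model, not anything from the post.

```python
# Toy sketch of RL finetuning against a (fixed, random) reward signal with a KL
# penalty toward a frozen reference policy. Vocabulary size, coefficients, and the
# reward values are all illustrative assumptions.
import torch

torch.manual_seed(0)
vocab = 8                                               # toy action space standing in for token choices
policy_logits = torch.zeros(vocab, requires_grad=True)  # policy being RL-finetuned
ref_logits = torch.zeros(vocab)                         # frozen post-SFT reference policy
reward = torch.randn(vocab)                             # stand-in for a learned reward model
opt = torch.optim.Adam([policy_logits], lr=0.1)
kl_coef = 0.1

for _ in range(200):
    policy = torch.distributions.Categorical(logits=policy_logits)
    ref = torch.distributions.Categorical(logits=ref_logits)
    actions = policy.sample((256,))
    # KL-shaped reward: raw reward minus a penalty for drifting from the reference policy.
    shaped = reward[actions] - kl_coef * (policy.log_prob(actions) - ref.log_prob(actions))
    # REINFORCE: increase log-probability of actions in proportion to their shaped reward.
    loss = -(policy.log_prob(actions) * shaped.detach()).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```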
I again side with Advocate here, but I think making the case is more difficult (and also perhaps requires different arguments for different research directions). I don’t have a response for this as short or convincing as what you have here. My typical response would expand on your points that more capable models will be more dangerous and that alignment might turn out to be very hard, so it’s important to consider these potential difficulties in advance. The hardness claim would probably involve failure stories (along these lines) or more abstract hardness arguments (along these lines).
Agree with what you’ve written here—I think you put it very well.