Disclaimer: This comment was written as part of my application process to become an intern supervised by the author of this post.
Potential uses of the post:
This post is an excellent summary, and I think it has great potential for several purposes, in particular as part of a sequence on RLHF. It is a good introduction for many reasons:
It’s very useful to have lists like this one, easily accessible, to serve as reminders or pointers when discussing with other people.
For people trying to understand RLHF, it can provide the minimum information needed to quickly prioritize what to learn about.
It can be used to generate research ideas (“which of these problems could I solve?”) or to quickly check that an idea is not promising (“it looks fancy, but it does not actually help against this problem”).
It can be used as a gateway to more in-depth articles. To that end, I would really appreciate it if you added links for each point, or mentioned when you are not aware of any specific article on the subject.
Meta-level criticism:
If it is taken as an introduction to RLHF risks, you should make clear where this list is exhaustive (to the best of your knowledge); this will allow readers who are aware it isn’t to easily propose additions. To facilitate its improvement, you should also make explicit calls to the reader to point out where you suspect the post might fail. In particular, there may be a class of readers who are experts on a specific problem with RLHF not listed here and who come only to get a glimpse of related failure modes; they should be encouraged to contribute.
As Daniel Kokotajlo and trevor have pointed out, the main value of this post is to provide an easy way to learn more about the problems with RLHF (as opposed to e.g. LOL, which tries to be an insightful, comprehensive compilation on its own), thanks to its format and organization.
The epistemic status of each point is unclear, which I think is a big issue. You give your thoughts after each section, but there is no systematic evaluation. For each point, you should separate:
your opinion,
its severity,
its likelihood,
whether we have empirical evidence, theoretical evidence, or only abstract reasons to expect it to happen.
This has not been done systematically, and it could be organized more clearly.
More specific criticism:
I am unsatisfied with how 7) is described. It is not a problem on the same level as the others; it is more the destruction of a quality that fortunately seems to appear by default in GPTs. It could use a more in-depth explanation, especially since the linked article is mostly speculation.
I also think 11) belongs to this category of ‘not quite a problem’, because it is not obvious that direct human feedback would be better than learning a model of it. Maybe the easiest way to predict whether humans will notice misalignment is to have a fully general model of what it means to be misaligned? Unlikely, but it deserves a longer discussion.
9) is another point that requires a longer discussion. Since it seems to be your own work, maybe you could write an article and link to it? What are the costs of RLHF (money and manpower), and how do they compare to scaling laws? Maybe it’s an issue… but maybe not. Data is needed here.
Bringing up the Strawberry Problem is a bit unfair, because RLHF was never meant to solve it. Not only is it unsurprising that RLHF provides little insight into the Strawberry Problem, I also don’t expect a solution to the Strawberry Problem to relate to RLHF at all; it seems like a different paradigm altogether.
More generally, RLHF is exactly the kind of method a security mindset warns against: an ad hoc method that, as far as I know, provides no theoretical guarantee of working at all. The issues with superficial alignment and the inability to generalize alignment under distributional shift are related to that.
Why would we have any a priori reason to expect good behavior from RLHF? In the first section, you give empirical reasons to count RLHF as progress, but a discussion of the reasons RLHF was even considered in the first place is noticeably lacking. To be honest, I am very surprised there is no mention of that. Did OpenAI not disclose how they came up with RLHF? Did they imagine the process at random and it just happened to work?
In conclusion, I believe there is a strong need for this kind of post, but that it could be polished further for the potential purposes proposed above.