The RLHF work I’m most excited by, and which constitutes a large fraction of current RLHF work, is focused on getting humans to reward the right thing. I’m particularly excited about approaches that involve model assistance, since that’s the main way in which we can hope for the approach to scale gracefully with model capabilities.
Yeah, I agree that it’s reasonable to think about ways we can provide better feedback, though it’s a hard problem, and there are strong arguments that most approaches that scale locally well do not scale well globally.
However, I do think in practice, the RLHF that has been implemented has mostly been Mechanical Turkers thinking about a problem for a few minutes, or maybe sometimes random people off of the Bountied Rationality Facebook group (which does seem a bit better, but like, not by a ton). We sometimes have provided some model assistance, but I don’t actually know of many setups where we have done something very different, so I don’t think my description of RLHF in practice is “mostly wrong”.
Annoyingly, almost none of the papers and blog posts speak straightforwardly about who they used as the raters (which sure seems like a pretty important piece of information to include), so I might be wrong here. But I had multiple conversations over the years with people who were running RLHF experiments about the difficulties of getting Mechanical Turkers and other people in that reference class to do the right thing and provide useful feedback, so I am confident that at least a substantial chunk of the current research does indeed work that way.
I do think the disagreement here is likely mostly semantics. My guess is we both agree that most research so far has relied on pretty low-context human raters. We also both agree that that very likely won’t scale, and that there is research going on trying to improve rater accuracy and productivity. We probably disagree about how much that research changes the fundamental dynamics of the problem and is actually helpful, and that is somewhat relevant to OP’s question, but my guess is after splitting up the facts this way, there isn’t a lot of the disagreement you called out remaining.
However, I do think in practice, the RLHF that has been implemented has mostly been Mechanical Turkers thinking about a problem for a few minutes
I do not consider this to be accurate. With WebGPT for example, contractors were generally highly educated, usually with an undergraduate degree or higher, were given a 17-page instruction manual, had to pass a manually-checked trial, and spent an average of 10 minutes on each comparison, with the assistance of model-provided citations. This information is all available in Appendix C of the paper.
There is RLHF work that uses lower-quality data, but it tends to be work that is more experimental, because data quality becomes important once you are working on a model that is going to be used in the real world.
Annoyingly, almost none of the papers and blog posts speak straightforwardly about who they used as the raters
There is lots of information about rater selection given in RLHF papers, for example, Appendix B of InstructGPT and Appendix C of WebGPT. What additional information do you consider to be missing?
I do not consider this to be accurate. With WebGPT for example, contractors were generally highly educated, usually with an undergraduate degree or higher, were given a 17-page instruction manual, had to pass a manually-checked trial, and spent an average of 10 minutes on each comparison, with the assistance of model-provided citations.
Sorry, I don’t understand how this is in conflict with what I am saying. Here is the relevant section from your paper:
Our labelers consist of contractors hired either through Upwork, or sourced from Scale AI. [...]
[Some details about selection criteria]
After collecting this data, we selected the labelers who did well on all of these criteria (we performed selections on an anonymized version of the data).
Most Mechanical Turkers also have an undergraduate degree or higher and are often given long instruction manuals, and 10 minutes of thinking clearly qualifies as “thinking about a problem for a few minutes”. Maybe we are having a misunderstanding around the word “problem” in that sentence, where I meant to imply that they spend a few minutes on each datapoint they provide, not, like, the whole overall problem.
Scale AI used to use Mechanical Turkers (though I think they have transitioned towards their own workforce, or at least additionally filter on Mechanical Turkers), and I don’t think it is qualitatively different in any substantial way. Upwork has higher variance, and at least in my experience doing a bunch of survey work, it does not perform better than Mechanical Turk (indeed, my sense was that Mechanical Turk was actually better, though it’s pretty hard to compare).
This is indeed exactly the training setup I was talking about, and sure, I guess you used Scale AI and Upwork instead of Mechanical Turk, but I don’t think anyone would come away with a different impression if I had said “RLHF in practice consists of hiring some random people from Upwork/Scale AI, doing some very basic filtering, giving them a 20-page instruction manual, and then having them think about a problem for a few minutes”.
There is lots of information about rater selection given in RLHF papers, for example, Appendix B of InstructGPT and Appendix C of WebGPT. What additional information do you consider to be missing?
Oh, great! That was actually exactly what I was looking for. I had indeed missed it when looking at a bunch of RLHF papers earlier today. When I wrote my comment, I was looking at the “learning from human preferences” paper, which does not say anything about rater recruitment as far as I can tell.
I would estimate that the difference between “hire some mechanical turkers and have them think for like a few seconds” and the actual data collection process accounts for around 1⁄3 of the effort that went into WebGPT, rising to around 2⁄3 if you include model assistance in the form of citations. So I think that what you wrote gives a misleading impression of the aims and priorities of RLHF work in practice.
I think it’s best to err on the side of not saying things that are false in a literal sense when the distinction is important to other people, even when the distinction isn’t important to you—although I can see why you might not have realized the importance of the distinction to others from reading papers alone, and “a few minutes” is definitely less inaccurate.
Sorry, yeah, I definitely messed up in my comment here, in the sense that after looking at the research, I should have said “spent a few minutes on each datapoint” instead of “a few seconds” (and indeed I noticed myself forgetting that I had said “seconds” instead of “minutes” in the middle of this conversation, which also indicates I am a bit triggered and doing an amount of rhetorical thinking and weaseling that I think is pretty harmful; I apologize for kind of sliding between seconds and minutes in my last two comments).
I think the two orders of magnitude of difference in evaluation time here is important, and though I don’t think it changes my overall answer very much, I do agree with you that it’s quite important not to state literal falsehoods, especially when I am aware that other people care about the details here.
I do think the distinction between Mechanical Turkers and Scale AI/Upwork is pretty minimal, and I think what I said in that respect is fine. I don’t think the people you used were much better educated than the average Mechanical Turker, though I do think one update most people should make here is towards “most Mechanical Turkers are actually well-educated Americans”, and I do think there is something slightly rhetorically tricky going on when I just say “random Mechanical Turkers”, whom I think people might misclassify as being less educated and smart than they actually are.
I do think a revised summary sentence “most RLHF as currently practiced is mostly just Mechanical Turkers with like half an hour of training and a reward button thinking about each datapoint for a few minutes” seems accurate to me, and feels like an important thing to understand when thinking about the question of “why doesn’t RLHF just solve AI Alignment?”.
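As a side note for readers less familiar with the mechanics being discussed: the pairwise comparisons collected from raters (whether Mechanical Turkers or Upwork/Scale AI contractors) are typically used to fit a reward model, which the policy is then optimized against. The sketch below is a generic, minimal illustration of that comparison-to-reward-model step using a standard Bradley-Terry-style loss; it is not taken from WebGPT, InstructGPT, or any other paper mentioned here, and the tiny model and random embeddings are hypothetical stand-ins.

```python
# A minimal, generic sketch (not from any specific paper) of how pairwise
# rater comparisons are typically turned into a reward model in RLHF pipelines.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyRewardModel(nn.Module):
    """Toy stand-in for a language-model-based scorer: maps a fixed-size
    embedding of a (prompt, completion) pair to a scalar reward."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the model to score the rater-preferred completion higher."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Tiny usage example with random embeddings standing in for real
# (prompt, completion) representations and real rater comparisons.
model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
```

The relevant point for this thread is that everything downstream of the raters is ordinary supervised learning on their comparisons, which is why who the raters are and how long they spend on each datapoint matters so much.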