habryka comments on Don’t you think RLHF solves outer alignment?

habryka Nov 6, 2022, 2:59 AM
Sorry, yeah, I definitely messed up in my comment here. After looking at the research, I should have said “spent a few minutes on each datapoint” instead of “a few seconds.” Indeed, partway through this conversation I noticed I had forgotten that I’d said “seconds” rather than “minutes.” That also suggests I am a bit triggered and am doing some rhetorical thinking and weaseling that I think is pretty harmful. I apologize for sliding between seconds and minutes in my last two comments.

I think the two-orders-of-magnitude difference in evaluation time here is important. Though I don’t think it changes my overall answer very much, I agree with you that it’s quite important not to state literal falsehoods, especially when I know other people care about the details.

I do think the distinction between Mechanical Turkers and Scale AI/Upwork workers is pretty minimal, and I think what I said in that respect is fine. I don’t think the people you used were much better educated than the average Mechanical Turker. That said, I think one update most people should make here is toward “most Mechanical Turkers are actually well-educated Americans.” There is also something slightly rhetorically tricky about my saying “random Mechanical Turkers,” since people might read that as describing workers who are less educated and less smart than they actually are.

I do think a revised summary sentence, “most RLHF as currently practiced is mostly just Mechanical Turkers with like half an hour of training and a reward button thinking about each datapoint for a few minutes,” seems accurate to me. It also feels like an important thing to understand when thinking about the question “why doesn’t RLHF just solve AI Alignment?”