Curated. All of these examples together really point quite clearly at a change in how language models behave when they’re trained on RLHF, away from the “accurately predict text” story toward something else that has a very different set of biases — I am interested to read your potential follow-up with your own hypotheses. Plus, the post is really fun to read.
I think this post and your prior post both overstate the case in some ways, but they’re still great additions to my thinking on this subject, and I expect to many others’ as well. I broadly feel like I’ve been ‘needing’ posts like these on LW about current ML projects — ones that give a grounded conceptual account of how to think about what’s happening — and nobody else has been writing them, so I’m very grateful.
Edit: Welp, turns out this is not true of RLHF. Alas. My guess is that I wouldn’t have curated it had I known this at the time. (Though it’s still an interesting post.)