I dunno, I’d agree with L.A. Paul here. There’s a difference between cases where you’re not sure whether it’s A, B, or C, and cases where A, B, and C are all valid outcomes, and you’re doing something more like calling one of them into existence by picking it.
The first cases are ones where the AI doesn’t know what’s right, but to the human it’s obvious and uncomplicated which is right. The second cases are ones where human preferences are underdetermined: there are multiple ways we could be in the future that are all acceptably compatible with how we’ve been up until now.
I think models that treat the thing they’re learning as entirely the first sort of thing will do fine on the obvious and uncomplicated questions, but would learn to resolve questions of the second type using processes we wouldn’t approve of.
This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you’re interested, here’s a good talk.
Yes, I was deliberately phrasing things sort of like transformative experiences :P