I dunno, I’d agree with L.A. Paul here. There’s a difference between cases where you’re not sure whether it’s A, B, or C, and cases where A, B, and C are all valid outcomes, and you’re doing something more like calling one of them into existence by picking it.
The first cases are ones where the AI doesn’t know what’s right, but to the human it’s obvious and uncomplicated which is right. The second cases are ones where human preferences are underdetermined: there are multiple ways we could be in the future that are all acceptably compatible with how we’ve been up until now.
I think models that treat the thing they’re learning as entirely the first sort of thing will do fine on the obvious and uncomplicated questions, but would learn to resolve questions of the second type using processes we wouldn’t approve of.
This is a broader criticism of alignment to preferences or intent in general, since these things can change (and sometimes, you can even make choices of whether to change them or not). L.A. Paul wrote a whole book about this sort of thing; if you’re interested, here’s a good talk.
Yes, I was deliberately phrasing things sort of like transformative experiences :P