I haven’t engaged with this much, though I’ve e.g. talked with Evan some about why I’m not as excited about conditioning generative models as a strategy. I’m happy to engage with particular arguments but feel like I don’t really know what argument is being made by the parent (or most of the other places I’ve seen this in passing).
I think there is a simple reason imitation is safer: the model won’t deliberately produce actions that the demonstrator wouldn’t, whereas RLHF may produce actions that are very creative ways to get reward and that may be harmful.
I don’t think this is what people are talking about though (and it wouldn’t work for their broader arguments). I think they are imagining a higher probability of deceptive alignment and other generalization problems.
I don’t think I know the precise articulation of these concerns or the arguments for them.
On the empirics, sometimes people mention this paper and the RLHF’d model behavior “hey do you want to be shut down? --> no” as evidence of a higher probability of deceptive alignment from RLHF. I don’t really think that’s a reasonable interpretation of the evidence but if that’s a large part of the argument people are making I’d be happy to engage on it.