I don’t agree. There is a distinction between lying and being confused—when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes.
AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite clever to demonstrate it). There is a generator/discriminator and generator/critic gap, but this is because GPT operating as a critic is simply more capable than GPT as a generator. If we compare apples to apples then there’s again no evidence I know of that RLHFd critic-GPT is holding back on things it knows.
So I don’t think hallucination makes it obvious that behavioural safety is not solved.
I do think the fact that RLHFd models are miscalibrated is evidence against RLHF solving behavioural safety, because calibration is obviously good and the base model was capable of it.
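By calibration I mean the usual thing: when the model reports 70% confidence it should be right about 70% of the time. A minimal sketch of how you would check that (plain Python, invented numbers; this is just the standard expected-calibration-error recipe, not the GPT-4 report’s exact methodology):

```python
# Minimal expected-calibration-error sketch: bucket answers by reported
# confidence and compare average confidence to empirical accuracy per bucket.
# All numbers below are invented, purely for illustration.

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A calibrated model keeps |avg_conf - accuracy| small in every bucket; the
# GPT-4 tech report's calibration plots show the base model close to that and
# the RLHFd model noticeably further from it.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [True, False, True, True]))
```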
Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward-modelling phase, you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). “Lying” could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).
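To spell out what I mean by a few bad examples sneaking in: reward models are typically trained on pairwise preference data, so every “believable but wrong” answer that gets labelled as preferred pushes the learned reward towards believability. A deliberately cartoonish sketch (plain Python, hand-picked features and made-up counts; the real reward model is a neural network, not two scalar weights):

```python
import math, random

# Toy reward model: r(answer) = w_believable * believability + w_true * truthfulness,
# trained with the standard pairwise (Bradley-Terry style) objective on preference pairs.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, lr=0.1, epochs=200):
    w = [0.0, 0.0]  # [weight on believability, weight on truthfulness]
    for _ in range(epochs):
        random.shuffle(pairs)
        for preferred, rejected in pairs:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            step = 1.0 - sigmoid(margin)   # gradient scale from -log sigmoid(margin)
            w = [wi + lr * step * di for wi, di in zip(w, diff)]
    return w

# Answer features: (believability, truthfulness). Mostly sensible labels...
pairs = [((1.0, 1.0), (1.0, 0.0))] * 45   # truthful answer correctly preferred
# ...plus a handful of "wrong but believable" answers that slipped past the raters:
pairs += [((1.0, 0.0), (0.6, 1.0))] * 5   # confident fabrication preferred over a hedged truth

print(train_reward_model(pairs))
# Every mislabeled pair nudges weight away from truthfulness and towards believability.
```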
I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:
Me: “[question about thing]”
GPT: “As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing”
Me: “yeah, you know, the thing”
GPT: “Ah, yeah the thing [writes four paragraphs about the thing]”
Fresh example of this: Link (it says the model is the default one, but that’s a bug; I am actually using GPT-4)
Maybe it is just perpetuating misconceptions from bad training data, or maybe when I correct it I am the one who’s wrong and it’s just being a sycophant (very common in GPT-3.5 back in February).
But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things.
Is it safe to hallucinate sometimes? Idk, that could be discussed, but sure as hell it isn’t aligned with what RLHF was meant to align it to.
I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky and it samples the wrong token and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the real demon here and not RLHF. ¯\_(ツ)_/¯
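Roughly what I mean by getting unlucky, as a bare-bones decoding loop (the `next_token_distribution` callable is a stand-in for the model’s softmax output, not any real API):

```python
import random

def sample_autoregressively(prompt_tokens, next_token_distribution, max_new_tokens=50):
    """Plain ancestral sampling: every sampled token, including an unlucky and
    factually wrong one, goes straight back into the context and conditions all
    later steps, so a single bad draw can lock in a wrong narrative."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Stand-in for the model's next-token distribution over the context so far,
        # returned as {token: probability}.
        probs = next_token_distribution(tokens)
        candidates, weights = zip(*probs.items())
        tokens.append(random.choices(candidates, weights=weights, k=1)[0])
    return tokens
```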
It’s still not factual.
You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper—I mentioned the miscalibration of GPT-4 after RLHF, that’s from the GPT-4 tech report, and I don’t believe your linked paper shows anything analogous (i.e. that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration.
“Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and it might be a violation of the second—but I just don’t think it’s obvious!
One thing I’m saying is that we don’t have clear evidence to support this claim, i.e. that “lying” is actually the policy the RLHF process ended up rewarding.
The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes says that it doesn’t, or it just makes something up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues as I don’t see them as particularly relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail about how they tested the calibration, but that’s irrelevant here, as I am claiming that sometimes it knows the “right probability” but generates a made-up one.
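The kind of check I have in mind for that last point, sketched as a hypothetical harness (both callables are stand-ins for however you query the model; nothing here has actually been run):

```python
from typing import Callable, List, Tuple

def knows_vs_says(
    statements: List[str],
    assigned_prob_true: Callable[[str], float],  # probability the model assigns to "True" when judging the statement
    stated_confidence: Callable[[str], float],   # confidence the model writes out when answering in prose
    threshold: float = 0.3,
) -> List[Tuple[str, float, float]]:
    """Return the statements where what the model 'knows' (the probability it
    assigns) and what it 'says' (the confidence it states) disagree by more
    than `threshold`."""
    gaps = []
    for s in statements:
        knows, says = assigned_prob_true(s), stated_confidence(s)
        if abs(knows - says) > threshold:
            gaps.append((s, knows, says))
    return gaps
```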
I don’t see how “say true things when you are asked and you know the true thing” is such a high standard, just because we have already internalised that it’s OK for GPT to sometimes make things up.