I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don’t find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between it’s capability to retrieve links and it’s ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximately linking.)
What do you mean? I don’t get what you are saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):
0.8% of o1-preview’s responses got flagged as being ‘deceptive’ [...] roughly two thirds of which appear to be intentional (0.38%), meaning that there was some evidence in the chain of thought that o1-preview was aware that the answer was incorrect [...] Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead. [...] It is encouraging that, in the analysis presented below, while our monitor did find a few forms of the model knowingly presenting incorrect information to the user or omitting important information, it did not find any instances of o1-preview purposely trying to deceive the user for reasons other than satisfying the user request.
Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
No, I don’t think so.
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
More importantly, OpenAI’s overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
What do you mean? I don’t get what you are saying is convincing.
I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think 1 thing blunts the blow a lot here:
As I said above, the more likely explanation is that there’s an asymmetry in capabilities that’s causing the results, where just knowing what specific URL the customer wants doesn’t equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
So for now, what I suspect is that o1′s safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn’t satisfy the user’s request.
More importantly, OpenAI’s overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
Yeah, this is the biggest issue of OpenAI for me, in that they aren’t trying to steer too hard against deception.
I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
I continue not to get what you’re saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn’t know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn’t know. I disagree: it can report that it doesn’t know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It’s not just missing a capability; it’s also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user’s best interest in mind).
As I said above, the more likely explanation is that there’s an asymmetry in capabilities that’s causing the results, where just knowing what specific URL the customer wants doesn’t equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn’t know. I disagree: it can report that it doesn’t know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It’s not just missing a capability; it’s also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user’s best interest in mind.
I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting it’s ignorance rather than making stuff up, so at this point, I’ve come to agree that o1′s training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
This is a somewhat plausible problem, and I suspect the general class of solutions to these problems will probably require something along the lines of making the AI fail visibly rather than invisibly.
The basic reason I wanted to distinguish them is because the implications of a capabilities issue mean that the AI labs are incentivized to solve it greatly, and while alignment incentives are real and do exist, they can be less powerful than the general public wants.
What do you mean? I don’t get what you are saying is convincing.
Perhaps I should clarify my belief.
The o1 report says the following (emphasis mine):
Is this damning in the sense of showing that o1 is a significant danger, which needs to be shut down for public safety?
No, I don’t think so.
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety?
This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
OpenAI should be commended for looking for deception in this way, and for publishing what they have about the deception they uncovered.
However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
More importantly, OpenAI’s overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think 1 thing blunts the blow a lot here:
As I said above, the more likely explanation is that there’s an asymmetry in capabilities that’s causing the results, where just knowing what specific URL the customer wants doesn’t equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
So for now, what I suspect is that o1′s safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn’t satisfy the user’s request.
Yeah, this is the biggest issue of OpenAI for me, in that they aren’t trying to steer too hard against deception.
I continue not to get what you’re saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn’t know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn’t know. I disagree: it can report that it doesn’t know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It’s not just missing a capability; it’s also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user’s best interest in mind).
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting it’s ignorance rather than making stuff up, so at this point, I’ve come to agree that o1′s training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
This is a somewhat plausible problem, and I suspect the general class of solutions to these problems will probably require something along the lines of making the AI fail visibly rather than invisibly.
The basic reason I wanted to distinguish them is because the implications of a capabilities issue mean that the AI labs are incentivized to solve it greatly, and while alignment incentives are real and do exist, they can be less powerful than the general public wants.