I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
I continue not to get what you’re saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn’t know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn’t know. I disagree: it can report that it doesn’t know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It’s not just missing a capability; it’s also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user’s best interest in mind).
As I said above, the more likely explanation is an asymmetry in capabilities: knowing which specific URL the customer wants is not the same as being able to retrieve a working URL from memory. That asymmetry is probably at the heart of this behavior.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn’t know. I disagree: it can report that it doesn’t know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It’s not just missing a capability; it’s also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user’s best interest in mind).
I agree that this is actually a small sign of misalignment, and that o1 should probably fail more visibly by admitting its ignorance rather than making things up. So at this point, I’ve come to agree that o1’s training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
This is a somewhat plausible problem, and I suspect the general class of solutions to it will require something along the lines of making the AI fail visibly rather than invisibly.
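As a purely hypothetical sketch of what "failing visibly" could mean at the application layer: post-check every URL the model emits and replace anything unverifiable with an explicit admission of ignorance, instead of passing a plausible-looking fabrication to the user. The function and names here are invented for illustration; the checker is injectable (an HTTP request in production, a stub below), and nothing about this reflects how o1 actually works.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def fail_visibly(answer: str, link_checker) -> str:
    """Post-check every URL in a model answer.

    `link_checker(url) -> bool` is any validator (e.g. an HTTP HEAD
    request in production, a stub in tests). Unverifiable links are
    replaced with an explicit admission rather than shown as real.
    """
    def replace(match):
        url = match.group(0)
        if link_checker(url):
            return url
        return "[unverified link removed: I could not confirm this URL exists]"
    return URL_PATTERN.sub(replace, answer)

# Example with a stub checker that only "knows" one real URL:
known = {"https://example.com/docs"}
checked = fail_visibly(
    "See https://example.com/docs and https://example.com/made-up-page",
    link_checker=lambda u: u in known,
)
```

The design point is simply that the failure is surfaced to the user rather than hidden inside a plausible-looking answer; whether such an outer check addresses the underlying training incentive is a separate question.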
The basic reason I wanted to distinguish them is that a capabilities issue gives AI labs a strong incentive to solve it, whereas alignment incentives, while real, can be weaker than the general public would like.