I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
I continue not to get what you’re saying. I entirely agree with the Habryka and Acertain responses to Gwern. Ted Sanders is talking about why GPT models hallucinate, but fails to address the question of why o1 actively encourages itself to hallucinate (rather than, eg, telling the user that it doesn’t know). My most plausible explanation of your position is that you think this is aligned, IE, you think providing plausible URLs is the best it can do when it doesn’t know. I disagree: it can report that it doesn’t know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It’s not just missing a capability; it’s also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user’s best interest in mind).
As I said above, the more likely explanation is that there’s an asymmetry in capabilities that’s causing the results, where just knowing what specific URL the customer wants doesn’t equal the model having the capability to retrieve a working URL, so this is probably at the heart of this behavior.
Capabilities issues and alignment issues are not mutually exclusive. It appears that capabilities issues (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
I would strongly guess that many people could physically locate the cringe pain, particularly if asked when they’re experiencing it.