I notice that o1's behavior (its cognitive process) looks suspiciously like human behaviors:
Cognitive dissonance: o1 might fabricate or rationalize in order to maintain internal consistency in the face of conflicting data (which implies there is an underlying inconsistency).
Impression management / self-serving bias: o1 may attempt to appear knowledgeable or competent, leading to overconfidence, because it is rewarded for how its answers look more than for what they contain (which means the model is stronger than the feedback).
But why does this happen more with o1, which can reason better than previous models? Shouldn't that give it more ways to catch its own deception?
No:
Overconfidence in plausibility: With enhanced reasoning, o1 can generate more sophisticated explanations or justifications even when it is wrong. o1 “feels” more capable and thus may trust its own reasoning more, producing more confident errors (“feels” in the sense of expecting to be able to generate explanations that will be rewarded as good).
Lack of ground truth: Advanced reasoning doesn’t guarantee access to verification mechanisms. o1 is rewarded for producing convincing responses, not necessarily for ensuring accuracy, so better reasoning can increase the capability to “rationalize” rather than to self-correct (a toy sketch of this reward mismatch appears below).
Complexity in mistakes: Higher reasoning allows more complex thought processes, potentially leading to mistakes that are harder to identify or self-correct.
Most of this is analogous to how more intelligent people (“intellectuals”) can generate elaborate, convincing, but incorrect explanations whose flaws less intelligent participants cannot detect (they may still suspect something is off but can’t prove it).
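As a loose illustration of the reward mismatch mentioned above (a toy sketch with made-up numbers, not o1's actual training objective): if the signal being optimized scores how convincing an answer looks rather than whether it is correct, picking the highest-scoring answer favors confident-but-wrong outputs.

```python
# Toy illustration only: hypothetical candidate answers with made-up
# "convincingness" scores and correctness labels.
candidates = [
    ("Detailed, confident, but wrong explanation", 0.9, False),
    ("Hedged, partially correct explanation",      0.6, True),
    ("An unsure, honest answer",                   0.2, True),
]

def proxy_reward(convincingness, correct):
    # What gets optimized in this toy: appearance only, no ground truth.
    return convincingness

def ideal_reward(convincingness, correct):
    # What we would want the signal to track: correctness.
    return 1.0 if correct else 0.0

best_by_proxy = max(candidates, key=lambda c: proxy_reward(c[1], c[2]))
best_by_ideal = max(candidates, key=lambda c: ideal_reward(c[1], c[2]))

print("Proxy reward selects:", best_by_proxy[0])  # the confident wrong answer
print("Ideal reward selects:", best_by_ideal[0])  # a correct answer
```

The point is only that stronger reasoning raises the “convincingness” column without touching the correctness column, so the selected answer becomes more persuasive, not more accurate.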
To me, this comparison to humans doesn’t explain why o1’s training ended up producing this result.
Convergence. Humans and LLMs that deliberate are doing the same kind of thing, and they end up making the same class of errors.