Great! I’d love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems because we’re often taking the “thought” metaphor seriously/literally in the label “chain-of-thought”, rather than backwardy/rationalization “reasoning”.
But since it is at least somewhat intelligent/predictive, it can make the move of “acausal collusion” with its own tendency to hallucinate, in generating its “chain”-of-“thought”. That is, the optimization to have the chain-of-thought in correspondence with its output can work in the backwards direction, cohering with bad output instead of leading to better output, à la partial agency.
(Admittedly, human thoughts do a lot of rationalization as well. So maybe the mistake is in taking the directionality implied by “chain” too seriously?)
Maybe this is obvious, but it could become increasingly reckless not to notice when you’re drawing the face of “thoughts” or “chains” on CoT shoggoth-movements. You can be misled into thinking that the shoggoth is less able to deceive than it actually is.
Less obvious but important: in the reverse direction, drawing “hacker faces” on the shoggoth, as in the case of the Docker hack (section 4.2.1), can mislead one into thinking that the shoggoth “wants” to or tends to hack/undermine/power-seek more than it actually, independently, does. It seems at least somewhat relevant that the Docker vulnerability was exploited for a challenge that was explicitly about exploiting vulnerabilities. Even though it was an impressive meta-hack, one must wonder how much of it was cued by the prompt and therefore is zero evidence for an autonomy telos (which is crucial for the deceptive-optimizer story), even though such a telos is mechanistically possible.
(The word “independently” above is important: if it takes human “misuse”/participation to trigger its undermining personas, we might also have more of a continuous shot at pausing/shutdown or even corrigibility.)
I was going to post this as a comment, but there’s also an answer here: I’d say calling o1 “deceptive” could be as misleading as calling it aligned if it outputs loving words.
It has unsteady referentiality, at least from the POV of the meanings of us life-forms. Even though it has some closeness to our meanings and referentiality, the sheer quantity of that unsteadiness can amount to a qualitative difference. Distinguishing “deceptively aligned mesa-optimizer” from “the tentacles of the shoggoth that I find it useful to call ‘words’ don’t work like ‘words’ in some annoying ways” is important, in order to protect some of that (quantitatively-)qualitative difference: both for not dismissing risks and for not hallucinating them.
I want to draw a clear distinction between the hypothesis I mention in the OP (your ‘causal’ explanation) and the ‘acausal’ explanation you mention here. You were blending the two together during our conversation, but I think there are important differences.
In particular, I am interested in the question: does o1 display any “backwards causation” from answers to chain-of-thought? Would its training result in chain-of-thought that is optimized to justify a sort of answer it tends to give (e.g., hallucinated URLs)?
This depends on details of the training which we may not have enough information on (plus potentially complex reasoning about the consequences of such reasoning).
“But since it is at least somewhat intelligent/predictive, it can make the move of ‘acausal collusion’ with its own tendency to hallucinate, in generating its ‘chain’-of-‘thought’.”
I am not understanding what this sentence is trying to say. I understand what an acausal trade is. Could you phrase it more directly?
I cannot see why you require that the model be reasoning acausally in order for it to develop a strategy of deceptively hallucinating citations.
What concrete predictions does the hypothesis that this is an example of “acausal collusion” make?