abramdemski comments on Why is o1 so deceptive?

abramdemski 27 Sep 2024 18:34 UTC
LW: 11 AF: 5
I want to draw a clear distinction between the hypothesis I mention in the OP (your ‘causal’ explanation) and the ‘acausal’ explanation you mention here. You were blending the two together during our conversation, but I think there are important differences.
In particular, I am interested in the question: does o1 display any “backwards causation” from answers to chain-of-thought? Would its training result in chain-of-thought that is optimized to justify the sort of answer it tends to give (e.g., hallucinated URLs)?
This depends on details of the training that we may not have enough information about (plus potentially complex reasoning about the consequences of such reasoning).