nostalgebraist comments on the case for CoT unfaithfulness is overstated

nostalgebraist Oct 11, 2024, 1:30 AM
14 points
2
Thanks for the links!
I was pleased to see OpenAI reference this in their justification for why they aren’t letting users see o1′s CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).
As it happens, the decision to hide o1 CoTs was one of the main things that motivated me to write this post. Or rather, the muted reaction to it / lack of heated debate about it.
The way I see things, the ability to read CoTs (and more generally “the fact that all LLM sampling happens in plain sight”) is a huge plus for both alignment and capabilities – it’s a novel way for powerful AI to be useful and (potentially) safe that people hadn’t even really conceived of before LLMs existed, but which we now held in our hands.
So when I saw that o1 CoTs would be hidden, that felt like a turning point, a step down a very bad road that we didn’t have to choose.
(Like, remember those Anthropic deception papers that had a hidden scratchpad, and justified it by saying it was modeling a scenario where the model had learned to do similar reasoning inside a forward pass and/or steganographically? At the time I was like, “yeah, okay, obviously CoTs can’t be hidden in real life, but we’re trying to model those other situations, and I guess this is the best we can do.”
I never imagined that OpenAI would just come out and say “at long last, we’ve built the Hidden Scratchpad from Evan Hubinger’s sci-fi classic Don’t Build The Hidden Scratchpad”!)
Although I saw some people expressing frustration about the choice to hide o1 CoTs, it didn’t seem like other people were reacting with the intensity I’d expect if they shared my views. And I thought, hmm, well, maybe everyone’s just written off CoTs as inherently deceptive at this point, and that’s why they don’t care. And then I wrote this post.
(That said, I think I understand why OpenAI is doing it – some mixture of concern about people training about the CoTs, and/or being actually concerned about degraded faithfulness while being organizationally incapable of showing anything to users unless they put that thing under pressure look nice and “safe” in a way that could degrade faithfulness. I think the latter could happen even without a true feedback loop where the CoTs are trained on feedback from actual users, so long as they’re trained to comply with “what OpenAI thinks users like” even in a one-time, “offline” manner.
But then at that point, you have to ask: okay, maybe it’s faithful, but at what cost? And how would we even know? If the users aren’t reading the CoTs, then no one is going to read them the vast majority of the time. It’s not like OpenAI is going to have teams of people monitoring this stuff at scale.)
- Seth Herd Oct 20, 2024, 6:43 PM
  3 points
  0
  Parent
  My guess was that the primary reason OAI doesn’t show the scratchpad/CoT is to prevent competitors from training on those CoTs and replicating much of o1s abilities without spending time and compute on the RL process itself.
  
  But now that you mention it, their not wanting to show the whole CoT when it’s not necessarily nice or aligned in itself. I guess it’s like you wouldn’t want someone reading your thoughts even if you intended to be mostly helpful to them.