I used to be so bullish on CoT when I first heard about it, both for capabilities and alignment, but now I just hate it so fucking much...
We already know that even pre-RL autoregression is “unfaithful”, which doesn’t really seem like the right word to me for the fact that the simplest way for whatever architecture you’re working with to get to the right answer is, almost by necessity, not exactly what the tokens spell out.
It makes no sense that we now expect gradient descent on a trillion bf16s to correspond, to any real extent, to a faithful rendition of the however many thousand words it uses as the basis for its computation. The chain of thought is just a scratchpad for the model to do its own thing. If we do what OpenAI says and just don’t do gradient descent on the internal chain of thought, then it’s just a worse scratchpad.
That we expect these things, which don’t operate like humans in any fashion, to succeed while expressing their thoughts (1) faithfully and (2) in English seems so dumb to me that I’d much rather drop the pretense of getting from (2) to (1) and go fully latent, in some sort of structure that neatly exposes cause and effect.
“...just don’t do gradient descent on the internal chain of thought, then it’s just a worse scratchpad.”
This seems like a misunderstanding. When OpenAI and others talk about not optimising the chain of thought, they mean not optimising it for looking nice. That still means optimising it for its contribution to the final answer, i.e. for being the best scratchpad it can be (that’s the whole paradigm).
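To make the distinction concrete, here is a minimal, hypothetical sketch (plain Python; every name is illustrative, not anyone’s actual training code) of the difference between rewarding only the outcome and adding a “looks nice” term on the chain of thought:

```python
# Hypothetical sketch of the reward a policy-gradient step would use.
# In the outcome-only case the reward never inspects the CoT tokens,
# but the CoT is still shaped indirectly, because it is part of the
# sampled trajectory whose log-probabilities get reinforced.

def outcome_only_reward(final_answer, reference_answer, cot_tokens):
    # The CoT matters only through whatever answer it led to.
    return 1.0 if final_answer == reference_answer else 0.0

def cot_supervised_reward(final_answer, reference_answer, cot_tokens,
                          niceness_score, alpha=0.1):
    # The variant being argued against: an extra term that rewards the
    # CoT for *looking* clear/benign, which pressures the model to
    # present a nice scratchpad rather than use it freely.
    outcome = 1.0 if final_answer == reference_answer else 0.0
    return outcome + alpha * niceness_score(cot_tokens)
```

In the first case the CoT is still optimised by training, but only via whichever answers it happens to produce; in the second, the scratchpad itself is being selected for presentation.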
I see. I think the rest of my point still stands: as RL becomes more powerful, what the model says it thinks and what it actually thinks will naturally diverge even if we don’t pressure it to, and the best way to avoid this is to have it represent its thoughts in an intermediate format that it’s more computationally bound to. My first guess would be that going harder on discrete search, or more generally on something with small computational depth and massive breadth, would be a massive alignment win at near-ASI performance; even if we end up with problems like adverse selection, they will be a lot easier to work through.
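One speculative way to read “small computational depth, massive breadth” is beam search over an explicit, discrete intermediate representation. A toy sketch, assuming hypothetical `expand` and `score` models supplied by the caller:

```python
# Toy sketch of shallow-depth, wide-breadth discrete search over an
# explicit intermediate representation. Every state the system commits
# to is a discrete, inspectable object rather than an opaque latent.

from typing import Callable, List, Tuple

State = Tuple[str, ...]  # e.g. a sequence of discrete reasoning steps

def discrete_search(
    initial: State,
    expand: Callable[[State], List[State]],  # proposes candidate next states
    score: Callable[[State], float],         # scores whole states, not activations
    beam_width: int = 1000,                  # massive breadth
    depth: int = 4,                          # small computational depth
) -> State:
    frontier = [initial]
    for _ in range(depth):
        candidates = [nxt for state in frontier for nxt in expand(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)
```

The hoped-for property is that the “thoughts” are bound to the states the search commits to, which you can read directly, modulo whatever `expand` and `score` compute internally.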
I think it may or may not diverge from meaningful natural language in the next couple of years, and importantly I think we’ll be able to roughly tell whether it has. So I think we should just see (although finding other formats for interpretable autoregression could be good too).