o1's reasoning traces being much terser (sometimes to the point of incomprehensibility) seems predicted by doing gradient updates based on the quality of the final output without letting the raters see the reasoning traces: the optimization pressure on the cognition behind the traces is then almost entirely in the direction of performance, as opposed to human-readability.
In the short term this might be good news for the “faithfulness” of those traces, but what they're faithful to is the model’s ontology (hence less human-readable); see e.g. here and here.
In the long term, if you keep doing pretraining on model-generated traces, you might rapidly find yourself in steganography-land: the pretraining bakes the previously externalized cognition into capabilities that the model can deploy in a single forward pass, and anything it still externalizes as part of its own chain of thought will be much more compressed (and more alien) than what we see now.
Just checking—you are aware that the reasoning traces shown in the UI are a summarized version of the full reasoning trace (which is not user accessible)?
See “Hiding the Chains of Thought” here.
Yeah, I meant terser compared to typical RLHF’d output from e.g. 4o. (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).
What is the source of this observation?
See here.
One idea is to use a base LLM with no RLHF, compute the perplexity of the reasoning text, and add it as an additional term in the loss function. That should help with comprehensibility, but it doesn’t necessarily help with steganography. To disincentivize steganography, you could add noise to the reasoning in various ways, and remove any incentive for terseness, to ensure the model isn’t trying to squeeze more communication into a limited token budget.
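To make the perplexity term concrete, here is a minimal sketch. It assumes you already have per-token log-probabilities of the reasoning text under some frozen base model; the function names, the `weight` parameter, and the choice to add raw perplexity (rather than, say, its log) are my own illustrative assumptions, not anything a lab is known to use.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of the reasoning text: exp of the mean negative log-probability.

    `token_logprobs` are natural-log probabilities that a frozen base model
    (hypothetical here) assigns to each token of the reasoning trace.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def penalized_loss(task_loss: float,
                   token_logprobs: Sequence[float],
                   weight: float = 0.1) -> float:
    """Task loss plus a readability penalty (assumed form: weight * perplexity)."""
    return task_loss + weight * perplexity(token_logprobs)


if __name__ == "__main__":
    # Four reasoning tokens, each given probability 0.5 by the base model.
    logprobs = [math.log(0.5)] * 4
    print(perplexity(logprobs))
    print(penalized_loss(1.0, logprobs, weight=0.1))
```

A trace the base model finds unsurprising gets a small penalty; a trace full of tokens the base model considers unlikely gets a large one. As noted above, this only pushes toward text that looks ordinary, which is not the same as text without hidden content.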
A basic idea for detecting steganography is to monitor next-token probabilities for synonym pairs. If they tend to diverge over time, that suggests the model is assigning additional hidden meaning. (This might fail if the synonym embeddings are too close.)
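A rough sketch of the synonym-pair monitor, under the assumption that "over time" means across training checkpoints and that you can read off the model's next-token probability for each synonym in a fixed context. The pair names, the log-ratio drift measure, and the threshold are all hypothetical choices for illustration.

```python
import math
from typing import List, Mapping, Sequence, Tuple

# One (p_a, p_b) entry per checkpoint: the model's next-token probabilities
# for synonym A and synonym B in the same fixed context.
History = Sequence[Tuple[float, float]]


def log_ratio(p_a: float, p_b: float, eps: float = 1e-12) -> float:
    """Log of p_a / p_b, with a small epsilon to avoid dividing by zero."""
    return math.log((p_a + eps) / (p_b + eps))


def synonym_drift(history: History) -> float:
    """Absolute change in the log-ratio between the first and last checkpoint."""
    if len(history) < 2:
        raise ValueError("need at least two checkpoints")
    return abs(log_ratio(*history[-1]) - log_ratio(*history[0]))


def flag_pairs(histories: Mapping[str, History], threshold: float = 1.0) -> List[str]:
    """Names of synonym pairs whose relative preference drifted past the threshold."""
    return sorted(name for name, h in histories.items()
                  if synonym_drift(h) > threshold)


if __name__ == "__main__":
    histories = {
        "big/large": [(0.30, 0.30), (0.31, 0.29)],   # stays interchangeable
        "thus/hence": [(0.20, 0.20), (0.39, 0.01)],  # one synonym takes over
    }
    print(flag_pairs(histories))
```

A flagged pair is only a hint: the preference could also drift for boring stylistic reasons, so this is a tripwire for closer inspection rather than a detector.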
My preferred approach to CoT would be something like:
1. Give human raters the task of next-token prediction on a large text corpus. Have them write out their internal monologue when trying to predict the next word in a sentence.
2. Train a model to predict the internal monologue of a human rater, conditional on previous tokens.
3. Train a second model to predict the next token in the corpus, conditional on previous tokens in the corpus and also the written internal monologue.
4. Only combine the above two models in production.
Now that you’ve embedded CoT in the base model, maybe it will be powerful enough that you can discard RLHF, and replace it with some sort of fine-tuning on PhDs roleplaying as a helpful/honest/harmless chatbot.
Basically give the base model a sort of “working memory” that’s incentivized for maximal human imitativeness and interpretability. Then you could build an interface where a person can mouse over any word in a sentence and see what the model was ‘thinking’ when it chose that word. (Realistically you wouldn’t do this for every word in a sentence, just the trickier ones.)
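Here is a sketch of how the two models might be combined at inference time, including the per-token record that a mouse-over interface would read from. Both models are stand-in callables; `toy_monologue` and `toy_token` are hypothetical stubs so the loop can run, not real trained models.

```python
from typing import Callable, List, Tuple

# Model 1: previous tokens -> predicted human-style internal monologue.
MonologueModel = Callable[[List[str]], str]
# Model 2: previous tokens + monologue -> next token.
TokenModel = Callable[[List[str], str], str]


def generate(prefix: List[str],
             monologue_model: MonologueModel,
             token_model: TokenModel,
             max_new_tokens: int) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Alternate the two models; keep (token, monologue) pairs for inspection."""
    tokens = list(prefix)
    trace: List[Tuple[str, str]] = []
    for _ in range(max_new_tokens):
        thought = monologue_model(tokens)        # "working memory" step
        next_token = token_model(tokens, thought)
        tokens.append(next_token)
        trace.append((next_token, thought))      # what a mouse-over would show
    return tokens, trace


def toy_monologue(tokens: List[str]) -> str:
    """Stub monologue model: always expects a noun after the last token."""
    return f"after '{tokens[-1]}' a noun seems likely" if tokens else "start of text"


def toy_token(tokens: List[str], thought: str) -> str:
    """Stub token model: picks a token based only on the monologue text."""
    return "cat" if "noun" in thought else "the"


if __name__ == "__main__":
    tokens, trace = generate(["the"], toy_monologue, toy_token, max_new_tokens=2)
    print(tokens)
    for token, thought in trace:
        print(f"{token!r}: {thought}")
```

The point of keeping `trace` separate from `tokens` is that the monologue never has to be fed back into the visible text; it is only consumed by the second model and by whoever inspects the output.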
What did you consider incomprehensible? I agree the CoT has a very… distinct… character, but I’d call it “inefficient” rather than “incomprehensible”. All the moves it made when solving the cipher puzzle or the polynomial problem made sense to me. Did I overlook something?
Yeah, this is a good point. If I were designing a setup where humans gave feedback on a CoT-type system, I’d want to sometimes not show the human the full reasoning trace or the answer, and just have them rate whether the reasoning traces so far seem to be on the right track.
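One way to sketch the "sometimes show only part of the trace" idea: with some probability, cut the list of reasoning steps at a random point so the rater sees neither the rest of the trace nor the answer. The function name, the `p_partial` parameter, and the assumption that the final answer is the last step are all mine.

```python
import random
from typing import List


def rater_view(steps: List[str], rng: random.Random, p_partial: float = 0.5) -> List[str]:
    """Return the steps a rater gets to see.

    With probability `p_partial` the rater sees only a strict, non-empty prefix
    of the steps (so never the last step, assumed here to hold the answer);
    otherwise the rater sees everything.
    """
    if len(steps) > 1 and rng.random() < p_partial:
        cut = rng.randint(1, len(steps) - 1)  # inclusive on both ends
        return steps[:cut]
    return list(steps)


if __name__ == "__main__":
    steps = ["restate the problem", "try a substitution", "simplify", "answer: 42"]
    rng = random.Random(0)
    for _ in range(3):
        print(rater_view(steps, rng, p_partial=0.5))
```

Ratings collected on the truncated views can only depend on whether the reasoning so far looks sensible, which is the property the comment above is asking for.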