I think this is a very important direction to investigate. CoT reasoners seem like our best shot at aligning an AGI.
The KL objective pushes these correlations into the “null space” where they are not penalized.
I don’t think this is true. There are many ways to express the same underlying idea, but GPT-3 is not agnostic between them. Divergences from GPT-3’s nullspace are still penalized. KL isn’t w.r.t. “does this text contain the same high level contents according to a human?” It’s w.r.t. “does this text match GPT-3’s most likely continuations?” (Edit: not actually how KL divergence works.)
I think I understand what you’re saying, but I want to double check and try laying it out explicitly.
The “null space” is actually the null space of the prior model, which is human-like, but not exactly lined up with human perception of language
Because of this, the information isn’t “hidden” w.r.t. human perception; it’s “hidden” w.r.t. the language model prior
I think I agree with both of these points, but here’s my thinking for why I still expect to see this phenomenon (and why the article above was simplified to just be “human”)
Empirically, drawing many samples results in lots of generations that “contain similar high level contents according to a human (me)”
It might be worth studying this explicitly, e.g. by setting up a bunch of chain-of-thought-reasoning-like scenarios, sampling a lot of generated reasoning, and then asking humans to rate how similar the generations are (a rough sketch of the sampling side is below)
My prediction is that people will think that they are largely similar
We could also use this same method to study human-measured relevance of other non-trivial sampling methods, like beam search, etc
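Concretely, the sampling side might look something like the sketch below; the model (GPT-2 via Hugging Face transformers), the prompt, and the sampling settings are stand-ins I’m assuming purely for illustration, and the printed samples are what would go to human raters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A chain-of-thought style prompt (hypothetical, for illustration only).
prompt = ("Q: A farmer has 12 sheep and buys 7 more. How many sheep does the "
          "farmer have?\nA: Let's think step by step.")
inputs = tokenizer(prompt, return_tensors="pt")

# Draw many independent reasoning samples for the same prompt.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    max_new_tokens=128,
    num_return_sequences=20,
    pad_token_id=tokenizer.eos_token_id,
)
prompt_len = inputs["input_ids"].shape[1]
samples = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

# These samples would then be shown to human raters (e.g. in pairs), who judge
# whether the reasoning traces contain similar high-level contents.
for i, sample in enumerate(samples):
    print(f"--- sample {i} ---\n{sample}\n")
```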
“Hiding” w.r.t. the language model prior is sufficient, in the limit of good language models
One of the things I think my reasoning relies on is that we will continue to produce better language models, which means that, over time, the prior will more and more closely resemble human priors
This suggests that one way to avoid this phenomenon is to only use smaller / less capable models—or at least that they will do a worse job of hiding information. I currently expect that many research labs will continue to do chain of thought reasoning with state-of-the-art models, though.
There’s a lot of room in the model distribution
Chain of thought reasoning is not currently, nor do I expect it will ever be, overconstrained.
Instead, it can be seen as a mechanism for dramatically increasing the space of possible outputs, both those seen during training, and those encountered at test time
Human preference effects
Research Campaign Effects: Human researchers will want human-like outputs, and thus can be expected to tweak/retrain/etc. in order to get human-like reasoning in the chain of thought. This introduces an optimization pressure, separate from the KL penalty, towards the reasoning being human-like
Human preference models: I think an easy-to-predict future direction here is training reward models to match subjective human preferences and then optimizing with them or sampling with them (e.g. best-of-N; a minimal sketch is below) -- these preference models could also introduce additional optimization pressure towards human-like reasoning
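For concreteness, here is a minimal best-of-N sketch; `generate_cot` and `preference_score` are hypothetical placeholders for the policy’s sampler and a learned human-preference reward model, not anything from the post.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate_cot: Callable[[str], str],
              preference_score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n chain-of-thought candidates and return the one the
    human-preference model scores highest."""
    candidates: List[str] = [generate_cot(prompt) for _ in range(n)]
    return max(candidates, key=lambda cot: preference_score(prompt, cot))
```

Even with no gradient updates to the policy, selecting the argmax under the preference model adds its own pressure towards whatever that model rewards.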
I think the most important point in all of this is that I expect the model prior to be sufficiently close to the human prior—and I’d be interested in learning more about evidence or research to the contrary.
Apologies for my first comment. I was tired, and most of what I said about KL divergence is actually wrong.
The core issue is that you can’t actually avoid the KL penalty by hiding in the nullspace. That’s just not how KL divergence works. A KL divergence of zero implies that the two LMs specify exactly the same distribution over tokens. If you deviate away from the reference LM’s prior in any manner, then you take a KL penalty.
E.g., if the reference LM is indifferent between the phrases “My stomach is grumbling” and “I could use a snack”, but the steganographic LM always uses “My stomach is grumbling”, then this deviation is also penalized by KL divergence, regardless of whether the reference LM’s indifference between the two phrasings means you call this part of the LM’s “nullspace”.
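As a toy calculation (the 50/50 reference split and the near-deterministic steganographic choice are made-up numbers, with the penalty computed in the usual KL(policy ‖ reference) direction):

```python
import math

def kl(q, p):
    """KL(q || p) over a shared support, in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Reference LM: indifferent between the two phrasings.
reference = [0.5, 0.5]            # "My stomach is grumbling" / "I could use a snack"
# Steganographic LM: (nearly) always picks one phrasing to carry a hidden bit.
steganographic = [0.999, 0.001]

print(kl(steganographic, reference))  # ~0.685 nats of penalty, despite "same meaning"
```

Fully committing to one of two equally likely phrasings would cost exactly ln 2 ≈ 0.693 nats at that position, even though nothing a human cares about has changed.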
If what you’re saying is “any change to the distribution will change KL” — I think that’s just correct.
This also applies to changes during training where the model is learning to perform better on the objective task.
So we are expecting some amount of KL divergence already.
My claims are:
The cheapest places to hide information (in terms of KL) are places where the model already has high entropy (i.e. it is uncertain between many possible outputs); a rough numerical illustration follows below
optimization pressure will try to push this extra information into the cheapest places to hide
the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task
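Here’s the rough numerical illustration I mentioned: the cost of (nearly) deterministically committing to one of two continuations, at a position where the reference model was uncertain versus one where it was confident. The specific probabilities are made up for illustration.

```python
import math

def kl_bernoulli(q, p):
    """KL(q || p) for two Bernoulli distributions, in nats."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

policy = 0.999  # the trained model (nearly) always picks continuation A, to signal a bit

high_entropy_ref = 0.5   # reference model: A and B about equally likely
low_entropy_ref = 0.01   # reference model: A is a rare continuation

print(kl_bernoulli(policy, high_entropy_ref))  # ~0.69 nats
print(kl_bernoulli(policy, low_entropy_ref))   # ~4.6 nats
```

The same commitment is several times more expensive where the reference model was already confident, which is why I expect the extra information to end up in the high-entropy positions.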
I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not by weird changes to the grammatical rules. This is because KL penalties and cross-entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content.
Another framing is to think of the KL-regularized capabilities training as a Bayesian update of the model’s priors on the distribution over text. The model should have much stronger priors on grammatical rules than on higher-level concepts. So, the update probably changes conceptual patterns more than grammatical ones.
the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task
Can’t you just compare the token predictions of the new model against the reference LM? Then you can check if they disagree on weird grammatical patterns or on conceptual tokens that are related to a human-like understanding of how to tackle the problem. Maybe you can even bias the KL penalty to apply higher weight to stopwords, punctuation, and synonym choices?
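Roughly, I’m imagining something like the sketch below: score the same reasoning text under both models, look at the per-token log-prob gaps, and (crudely) upweight stopwords and punctuation. Both models here are stand-ins (the same GPT-2 checkpoint, so the gaps will be ~0), and the token list and weights are assumptions just to show the shape of the idea.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_logprobs(model, input_ids):
    """Log-probability the model assigns to each actual next token."""
    with torch.no_grad():
        logits = model(input_ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    return logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
reference = AutoModelForCausalLM.from_pretrained("gpt2")
trained = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the fine-tuned policy

text = "First, add 12 and 7. That gives 19, so the answer is 19."
input_ids = tokenizer(text, return_tensors="pt").input_ids

ref_lp = per_token_logprobs(reference, input_ids)[0]
new_lp = per_token_logprobs(trained, input_ids)[0]

# Crude proxy for "non-semantic" tokens: stopwords and punctuation get extra weight.
NON_SEMANTIC = {" the", " a", " of", " and", " to", " is", " that", " so", ",", "."}
for tok_id, r, n in zip(input_ids[0, 1:], ref_lp, new_lp):
    tok = tokenizer.decode(int(tok_id))
    weight = 2.0 if tok in NON_SEMANTIC else 1.0
    print(f"{tok!r:>10}  ref={float(r):.2f}  trained={float(n):.2f}  "
          f"weighted_gap={weight * float(n - r):+.2f}")
```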
I think at this point these feel like empirical questions, which would be much more clearly answered by demonstrations or experiments.
Trying to add an additional penalty on changing non-semantic information is an interesting idea.
However, I think you’re missing that you can’t cleanly compare against a reference LM in cases where you’re training to improve on some performance benchmark: during training the model will change its predictions on everything to some degree (both semantic and non-semantic content).
So your proposed detection will always show some amount of disagreement between the prior and the trained model, on weird grammatical patterns as well as on conceptual tokens. The question is: “is the difference merely due to the changes that improve performance, or is it also transmitting hidden information?”