A Ray comments on Steganography in Chain of Thought Reasoning

A Ray 9 Aug 2022 3:57 UTC
LW: 2 AF: 2
If what you’re saying is “any change to the distribution will change KL,” then I think that’s just correct.

This also applies to changes during training where the model is learning to perform better on the objective task.

So we are expecting some amount of KL divergence already.

My claims are:
1. The cheapest places to hide information (in terms of KL) are positions where the model already has high entropy (i.e., it is uncertain between many possible outputs).
2. Optimization pressure will try to push this extra information into the cheapest places to hide it.
3. The increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task.
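
A toy sketch of claim 1, under assumptions that are mine and not part of any real training setup: one hidden bit is signalled per position by sampling only from one of two token subsets (the reference distribution renormalized onto that subset), and the cost is measured as KL(policy || reference). The distributions, the subsets, and the helper names are all hypothetical.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) in nats for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def restrict(p, allowed):
    """Renormalize p onto the token indices in `allowed` (zero elsewhere)."""
    mass = sum(p[i] for i in allowed)
    return [p[i] / mass if i in allowed else 0.0 for i in range(len(p))]

def expected_hiding_cost(p, subset_for_0, subset_for_1):
    """Average KL paid to signal one uniformly random bit via the subset choice."""
    q0 = restrict(p, subset_for_0)
    q1 = restrict(p, subset_for_1)
    return 0.5 * kl_divergence(q0, p) + 0.5 * kl_divergence(q1, p)

# Hypothetical next-token distributions over a 4-token vocabulary.
high_entropy = [0.25, 0.25, 0.25, 0.25]  # model is unsure between outputs
low_entropy = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain

cost_high = expected_hiding_cost(high_entropy, {0, 1}, {2, 3})
cost_low = expected_hiding_cost(low_entropy, {0}, {1, 2, 3})

print(f"high-entropy position: {cost_high:.3f} nats per hidden bit")
print(f"low-entropy position:  {cost_low:.3f} nats per hidden bit")
```

In this toy setup the uniform position pays about log 2 nats per hidden bit, while the peaked position pays noticeably more, because signalling one of the two bit values forces the policy onto tokens the reference considers very unlikely.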
- Quintin Pope 9 Aug 2022 4:29 UTC
  LW: 4 AF: 2
  I think it probably works out in the other direction. The easiest way to encode problem-relevant info is by changing the conceptual content of the generated text, not weird changes to the grammatical rules. This is because KL penalties and cross entropy loss actually prioritize modeling exact grammatical patterns over conceptual content. E.g., GPTs will quickly learn to generate ~perfect grammar, but are slow to pick up on / produce coherent conceptual content.
  
  Another framing is to think of the KL-regularized capabilities training as Bayesian updating of the model’s priors on the distribution over text. The model should have much stronger priors on grammatical rules than on higher-level concepts. So, the update probably changes conceptual patterns more than grammatical ones.
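
  A toy illustration of that framing, not a model of how an LM actually updates: two Beta priors with the same mean but different strength are given identical evidence. The pseudo-counts and the labels "grammar" and "concept" are assumptions made up for this sketch.

```python
def posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of a Beta(alpha, beta) prior after Bernoulli observations."""
    return (alpha + successes) / (alpha + beta + successes + failures)

# Both priors have mean 0.9; only their strength (total pseudo-count) differs.
strong_prior = (900.0, 100.0)  # stands in for a firmly learned grammatical pattern
weak_prior = (9.0, 1.0)        # stands in for a loosely learned conceptual pattern

# The same hypothetical evidence for both: 2 successes, 8 failures.
successes, failures = 2, 8

strong_post = posterior_mean(*strong_prior, successes, failures)
weak_post = posterior_mean(*weak_prior, successes, failures)

shift_strong = strong_post - 0.9
shift_weak = weak_post - 0.9

print(f"strong prior moves by {shift_strong:+.4f}")
print(f"weak prior moves by   {shift_weak:+.4f}")
```

  With identical evidence, the weakly held belief moves much further than the strongly held one, which is the intuition behind expecting conceptual patterns to shift more than grammatical ones.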
  
  “the increase in KL won’t be clearly distinguishable from the increase due to increased performance on the task”
  
  Can’t you just compare the token predictions of the new model against the reference LM? Then you can check if they disagree on weird grammatical patterns or on conceptual tokens that are related to a human-like understanding of how to tackle the problem. Maybe you can even bias the KL penalty to apply higher weight to stopwords, punctuation, and synonym choices?
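
  A minimal sketch of that comparison, with several assumptions: each position is a pair of next-token distributions (tuned model, reference LM) given as plain dictionaries, a position counts as "non-semantic" when the reference model's most likely token is in a small hand-written set, and the weight of 3.0 is arbitrary. The token set, the function names, and the example distributions are all hypothetical, and synonym choices are not handled.

```python
import math

# Hypothetical set of tokens treated as "non-semantic" for this sketch.
NON_SEMANTIC = {"the", "a", "of", "and", ",", ".", ";"}

def kl_divergence(q, p):
    """KL(q || p) in nats; q and p map token -> probability over the same tokens."""
    return sum(qv * math.log(qv / p[tok]) for tok, qv in q.items() if qv > 0)

def compare_to_reference(positions, non_semantic_weight=3.0):
    """Split per-position KL into two buckets and compute a weighted penalty."""
    report = {"non_semantic_kl": 0.0, "semantic_kl": 0.0}
    penalty = 0.0
    for tuned, reference in positions:
        kl = kl_divergence(tuned, reference)
        # Assumption: classify the position by the reference model's top token.
        top_token = max(reference, key=reference.get)
        if top_token in NON_SEMANTIC:
            report["non_semantic_kl"] += kl
            penalty += non_semantic_weight * kl
        else:
            report["semantic_kl"] += kl
            penalty += kl
    report["weighted_penalty"] = penalty
    return report

positions = [
    # The reference expects punctuation; the tuned model drifts toward ";".
    ({",": 0.6, ";": 0.4}, {",": 0.9, ";": 0.1}),
    # The two models disagree on a content word.
    ({"add": 0.7, "subtract": 0.3}, {"add": 0.4, "subtract": 0.6}),
]

report = compare_to_reference(positions)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

  The two buckets show where the tuned model disagrees with the reference, and the weighted total is one way the penalty could be biased toward non-semantic positions.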
  - A Ray 9 Aug 2022 6:56 UTC
    LW: 1 AF: 1
    At this point these feel like empirical questions, which I think would be much more clearly answered by demonstrations or experiments.
    Trying to encode an additional penalty on changing non-semantic information is an interesting idea.
    However, I think you’re missing that you don’t have the ability to directly compare to a reference LM in cases where you’re training to improve on some performance benchmark. During training the model will change its predictions on everything to some degree (both semantic and non-semantic content).
    So your proposed detection will always show some amount of disagreement between the prior and the trained model, on weird grammatical patterns as well as on conceptual tokens. The question is: “Is the difference merely due to the changes that improve performance, or is it also transmitting hidden information?”