Both (the final answers are usually very short compared to the CoT)
The 1-sigma uncertainty over the mean is sigma = sqrt(acc * (1-acc) / n) ~ 1%. But I think the results are slightly more significant than this suggests, since I evaluate all models on the exact same set of questions, and given that chunk-paraphrased is worse across experiments on multiple models, I would guess that the same effect would exist for a much larger eval.
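For concreteness, here is a minimal sketch of that standard-error calculation; the accuracy and question count below are illustrative placeholders, not the numbers from the actual eval.

```python
import math

def binomial_se(acc: float, n: int) -> float:
    """1-sigma standard error of a mean accuracy over n independent questions."""
    return math.sqrt(acc * (1 - acc) / n)

# Illustrative numbers only (not the actual accuracy or eval size):
# an accuracy around 80% on ~1,600 questions gives a standard error of about 1%.
print(binomial_se(0.80, 1600))  # 0.01
```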
Good catch, I think I wrote that based on earlier results. I’ll edit the post.
On your notes:
If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at? My hope is that paraphrasing removes the mutual information through standardization / randomization (where what is random is not the map, but the actual samples, because I am sampling at temperature 1). For example, if enumerated list ⇒ equation and bulleted list ⇒ something else, then swapping enumerated and bulleted would not help, but having the model always use a bulleted list, or use a bulleted list a random 50% of the time, does help to break the enumerated list ⇔ equation link.
I agree that it’s unclear where paraphrasing is on the [bijection]-[no syntax left unrandomized] spectrum.
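A toy sketch of those two endpoints (illustrative code, not the actual paraphrasing setup from the post): a hidden bit is leaked through the list style, and a hypothetical decoder that only reads the syntax recovers it perfectly through a deterministic swap but falls to chance once the style is resampled independently at random.

```python
import random

random.seed(0)

# Toy model: a hidden bit is leaked through a syntactic choice
# (enumerated vs. bulleted list). We compare how well a decoder that only
# sees the syntax recovers the bit under two paraphrasers. This is an
# illustrative sketch, not the setup from the post.

def original_writer(hidden_bit: int) -> str:
    # The writer leaks the bit: 1 -> enumerated list, 0 -> bulleted list.
    return "enumerated" if hidden_bit else "bulleted"

def bijective_paraphraser(syntax: str) -> str:
    # A deterministic swap: still a bijection, so the bit remains recoverable.
    return {"enumerated": "bulleted", "bulleted": "enumerated"}[syntax]

def randomizing_paraphraser(syntax: str) -> str:
    # Temperature-1-style resampling of the syntax, independent of the input:
    # the syntactic channel carries no information anymore.
    return random.choice(["enumerated", "bulleted"])

def decoder_accuracy(paraphraser, n: int = 10_000) -> float:
    # The decoder uses the best fixed mapping it could have learned
    # (here: undo the swap). No decoder can beat chance on resampled syntax.
    hits = 0
    for _ in range(n):
        bit = random.randint(0, 1)
        syntax = paraphraser(original_writer(bit))
        guess = 0 if syntax == "enumerated" else 1
        hits += guess == bit
    return hits / n

print("bijective swap:", decoder_accuracy(bijective_paraphraser))     # ~1.0
print("random resample:", decoder_accuracy(randomizing_paraphraser))  # ~0.5
```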
If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at?
Yes exactly. I further claim that non-chunk paraphrasing seems more likely to be closer to the bijection end of the spectrum than chunk paraphrasing, though I agree that it could still be a many-to-one mapping (i.e. standardization).
I am especially concerned about this when training on the final answer.
Maybe my question just boils down to:
How do you distinguish between “paraphrasing didn’t break the syntactic ⇔ semantic link” and “the model isn’t relying on that link / the link doesn’t exist in the first place”?
I agree this is a concern.
My view here is that if you look at the paraphrases, it seems that a decent amount of standardization happened. Empirically, it also looks like T=1 sampling of the paraphraser introduces random variations in the syntax. So I would be very surprised if a decent chunk of the syntactic link was not broken (e.g. at least 20% of it). Therefore I think it’s reasonable to bound the performance due to encoded reasoning by ~5x the gap between the original distillation and the paraphrased distillation, which would be ~7% (in absolute pass@1). That still seems not very frightening to me given that the no-CoT vs CoT gap is >50%.
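A minimal sketch of that bounding arithmetic, with the assumed fraction broken and the observed distillation gap as explicit parameters (the gap value used below is hypothetical, chosen only so that 5x matches the ~7% figure):

```python
# Rough sketch of the bounding arithmetic above. Both numbers are free
# parameters: the fraction assumed broken comes from the "at least 20%"
# guess, and the gap value is only an illustration consistent with the
# quoted ~7% bound (5 x gap), not a measured result.

def encoded_reasoning_bound(observed_gap_pts: float, fraction_broken: float) -> float:
    """Upper bound (in absolute pass@1 points) on what encoded reasoning buys.

    If paraphrasing destroyed at least `fraction_broken` of the syntactic
    channel, the full channel is worth at most observed_gap / fraction_broken.
    """
    return observed_gap_pts / fraction_broken

bound = encoded_reasoning_bound(observed_gap_pts=1.4, fraction_broken=0.20)
print(bound)        # ~7.0 points of absolute pass@1
print(bound < 50)   # True: small relative to the >50-point no-CoT vs CoT gap
```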
This is sensitive to the size of the gap. If future work finds a ~5% gap on a new model, I’d be very concerned because it would make it more likely that the majority of the reasoning is encoded.
(Edit: This is a rough sketch of the argument; there are complications around diminishing returns on reasoning: maybe only a very small amount of visible reasoning is responsible for the majority of the no-CoT vs CoT gap. But it’s unclear how much this undermines my argument, because what I care most about is the model’s ability to not leak anything about what it is reasoning about.)