If paraphrasing were a bijection between semantically equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at?
Yes exactly. I further claim that non-chunk paraphrasing seems more likely to be closer to the bijection end of the spectrum than chunk paraphrasing, though I agree that it could still be a many-to-one mapping (i.e. standardization).
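To make the bijection vs. many-to-one distinction concrete, here is a minimal toy sketch (my illustration, not from either commenter; the synonym pair, function names, and one-bit channel are all invented): a deterministic bijective paraphraser leaves an encoded bit recoverable, while a standardizing paraphraser collapses it.

```python
# Toy sketch: one bit of hidden reasoning encoded in an arbitrary syntactic
# choice (which synonym the model emits). All names here are invented.

def encode(hidden_bit: int) -> str:
    # The model "hides" one bit in a synonym choice.
    return "hence" if hidden_bit else "therefore"

def bijective_paraphrase(text: str) -> str:
    # A bijection only relabels: swap the synonyms deterministically,
    # so the hidden bit is still recoverable downstream.
    swap = {"therefore": "hence", "hence": "therefore"}
    return swap[text]

def standardizing_paraphrase(text: str) -> str:
    # Many-to-one (standardization): both synonyms collapse to one
    # canonical form, destroying the channel.
    return "therefore"

def channel_survives(paraphrase) -> bool:
    # The bit is recoverable iff the two encodings stay distinguishable.
    return paraphrase(encode(0)) != paraphrase(encode(1))

print(channel_survives(bijective_paraphrase))      # True
print(channel_survives(standardizing_paraphrase))  # False
```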
I am especially concerned about this when training on the final answer.
Maybe my question just boils down to: how do you distinguish between "paraphrasing didn't break the syntactic <> semantic link" and "the model isn't relying on that link / the link doesn't exist in the first place"?
I agree that training on the final answer is a concern.

On your question: my view here is that if you look at the paraphrases, a decent amount of standardization seems to have happened. Empirically, it also looks like T=1 sampling of the paraphraser introduces random variation in the syntax. So I would be very surprised if a decent chunk of the syntactic link was not broken (e.g. at least 20% of it). Therefore I think it's reasonable to bound the performance due to encoded reasoning to ~5x the gap between the original distillation and the paraphrased distillation (if at least 20% of the link broke, the observed gap understates the full encoded contribution by at most a factor of ~5), which would be ~7% in absolute pass@1. That still seems not very frightening to me given that the no-CoT vs CoT gap is >50%.
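For concreteness, the arithmetic behind the bound (my sketch; the ~1.4-point observed gap is just the ~7% figure divided by the 5x factor, not a separately reported number):

```python
# Back-of-the-envelope version of the bound above (numbers are the rough
# ones quoted in this thread, all approximate).
observed_gap = 1.4      # original vs paraphrased distillation, absolute pass@1 points (~7 / 5)
fraction_broken = 0.20  # assumed lower bound on how much of the syntactic link broke
cot_gap = 50.0          # no-CoT vs CoT gap, absolute pass@1 points (">50%")

# If paraphrasing breaks at least 20% of the encoded channel, the observed gap
# understates the full encoded contribution by at most 1 / 0.20 = 5x.
encoded_bound = observed_gap / fraction_broken
print(f"encoded reasoning <= ~{encoded_bound:.0f} points, vs a CoT gap of >{cot_gap:.0f}")
# -> encoded reasoning <= ~7 points, vs a CoT gap of >50
```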
This bound is sensitive to the size of the gap. If future work found a ~5% gap on a new model, the same 5x scaling would put the bound at ~25%, and I'd be very concerned because it would make it more likely that the majority of the reasoning is encoded.
(Edit: This is a rough sketch of the argument; there are complications around diminishing returns on reasoning: maybe only a very small amount of visible reasoning is responsible for the majority of the no-CoT vs CoT gap. But it's unclear how much this undermines my argument, because what I most care about is the model's ability to not leak anything about what it is reasoning about.)
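To see how the diminishing-returns complication could bite, here is a toy concave model (entirely invented: the exponential form and every constant are assumptions for illustration, not fit to any data): a small accuracy drop near the top of a concave curve can correspond to losing most of the underlying reasoning.

```python
import math

# Toy concave model: accuracy as a function of "units of reasoning" used.
# The functional form and constants are invented for illustration only.
def acc(units: float) -> float:
    return 0.30 + 0.50 * (1 - math.exp(-0.5 * units))  # 30% no-CoT, 80% asymptote

full_units = 10.0                # units available with full CoT
target = acc(full_units) - 0.07  # accuracy after a 7-point drop

# Invert acc() to find how many units of reasoning the 7-point drop costs.
units_left = -2.0 * math.log(1.0 - (target - 0.30) / 0.50)
print(f"full CoT: {acc(full_units):.1%}; after 7-point drop: {target:.1%}")
print(f"share of reasoning units lost: {1 - units_left / full_units:.0%}")
# -> ~62% of the reasoning units lost for only a 7-point accuracy drop
```

On this toy curve, paraphrasing away only 7 accuracy points could still mean most of the reasoning content was encoded, which is why the 7%-vs->50% comparison alone doesn't fully settle the question.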