What we care about is whether the compute being done by the model faithfully factors through its token outputs. To the extent that a given token, under the usual human reading, doesn’t represent much compute, it doesn’t matter whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually uses CoT) model should error-correct mistakes at a rate that is some monotonic (presumably decreasing) function of how compute-expensive the correction is.
For copying errors, the copying operation involves minimal compute, so insensitivity to previous copy errors isn’t all that surprising or concerning. You can see this in the heatmap plots: e.g. the ‘9’ token in 3+6=9 seems to care more about the first ‘3’ token than about the immediately preceding summand token, which suggests the copying operation was not really helpful/meaningful compute. By contrast, I’d expect the outputs of arithmetic operations to be meaningful. I’d be interested to see the sensitivities when you aggregate only over the outputs of arithmetic / other non-copying operations.
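To make the suggested aggregation concrete, here is a rough sketch of how one might split integer tokens into arithmetic outputs versus copies before averaging. Everything here is an assumption for illustration: the labelling heuristic (an integer right after ‘=’ counts as computed), the function names, and the `sensitivity` list of per-token scores are all hypothetical and not taken from the post.

```python
from statistics import mean


def classify_integer_tokens(tokens):
    """Label each integer token as 'computed', 'copied', or 'given'.

    Heuristic (an assumption, not the post's method): an integer directly
    after '=' is the output of an arithmetic step; any other integer whose
    value already appeared earlier is a copy; the rest come from the prompt.
    """
    labels = {}
    seen = set()
    for i, tok in enumerate(tokens):
        if not tok.isdigit():
            continue
        if i > 0 and tokens[i - 1] == "=":
            labels[i] = "computed"
        elif tok in seen:
            labels[i] = "copied"
        else:
            labels[i] = "given"
        seen.add(tok)
    return labels


def mean_sensitivity_by_label(tokens, sensitivity):
    """Average per-token sensitivity scores within each label.

    `sensitivity[i]` is a hypothetical score for how much the final answer
    changes when token i is intervened on.
    """
    groups = {}
    for i, label in classify_integer_tokens(tokens).items():
        groups.setdefault(label, []).append(sensitivity[i])
    return {label: mean(vals) for label, vals in groups.items()}


# Toy chain of thought: "3 + 6 = 9 , 9 * 2 = 18"
tokens = ["3", "+", "6", "=", "9", ",", "9", "*", "2", "=", "18"]
labels = classify_integer_tokens(tokens)
```

In the toy example the second ‘9’ is labelled as a copy, while the first ‘9’ and the ‘18’ are labelled as computed, so their sensitivities can be reported separately instead of being pooled over all integer tokens.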
I like the application of Shapley values here, but I think aggregating over all integer tokens is a bit misleading for this reason. When evaluating CoT faithfulness, token-intervention sensitivity should be weighted, in some sense, by how much compute it costs to reproduce/correct that token (e.g. perhaps by the number of forward passes needed when queried separately). I’m not sure what the right, generalizable way to do this is, but an interesting comparison point might be to replace certain numbers (and all downstream repetitions of that number) with variable tokens like ‘X’. This seems more natural than just ablating individual tokens with e.g. ‘_’.
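As a minimal sketch of both ideas, assuming a simple whitespace-style tokenization where each number is one token: `substitute_variable` swaps a chosen number and its downstream repetitions for ‘X’, and `compute_weighted_sensitivity` takes a compute-weighted mean of sensitivities. The function names and the `compute_cost` weights are hypothetical; how to actually estimate compute cost per token is left open, as in the comment above.

```python
def substitute_variable(tokens, start, symbol="X"):
    """Replace the number at `start` and every later repetition with `symbol`.

    Earlier occurrences are left untouched, so only the downstream chain of
    copies is abstracted away.
    """
    target = tokens[start]
    if not target.isdigit():
        raise ValueError(f"token at position {start} is not an integer: {target!r}")
    return [
        symbol if i >= start and tok == target else tok
        for i, tok in enumerate(tokens)
    ]


def compute_weighted_sensitivity(sensitivity, compute_cost):
    """Mean sensitivity, weighting each token by a (hypothetical) compute cost.

    `compute_cost[i]` could be, for instance, an estimate of the number of
    forward passes needed to reproduce token i when queried separately.
    """
    total = sum(compute_cost)
    if total == 0:
        raise ValueError("all compute costs are zero")
    return sum(s * c for s, c in zip(sensitivity, compute_cost)) / total


# Toy chain of thought: "3 + 6 = 9 , 9 * 2 = 18"
tokens = ["3", "+", "6", "=", "9", ",", "9", "*", "2", "=", "18"]
abstracted = substitute_variable(tokens, start=4)
```

Here `abstracted` keeps the structure of the reasoning but hides the value of the intermediate result, unlike a single-token ‘_’ ablation that leaves the later copy of ‘9’ in place. Giving copied tokens a cost near zero in `compute_weighted_sensitivity` means insensitivity to them barely moves the aggregate.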