Speculative hypothesis: Maybe some of the cases in which editing one number doesn't ruin the bottom-line result are not as bad as they sound, for the following reason: the system is surprised to see the wrong number there, mentally writes it off as a typo, and proceeds with the true number it expected in mind.
(This happens to me a lot. Sometimes I'll be reading someone's argument about how X leads to anti-Y which correlates with Z, and I'll notice an obvious typo like "don't they mean anticorrelates here?" and then I'll just keep reading, assuming they meant what I think they meant instead of what they actually said.)
Does the structure of the transformer allow for this sort of cognition?
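One cheap way to poke at this empirically would be to edit a single intermediate number in a CoT prefix and see whether the model's continuation follows the edited value or reverts to the value it "expected". A minimal sketch, assuming a HuggingFace-style causal LM; the model name and toy prompt are stand-ins of my own, not the setup from the post:

```python
# Minimal sketch, assuming a HuggingFace causal LM; "gpt2" and the toy prompt
# below are illustrative stand-ins, not the model or data from the post.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continue_cot(prefix: str, max_new_tokens: int = 8) -> str:
    """Greedily continue a chain-of-thought prefix and return only the new text."""
    ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Edit exactly one intermediate number (9 -> 8) and compare the continuations.
original = "We have 3 apples and buy 6 more: 3+6=9. We then eat 4, so we are left with"
edited   = "We have 3 apples and buy 6 more: 3+6=8. We then eat 4, so we are left with"

# If the model treats the 8 as a typo and keeps the "true" sum in mind, both
# continuations should say 5; if it takes the edit at face value, the second says 4.
print(continue_cot(original))
print(continue_cot(edited))
```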
Are you familiar with the Sparks of AGI folks' work, which has a practical example of GPT-4 asserting that early errors in its CoT are typos (timestamped link)? Pretty good prediction if not.
Nope, hadn't seen that before, thanks for the tip.
What we care about is whether the compute being done by the model faithfully factors through token outputs. To the extent that a given token, under the usual human reading, doesn't represent much compute, it doesn't matter much whether the output is sensitively dependent on that token. As Daniel mentions, we should also expect some amount of error correction, and a reasonable (non-steganographic, actually-uses-CoT) model should correct mistakes at a rate that is some monotonic function of how compute-expensive the correction is.
For copying errors, the copying operation involves minimal compute, so insensitivity to previous copy errors isn't all that surprising or concerning. You can see this in the heatmap plots: e.g. the '9' token in 3+6=9 seems to care more about the first '3' token than about the immediately preceding summand token, suggesting the copying operation wasn't really meaningful compute. By contrast, I'd expect the outputs of arithmetic operations to be meaningful. I'd be interested to see the sensitivities when you aggregate only over the outputs of arithmetic and other non-copying operations, roughly as in the sketch below.
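Something like the following is what I have in mind. It is only a rough sketch: the CoT string, the "number follows '='" heuristic for spotting arithmetic outputs, and the random placeholder sensitivity scores are all made up for illustration, not taken from the post's data.

```python
# Rough sketch of restricting the aggregation to arithmetic-output tokens.
import random
import re

cot = "We start with 3. 3+6=9. 9*2=18. 18-3=15."

# Every integer occurrence in the trace, as (position, string).
occurrences = [(m.start(), m.group()) for m in re.finditer(r"\d+", cot)]

# Placeholder per-occurrence sensitivity scores, purely to make the snippet run.
random.seed(0)
sensitivity = {occ: random.random() for occ in occurrences}

def is_arithmetic_output(pos: int) -> bool:
    """Crude heuristic: a number is the output of an arithmetic step iff it follows '='."""
    return cot[:pos].rstrip().endswith("=")

arith = [sensitivity[occ] for occ in occurrences if is_arithmetic_output(occ[0])]
copies = [sensitivity[occ] for occ in occurrences if not is_arithmetic_output(occ[0])]

print("mean sensitivity over arithmetic outputs:", sum(arith) / len(arith))
print("mean sensitivity over copied/input numbers:", sum(copies) / len(copies))
```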
I like the application of Shapley values here, but I think aggregating over all integer tokens is a bit misleading for this reason. When evaluating CoT faithfulness, token-intervention sensitivity should be weighted by how much compute it costs to reproduce or correct that token in some sense (e.g. perhaps by the number of forward passes needed when queried separately). I'm not sure what the right, generalizable way to do this is, but an interesting comparison point might be to replace certain numbers (and all downstream repetitions of that number) with variable tokens like 'X'. This seems more natural than ablating individual tokens with e.g. '_'.
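A minimal sketch of the substitution I mean, at the level of plain strings (the actual experiment would need more care at the tokenizer level, and the example CoT here is my own toy string):

```python
# Replace one number, and every downstream repetition of it, with a variable token.
import re

def substitute_variable(cot: str, target: str, var: str = "X") -> str:
    """Replace the first occurrence of `target` and all later repetitions with `var`."""
    first = re.search(rf"\b{re.escape(target)}\b", cot)
    if first is None:
        return cot
    head, tail = cot[:first.start()], cot[first.start():]
    return head + re.sub(rf"\b{re.escape(target)}\b", var, tail)

cot = "3+6=9. 9*2=18. 18-9=9."
print(substitute_variable(cot, "9"))
# -> "3+6=X. X*2=18. 18-X=X."
```

The point of replacing downstream repetitions too is that the intervention then reads as "treat this quantity as an unknown" rather than "this one token was corrupted", which seems closer to what we want when asking how much the computation actually routes through that value.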