Thanks for the comment, Rohin; that's interesting (though I haven't looked at the paper you linked).
I'll just record some confusion I had after reading your comment that initially stopped me from replying: I was confused by the distinction between modular and non-modular addition, because I kept thinking: if I add two numbers x and y and never apply any modulus, that's equivalent to doing modular addition modulo some sufficiently large number (i.e. at least as large as the largest sum that occurs). And on the other hand, if I tell you I'm doing 'addition modulo 113' but only ever use inputs that sum to 112 or less, then you never see that I was secretly intending to do modular addition. These thoughts sort of stopped me from having anything more interesting to add.
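To make that thought concrete, here's a minimal Python sketch (the modulus 113 matches the setup discussed here; everything else is just illustration) showing that on input pairs whose sum stays below the modulus, modular and non-modular addition produce identical labels, so training data of that form can't distinguish the two tasks:

```python
# Minimal sketch: when x + y < P, (x + y) % P == x + y, so a model trained
# only on such pairs never sees evidence of the intended wraparound.
P = 113

# All pairs whose sum never reaches the modulus.
pairs = [(x, y) for x in range(P) for y in range(P) if x + y < P]

# Every label agrees between the "plain addition" and "mod P" tasks.
assert all((x + y) % P == x + y for x, y in pairs)
print(f"{len(pairs)} pairs, labels identical under both tasks")
```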
I agree. The point is that if you train on addition examples without any modular wraparound (whether you think of that as regular addition or as modular addition with a large prime doesn't much matter), then there is at least some evidence that you get a different representation than the one Nanda et al. found.
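As a hedged illustration of what "without any modular wraparound" would mean for the training data (this is a sketch, not Nanda et al.'s actual dataset code):

```python
# Sketch: the full mod-113 task includes pairs whose sum wraps around,
# and the "no wraparound" variant simply filters those pairs out.
P = 113
all_pairs = [(x, y) for x in range(P) for y in range(P)]

wrap = [(x, y) for x, y in all_pairs if x + y >= P]    # e.g. (112, 5) -> 4, not 117
no_wrap = [(x, y) for x, y in all_pairs if x + y < P]  # labels agree with plain addition

print(f"wraparound pairs dropped: {len(wrap)} of {len(all_pairs)}")
```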