One critique is that GPT-3 still takes far too long to “identify” a task: why does it need 50 examples of addition to figure out that the task is addition? Why isn’t 1 sufficient? It’s not as if there are many rival conceptions of “addition” that need to be disambiguated.
Check it out: 2500 is the four-digit chunk ‘2500’. 3500 is the digits ‘35’ followed by the digits ‘00’. And 4500 is the digit ‘4’ followed by the digits ‘500’.
As we head further into 4-digit numerals, we start seeing 3-chunk ones eventually. The first 3-chunk numeral is (place your bets…) 4761 = “ 4” + “76” + “1” (did you guess it?). The next is 4791, then 4861, 4862, 4863, then 4881, and so on in another inscrutable integer sequence.
Unlike 2-chunking, though, 3-chunking is consistent about where to split. It’s always first digit + middle two + last digit. This holds across the whole range from 4761, the first 4-digit / 3-chunk number, to 9984, the last 4-digit / 3-chunk number. Among 4-digit numbers overall, 2.0% are 1 chunk, 95.7% are 2 chunks, and 2.4% are 3 chunks.
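To make the chunkings above concrete, here is a minimal sketch. The token splits are transcribed from the examples in this excerpt (9984 follows the stated first + middle two + last rule), not computed from an actual BPE vocabulary:

```python
# BPE chunkings of some 4-digit numerals, as described above.
# These splits are transcribed from the text, not derived from
# a real GPT-2/GPT-3 merge table.
CHUNKINGS = {
    2500: ["2500"],          # a single 4-digit chunk
    3500: ["35", "00"],      # 2 chunks, split 2+2
    4500: ["4", "500"],      # 2 chunks, split 1+3
    4761: ["4", "76", "1"],  # first 3-chunk 4-digit numeral
    9984: ["9", "98", "4"],  # last 3-chunk 4-digit numeral
}

def chunk_count(n):
    """Number of BPE chunks a numeral splits into (per the table above)."""
    return len(CHUNKINGS[n])

for n, chunks in CHUNKINGS.items():
    assert "".join(chunks) == str(n)  # chunks concatenate back to the numeral
    print(n, "->", chunks, f"({chunk_count(n)} chunks)")
```

Note how nothing about the surface digits predicts the split; the model only ever sees the chunk IDs, never the digits inside them.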
… got that?
GPT-3 isn’t learning ‘arithmetic’, it’s learning ‘arithmetic on 2-chunking’, ‘arithmetic on 2-chunking when overflowing to 3-chunking’, ‘arithmetic on 3-chunking’, ‘arithmetic on 3-chunking overflowing to 1-3 chunks’...
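A sketch of why this multiplies the task: the same digit position lands in a different chunk, at a different offset, depending on the chunking pattern, so each pattern demands its own digit-to-token alignment. The chunkings are the ones quoted above; the span computation is just illustrative:

```python
# Map each chunk to the digit positions (0 = leftmost) it covers.
# Under the 2+2 split the tens digit opens the second token; under
# the 1+3 split it sits one position deep in the second token -- so
# "carry the 1" is a different token-level operation in each case.
def chunk_spans(chunks):
    """Pair each chunk with the half-open digit range it covers."""
    spans, start = [], 0
    for c in chunks:
        spans.append((c, (start, start + len(c))))
        start += len(c)
    return spans

print(chunk_spans(["35", "00"]))    # 2+2 split of 3500
print(chunk_spans(["4", "500"]))    # 1+3 split of 4500
print(chunk_spans(["4", "76", "1"]))  # 3-chunk split of 4761
```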
To update on this: I think that actually, there are a lot of ‘additions’ which are ambiguous in GPT-3, because of the bizarreness of the BPE representation of numbers. I’ve discussed how arithmetic & other tasks empirically improve when avoiding BPEs, but perhaps more useful is to look at the task of ‘addition’ on BPEs directly. Nostalgebraist helpfully provides some tables of BPEs/numbers (the examples excerpted above): https://nostalgebraist.tumblr.com/post/620663843893493761/bpe-blues