Finally, the scramble task asks the model to unshuffle letters back into the correct word, and the arithmetic task covers adding, subtracting, dividing, and multiplying numbers. The most interesting thing about these tasks is that performance doesn’t improve at all in the beginning, and then starts improving very fast. This is some evidence that we might expect non-linear improvements on particular tasks, though I mostly interpret it as these tasks being quite narrow: once a model starts getting the trick, it’s quite easy to get it systematically right.
To beat my usual drum: I think the Arithmetic/Scramble task curves are just due to BPEs. The ‘trick’ here is not that scrambling or arithmetic are actually all that difficult, but that the model needs to memorize enough of the encrypted number/word representations to finally crack the BPE code; once it’s done that, the task itself is straightforward. The cause of this phase transition or ‘breakthrough’, so to speak, is seeing through the scrambled BPE representations. I predict that tricks like rewriting numbers as individual digits, BPE-dropout to expose all possible tokenizations, or, better yet, character-level representations would show much smoother learning curves, and that much smaller models would match GPT-3-175b’s performance. (Does this make arithmetic more or less interesting? That depends on how many tasks you think have ‘tricks’ or pons asinorums. It’s worth noting that plateaus and breakthroughs are widely observed in the development of human children: they may seem to grow steadily each day, but that steadiness is the net of many stacked sigmoids and plateaus/breakthroughs.)
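To make the BPE hypothesis concrete, here is a minimal Python sketch. The merge vocabulary and the helper names (`toy_bpe_segment`, `spell_out_digits`) are invented for illustration and are not GPT-2/GPT-3’s actual tokenizer: a greedy longest-match segmenter shows how the same digits land in different chunks depending on context, and a small preprocessing function implements the ‘rewrite numbers to individual digits’ trick.

```python
# Illustrative sketch only: a toy greedy "BPE-like" segmenter with an
# invented merge vocabulary (NOT GPT-2/GPT-3's real merges), plus the
# digit-rewriting preprocessing trick discussed above.
import re

# Pretend these digit pairs were frequent enough in training data to be merged.
TOY_VOCAB = {"12", "23", "34"}

def toy_bpe_segment(number: str) -> list:
    """Greedy longest-match segmentation, mimicking how a BPE vocabulary
    carves a number into arbitrary multi-digit chunks."""
    out, i = [], 0
    while i < len(number):
        for j in range(len(number), i, -1):  # try the longest candidate first
            if number[i:j] in TOY_VOCAB or j == i + 1:
                out.append(number[i:j])
                i = j
                break
    return out

def spell_out_digits(text: str) -> str:
    """The 'rewrite numbers to individual digits' trick: one token per digit,
    so place value is directly visible to the model."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)

print(toy_bpe_segment("1234"))  # ['12', '34'] -- '3' and '4' share a chunk here...
print(toy_bpe_segment("234"))   # ['23', '4']  -- ...but not here, despite identical trailing digits
print(spell_out_digits("What is 1234 + 234?"))  # 'What is 1 2 3 4 + 2 3 4?'
```

The point of the sketch is just that, under BPE, identical digit strings get inconsistent representations depending on their neighbors, so the model first has to memorize its way through that encoding; a per-digit or character-level representation removes that obstacle entirely.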