which is that you could actually get decent data-efficiency out of current architectures if they were just really, really big?
You mean in some way other than the improvements on zero/few-shotting/meta-learning we already see from stuff like Dactyl or GPT-3 where bigger=better?
Here’s maybe an example of what I’m thinking:
GPT-3 can zero-shot add numbers (to the extent that it can) because it’s had to predict a lot of numbers getting added. And it’s way better than GPT-2, which could only sometimes add 1- or 2-digit numbers (citation just for clarity).
In a “weak scaling” view, this trend (such as it is) would continue: GPT-4 will be able to do more arithmetic, will basically always carry the 1 when adding 3-digit numbers, and will start to do notably well at adding 5-digit numbers, though it will still often fail to carry the 1 across multiple places there. In this picture, adding more data and compute is analogous to interpolating better, and between rarer and rarer examples. After all, that’s all that’s necessary to make the loss go down.
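(To make that concrete, here is a rough sketch of how you might probe which picture we’re in: measure zero-shot addition accuracy as a function of operand length and see where it falls off. `query_model` below is a hypothetical stand-in for however you’d call the model being tested, not any real API.)

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the language model being probed."""
    raise NotImplementedError

def addition_accuracy_by_digits(max_digits: int = 15, trials: int = 100) -> dict:
    """Estimate zero-shot addition accuracy for each operand length."""
    accuracy = {}
    for d in range(1, max_digits + 1):
        correct = 0
        for _ in range(trials):
            a = random.randint(10 ** (d - 1), 10 ** d - 1)
            b = random.randint(10 ** (d - 1), 10 ** d - 1)
            completion = query_model(f"Q: What is {a} plus {b}?\nA:")
            correct += str(a + b) in completion
        accuracy[d] = correct / trials
    return accuracy
```

Under the weak-scaling picture, you’d expect this accuracy curve to track how common each operand length is in the training data, rather than staying flat out to lengths the model has rarely or never seen.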
In a “strong scaling” view, the prediction function that gets learned isn’t just expected to interpolate, but to extrapolate, and extrapolate quite far with enough data and compute. And so maybe not GPT-4, but at least GPT-5 would be expected to “actually learn addition,” in the sense that even if we scrubbed all 10+ digit addition from the training data, it would effortlessly (given an appropriate prompt) be able to add 15-digit numbers, because at some point the best hypothesis for predicting addition-like text involves a reliably-extrapolating algorithm for addition.
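(For contrast, here’s a minimal sketch, mine rather than anything from the thread, of the kind of reliably-extrapolating procedure being gestured at: the grade-school carry algorithm over digit strings. Once something equivalent to this is what’s being computed, 15-digit sums come for free even if no 10+ digit examples ever appeared in training.)

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal digit strings."""
    n = max(len(a), len(b))
    a, b = a.zfill(n), b.zfill(n)
    digits, carry = [], 0
    # Walk from the least-significant digit, carrying the 1 as needed.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

# Extrapolates to operand lengths regardless of how often they were "seen":
assert add_digit_strings("999999999999999", "1") == "1000000000000000"
```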
So in short, how much better is bigger? I think the first (weak-scaling) case is more likely for a lot of different sorts of tasks, and I think it will still lead to super-impressive performance, but with really bad data efficiency. I’m also fairly convinced by Steve’s arguments that humans have architectural/algorithmic reasons for better data efficiency.