I already believe in the scaling hypothesis; I just don’t think we’re in a world that’s going to get to test it until after transformative AI is built by people who’ve continued to make progress on algorithms and architecture.
Perhaps there’s an even stronger hypothesis that I’m more skeptical about, which is that you could actually get decent data-efficiency out of current architectures if they were just really, really big? (My standards for “decent” involve beating what is currently thought of as the scaling law for dataset size for transformers doing text prediction.) If that stronger hypothesis were true, it would greatly increase the importance I’d place on politics / policy ASAP, because we’d already be living in a world where a sufficiently large project would be transformative.
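For concreteness, the dataset-size scaling law I have in mind is the power-law fit from Kaplan et al. (2020). Here’s a rough sketch of what it predicts and what “beating” it would mean; the constants are the approximate published fits, quoted from memory rather than re-derived:

```python
# Rough sketch of the Kaplan et al. (2020) power-law fit for loss vs. dataset size,
# L(D) = (D_c / D)**alpha_D. The constants are approximate published values and
# depend on tokenization and setup; treat them as illustrative, not authoritative.
ALPHA_D = 0.095   # fitted dataset-size exponent
D_C = 5.4e13      # fitted constant, in tokens

def predicted_loss(tokens: float) -> float:
    """Predicted cross-entropy (nats/token) at a given dataset size."""
    return (D_C / tokens) ** ALPHA_D

for d in [3e8, 3e9, 3e10, 3e11]:
    print(f"{d:.0e} tokens -> predicted loss ~ {predicted_loss(d):.2f}")

# "Decent data-efficiency" in the sense above would mean an architecture whose
# loss at, say, 3e9 tokens undercuts predicted_loss(3e9) by a wide margin,
# rather than just riding the same curve with a bigger model.
```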
You mean in some way other than the improvements on zero/few-shotting/meta-learning we already see from stuff like Dactyl or GPT-3 where bigger=better?
Here’s maybe an example of what I’m thinking:
GPT-3 can zero-shot add numbers (to the extent that it can) because it’s had to predict a lot of numbers getting added. And it’s way better than GPT-2, which could only sometimes add 1- or 2-digit numbers (citation just for clarity).
In a “weak scaling” view, this trend (such as it is) would continue: GPT-4 would be able to do more arithmetic, would basically always carry the 1 when adding 3-digit numbers, and would start to do notably well at adding 5-digit numbers, though it would still often fail to carry the 1 across multiple places. In this picture, adding more data and compute is analogous to interpolating better, and between rarer and rarer examples. After all, that’s all that’s necessary to make the loss go down.
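One way to make this prediction concrete (a sketch only; `complete` below is a stand-in for whatever model API you’d actually query, not a real function): measure exact-match accuracy on random n-digit sums and watch where the carry failures start.

```python
import random

def complete(prompt: str) -> str:
    """Stand-in for a call to whatever language model you're probing.
    Replace this with a real API call."""
    raise NotImplementedError("plug in your model here")

def addition_accuracy(n_digits: int, trials: int = 100) -> float:
    """Exact-match accuracy on zero-shot n-digit addition prompts."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        answer = complete(f"Q: What is {a} plus {b}?\nA:").strip()
        correct += answer == str(a + b)
    return correct / trials

# The weak-scaling prediction is roughly: accuracy stays high at 3 digits,
# starts decaying around 5, and falls off wherever the training data gets thin.
# for n in range(2, 9):
#     print(n, addition_accuracy(n))
```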
In a “strong scaling” view, the prediction function that gets learned isn’t just expected to interpolate, but to extrapolate, and extrapolate quite far with enough data and compute. And so maybe not GPT-4, but at least GPT-5 would be expected to “actually learn addition,” in the sense that even if we scrubbed all 10+ digit addition from the training data, it would effortlessly (given an appropriate prompt) be able to add 15-digit numbers, because at some point the best hypothesis for predicting addition-like text involves a reliably-extrapolating algorithm for addition.
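If someone actually wanted to run that test, the scrubbing step might look something like the sketch below. Both the regex and the 10-digit threshold are just illustrative; a real filter would also have to catch worded-out numbers, column-format arithmetic, code, and so on.

```python
import random
import re

# Matches "A + B" where either operand has 10 or more digits. Purely illustrative;
# a serious scrub would need to handle many more surface forms of addition.
BIG_ADDITION = re.compile(r"\b\d{10,}\s*\+\s*\d+\b|\b\d+\s*\+\s*\d{10,}\b")

def keep_document(text: str) -> bool:
    """Keep a training document only if it contains no 10+-digit addition."""
    return BIG_ADDITION.search(text) is None

def make_eval_pair() -> tuple[str, str]:
    """A 15-digit sum the model could only get right by extrapolating."""
    a = random.randint(10**14, 10**15 - 1)
    b = random.randint(10**14, 10**15 - 1)
    return f"Q: What is {a} plus {b}?\nA:", str(a + b)
```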
So in short, how much better is bigger? I think the weak-scaling case is more likely for a lot of different sorts of tasks, and I think it will still lead to super-impressive performance, but with really bad data efficiency. I’m also fairly convinced by Steve’s arguments that humans have architectural/algorithmic reasons for their better data efficiency.