I don’t actually think we’re bottlenecked by data. Chinchilla represents a change in focus (for current architectures), but I think it’s useful to remember what that paper actually told the rest of the field: “hey, you can get way better results for way less compute if you do it this way.”
I feel like characterizing Chinchilla primarily as a bottleneck misses its point. It was a major capability gain, and it tells everyone else how to get even more capability gain. There are some data-related challenges far enough down the implied path, but we have no reason to believe they are insurmountable. In fact, it looks an awful lot like they won’t even be very difficult!
Could you explain why you feel that way about Chinchilla? I found this post: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications to give very compelling reasons for why data should be considered a bottleneck, and I’m curious what makes you say it shouldn’t be a problem at all.
Some of my confidence here arises from things that I don’t think would be wise to blab about in public, so my arguments might not sound quite as convincing as I’d like, but I’ll give it a try.
I wouldn’t quite say it’s not a problem at all, but rather that it’s the type of problem the field is really good at solving. They don’t have to solve ethics or something. They just need to do some clever engineering with the backing of infinite money.
I’d put it at a similar tier of difficulty to scaling up transformers in the first place. That wasn’t nothing! And the industry blew straight through it.
To give some examples that I’m comfortable having in public:
Suppose you stick to text-only training. Could you expand your training sets automatically? Maybe create a higher-quality transcription AI and use it to pad your training set with the entirety of YouTube?
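A minimal sketch of that pipeline, assuming the audio is already downloaded and using the off-the-shelf openai-whisper package (the paths and output format here are just placeholders; a real pipeline would also need downloading, dedup, and quality filtering):

```python
# Minimal sketch: turn a pile of already-downloaded audio into plain-text
# training data with an off-the-shelf ASR model. Paths are placeholders.
import pathlib
import whisper

model = whisper.load_model("base")  # larger checkpoints transcribe more accurately

with open("transcripts.txt", "w", encoding="utf-8") as out:
    for audio_file in sorted(pathlib.Path("audio").glob("*.mp3")):
        result = model.transcribe(str(audio_file))
        out.write(result["text"].strip() + "\n")
```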
Maybe you figure out a relatively simple way to extract more juice from a smaller dataset without collapsing into pathological overfitting.
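The most naive version of that is just running many epochs over the same small dataset while leaning on regularization and early stopping. A toy sketch of the shape of the idea (synthetic data and a throwaway model, not anyone’s actual recipe):

```python
# Many passes over a fixed small dataset, with weight decay and early stopping
# on held-out loss standing in for fancier regularization. Toy model and data.
import torch
from torch import nn

torch.manual_seed(0)
X, y = torch.randn(256, 16), torch.randn(256, 1)
train, val = (X[:192], y[:192]), (X[192:], y[192:])

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.MSELoss()

best_val, patience = float("inf"), 5
for epoch in range(200):  # many epochs over the same data
    model.train()
    optimizer.zero_grad()
    loss_fn(model(train[0]), train[1]).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val[0]), val[1]).item()
    if val_loss < best_val - 1e-4:
        best_val, patience = val_loss, 5
    else:
        patience -= 1
        if patience == 0:
            break  # held-out loss stopped improving; more epochs would just memorize
```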
Maybe you make existing datasets more informative by filtering out sequences that seem to interfere with training.
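One plausible version of that filter: score candidate documents with a small reference model and drop the extremes, since near-zero-loss documents tend to be duplicates or boilerplate and very-high-loss ones tend to be gibberish. A sketch, with illustrative thresholds rather than a known-good recipe:

```python
# Loss-based filtering sketch: score each document with a small reference LM
# and keep only those in a middle band. Thresholds here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def doc_loss(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # labels=input_ids makes the model return mean next-token cross-entropy
        out = scorer(**ids, labels=ids["input_ids"])
    return out.loss.item()

def keep(text: str, lo: float = 1.0, hi: float = 7.0) -> bool:
    return lo < doc_loss(text) < hi

corpus = ["Some ordinary candidate document about a topic.", "asdf asdf asdf asdf"]
filtered = [doc for doc in corpus if keep(doc)]
```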
Maybe you embrace multimodal training where text-only bottlenecks are irrelevant.
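At the data-pipeline level, that can start as simply as drawing some fraction of training batches from an image(-caption) stream instead of the text stream. A toy sketch with stand-in loaders (a real setup would use actual DataLoaders and a model that accepts both modalities):

```python
# Toy multimodal data mix: when text is scarce, a fixed fraction of batches
# comes from an image stream instead. Loaders below are stand-in lists.
import random
from itertools import cycle

text_loader = [f"text batch {i}" for i in range(3)]    # stand-in for a real DataLoader
image_loader = [f"image batch {i}" for i in range(3)]  # stand-in for a real DataLoader

def mixed_batches(text_src, image_src, text_ratio=0.5, steps=10):
    """Yield (modality, batch) pairs, sampling each modality at a fixed ratio."""
    text_iter, image_iter = cycle(text_src), cycle(image_src)
    for _ in range(steps):
        if random.random() < text_ratio:
            yield "text", next(text_iter)
        else:
            yield "image", next(image_iter)

for modality, batch in mixed_batches(text_loader, image_loader):
    print(modality, batch)  # a real loop would route the batch to the model
```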
Maybe you do it the hard way. What’s a few billion dollars?
Another recent example: https://openreview.net/forum?id=NiEtU7blzN
(I guess this technically covers my “by the end of this year we’ll see at least one large model making progress on Chinchilla” prediction, though apparently it was up even before I made the prediction!)