I’m not sure that the AI interacting with the world would help, at least with the narrow issue described here.
If we’re talking about data produced by humans (perhaps solicited from them by an AI), then we’re limited by the timescales of human behavior. The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).
All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, “for free.” But once it’s exhausted, producing more is slow.
IMO, most people are currently overestimating the potential of large generative models, including image models like DALL-E 2, because of this fact.
There was all this massive data already sitting around from human activity (the web, GitHub, “books,” Instagram, Flickr, etc.) long before ML compute/algorithms were anywhere near the point where they needed more data than that.
When our compute finally began to catch up with our data, we effectively spent all the “stored-up potential energy” in that data all at once, and then confused ourselves into thinking that compute was the only necessary input for the reaction.
But now compute has finally caught up with data, and it wants more. We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of “data startup capital.”
I suppose the AI’s interactions with the world could involve soliciting more data of the kind it needs to improve (i.e., active learning), which is much more valuable per unit than generic data.
I would still be surprised if this approach could get much of anywhere without soliciting humans on a massive scale, but it’d be nice to see a back-of-the-envelope calculation using existing estimates of the benefit of active learning.
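For concreteness, here’s a toy version of what that calculation might look like. Every number in it is a placeholder assumption I made up for illustration, not an estimate from the literature, so treat it as a template rather than a result:

```python
# Toy back-of-the-envelope: how much human effort would actively solicited
# data require?  All constants below are placeholder assumptions.

CORPUS_TOKENS = 1e12              # assumed size of the existing scraped corpus
ACTIVE_LEARNING_MULTIPLIER = 10   # assumed value of a solicited token vs. a generic one
TOKENS_PER_HUMAN_HOUR = 1_000     # assumed useful tokens a person writes per hour
TARGET_FRACTION = 0.01            # suppose we want the effective equivalent of 1% more corpus

effective_tokens_needed = TARGET_FRACTION * CORPUS_TOKENS
solicited_tokens_needed = effective_tokens_needed / ACTIVE_LEARNING_MULTIPLIER
human_hours = solicited_tokens_needed / TOKENS_PER_HUMAN_HOUR
person_years = human_hours / 2_000  # ~2,000 working hours per year

print(f"{human_hours:,.0f} human-hours ≈ {person_years:,.0f} person-years")
# With these made-up numbers: 1,000,000 human-hours ≈ 500 person-years,
# just to add the effective equivalent of 1% of the existing corpus.
```

Even with a generous 10x multiplier for targeted data, the human-hours involved stay large, which is why I expect the solicitation to have to happen at massive scale.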
It seems to me that the key to human intelligence is nothing like what LMs do anyway; we don’t just correlate vast quantities of text tokens. Words have meanings: they correlate to objects in our world model, learned through lived experience, and sentences correspond to claims about how those objects relate to one another or are changing. Without being rooted in sensory, and perhaps even motor, experience, I don’t think general intelligence can be achieved. Language by itself can only go so far.
Language models seem to do a pretty good job at judging text “quality” in a way that agrees with humans. And of course, they’re good at generating new text. Could it be useful for a model to generate a bunch of output, filter it for quality by its own judgment, and then continue training on its own output? If so, would it be possible to “bootstrap” arbitrary amounts of extra training data?
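A minimal sketch of that loop, just to make the idea concrete. The model choice and the quality proxy (the model’s own average log-likelihood) are assumptions for illustration, not a claim that this is the right way to judge quality:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_samples(prompt: str, n: int = 8, max_new_tokens: int = 128) -> list[str]:
    """Sample n continuations of the prompt."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=tok.eos_token_id,
    )
    return [tok.decode(seq, skip_special_tokens=True) for seq in out]

def quality_score(text: str) -> float:
    """Crude quality proxy: the model's own average log-likelihood per token."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

samples = generate_samples("Once upon a time,")
samples.sort(key=quality_score, reverse=True)
kept = samples[: len(samples) // 4]  # keep the top quarter by self-judged quality
# `kept` would then go back into the fine-tuning pool.  The obvious worry is
# that the model is grading its own homework, so its errors can compound.
```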
It might be even better to augment the data with quality judgements instead of only keeping the high-quality samples. This way, quality can take the form of a natural-language description instead of a single built-in scalar, and you can later prime the model with an appropriate sense/dimension/direction of quality, as a kind of objective, without retraining.
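A minimal sketch of that augmentation, assuming the quality judgements already exist as free-form text (the tag format and the example judgements are arbitrary choices for illustration):

```python
def annotate(sample: str, judgement: str) -> str:
    # Prefix each training sample with its natural-language quality judgement.
    # The "[quality: ...]" tag format is an arbitrary choice for illustration.
    return f"[quality: {judgement}]\n{sample}"

training_pool = [
    annotate("The mitochondrion is the powerhouse of the cell...",
             "clear, factually careful explanation"),
    annotate("cell power thing makes energy idk",
             "vague, low-effort"),
]
# Train on `training_pool` as usual.  Later, to steer generation toward a chosen
# sense of quality without retraining, start the prompt with the matching tag:
prompt = "[quality: clear, factually careful explanation]\n"
```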
We may be running up against text data limits on the public web. But the big data companies got that name for a reason. If they can tap into the data of a Gmail, Facebook Messenger, or YouTube, then they will find tons more fuel for their generative models.
I definitely think it makes LM --> AGI less likely, although I didn’t think it was very likely to begin with.