In the age of big data and machine learning, it was common to say that data was the new oil. Every big company spent heavily building Hadoop clusters and hiring data scientists to discover new things in the data they had. In retrospect, it’s easy to see that big data failed to live up to its promise.
Today, when companies train their LLMs, they can use datasets that are publicly available: Common Crawl made up 82% of the raw tokens used to train GPT-3, and its latest version is 428TB. The new phrase in town is “high-value tokens”: in this world, adding data generated by humans exclusively for the dataset (e.g., RLHF annotations) helps far more than just adding lots of unrelated data.
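The 82% figure can be checked back-of-the-envelope from the raw token counts reported in the GPT-3 paper (“Language Models are Few-Shot Learners”, Table 2.2); a minimal sketch:

```python
# Raw token counts from the GPT-3 paper, Table 2.2, in billions of tokens.
datasets = {
    "Common Crawl (filtered)": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}

total = sum(datasets.values())  # 499B raw tokens in the full mix
share = datasets["Common Crawl (filtered)"] / total
print(f"Common Crawl share of raw tokens: {share:.0%}")  # → 82%
```

Note that the paper *sampled* Common Crawl at only 60% of the training mix despite its 82% share of raw tokens — an early example of weighting toward higher-quality sources.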
But I keep seeing tech executives talk about how important data is, which runs opposite to my understanding of AI. Here’s Larry Ellison in the latest Oracle earnings call, replying to a question about whether the data gravity in his systems of record would change with the advent of AI:
Well, you know, you can’t build any of these AI models without enormous amounts of training data. So, if anything, what—what AI—generative AI has shown that the big issue about training one of these models is just getting—that this vast amount of data ingested into your GPU supercluster, it is a huge data problem in a sense you need so much data to, you know, train OpenAI, to train ChatGPT 3.5. They read the entire public internet. They read all of it, Wikipedia, they read everything.
They ingested everything. And to specialize—then you take something like ChatGPT 4.0 and you want to specialize it, you need specialized training data from electronic health records. Does it help doctors diagnose and treat cancer, let’s say. And we have imaging partners, for example, that are ingesting huge amounts of image data to train their AI models.
We have Ronin, another partner of ours in AI, ingesting huge amounts of electronic health records to train their models. AI doesn’t work without getting access to and ingesting enormous amounts of data. So, in terms of a shift away from data or a change in gravity to AI, AI is utterly dependent upon vast amounts of training data. Trillions of elements went into building ChatGPT 3.5, multiple times that for ChatGPT 4.0 because you had to deal with all the image data and ingest all of that to train—to train image recognition.
So, we think this is very good for our database business. And Oracle’s new vector database will contain highly specialized training data like electronic health records while keeping that data anonymized and private yet still training the specialized models that can—that can help doctors improve their diagnostic capability and their treatment prescriptions for cancer and heart disease and all sorts of other diseases. So, we think it’s a boon to our business, and we are now getting into the deep water of the information age. Nothing has changed about that.
The demands on data are getting stronger and more important.
This line is common among other companies’ executives as well (obviously Ellison is biased toward answers that put Oracle in a good spot).
But what Ellison is saying goes contrary to my intuition about AI. Companies like Nuance or Cerner may need data to train their medical LLMs, but a hospital, or even an insurer like UnitedHealth or a pharma company like Pfizer, has no edge whatsoever from having more data. Most of these tokens will be bought, generated by hired humans (you can literally hire 500 doctors to generate them), or taken from what is publicly available.
In this world, some big companies will spend a lot to generate data (it’s speculated that Google already has a data budget in the billions): they’ll buy it from other companies and pay human beings to produce it (which makes total sense when there’s a real possibility the world spends $100B on GPUs in 2024). But what you want is high-quality data that isn’t repetitive.
Does Less Wrong agree that data is less valuable in the new world of AI?
Your title question and your closing question are not the same. Oil is valuable when demand exceeds supply and when easy-to-access supply declines and needs to be replaced with harder-to-access supply. Demand grows when there are more things we can do with oil and falls when we find better ways to do those things. When and where oil is abundant and easy to access, it’s cheap. Why should data be different?
But metaphors aside, your closing question is still interesting. More data should never have zero or negative value. And if the key is choosing the right data, then I’d expect the heavily-trained-on-big-data LLMs or equivalent will be part of the solution for generating and validating and curating data for training more efficient, targeted models in the future.
Consider the limiting cases. Right now we start with a randomly initialized model and use lots of data. The far end of the intelligence curve may or may not look like Deep Thought’s “even before its data banks had been connected up it had started from I think therefore I am and got as far as deducing the existence of rice pudding and income tax before anyone managed to turn it off” level. But even for Deep Thought, more data is still better: it can make better use of any individual piece of data than any less capable system could; it just may not need it. So the value can be high even if the price-per-unit it’s willing to pay is low, because of its excellent BATNA. Just like how the value of a tiny amount of precious metal is extremely high to me for what it does in my electronics and in catalysts used to make products I buy (compared to the ancients, who could only use it for decoration, jewelry, and money), but that doesn’t mean I’ll pay an exorbitant amount for lots of gold.
That’s as far as my speculation can go here, but that’s how I think about it.