Your title question and closing question are not the same. Oil is valuable when demand exceeds supply and when easy-to-access supply declines and needs to be replaced with harder-to-access supply. Demand grows when there are more things we can do with oil and falls when we find better ways to do those things. When and where oil is abundant and easy to access, it’s cheap. Why should data be different?
But metaphors aside, your closing question is still interesting. More data should never have zero or negative value. And if the key is choosing the right data, then I’d expect LLMs heavily trained on big data, or their equivalents, to be part of the solution for generating, validating, and curating data to train more efficient, targeted models in the future.
Consider the limiting cases. Right now we start with a randomized initial model and use lots of data. The far end of the intelligence curve may or may not look like Deep Thought’s “even before its data banks had been connected up it had started from I think therefore I am and got as far as deducing the existence of rice pudding and income tax before anyone managed to turn it off” level. But even for Deep Thought, more data is still better. It can make better use of any individual piece of data than any less capable system could. It just may not need it. So the value can be high, even if the price-per-unit it’s willing to pay is low because of its excellent BATNA. Similarly, a tiny amount of precious metal is extremely valuable to me for what it does in my electronics and in catalysts used to make products I buy (the ancients could only use it for decoration, jewelry, and money), but that doesn’t mean I’ll pay an exorbitant amount for lots of gold.
That’s as far as my speculation can go here, but that’s how I think about it.