Yes, I like the point of view shared in this post, but I agree with you. As datasets grow sufficiently large, diverse and complex, they do an increasingly better job at revealing the underlying reality. It is like taking a rubbing of a very complex object with a very fine pencil. Your first ten strokes miss so much that they are nearly noise. The first hundred also miss a lot and are hard to interpret. But a million? The information in each individual stroke may be low, but the underlying features remain a consistent force.
I think anyone working with toy datasets, or even moderate-sized datasets of a few million examples, ought to keep this issue very much in mind.
One could argue that the information coming through language is systematically missing certain key aspects of the underlying reality. My intuition strongly suggests, though, that a language corpus combined with a video corpus contains all the information needed to accurately model reality.
I’d be happy to bet on this point if anyone cares to make a prediction market about it.
I think humanity’s text corpus is sufficiently rich and comprehensive for language models to generalise far into the superhuman domain.
“I’d be happy to bet on this point if anyone cares to make a prediction market about it.”
Hmmmm. Gary Marcus, where are you?