A comment from Hacker News on this piece:

The reason that language models require large amounts of data is because they lack grounding. When humans write a sentence about, let’s say, “fire”, we can relate that word to visual, auditory and kinesthetic experiences built from a coherent world model. Without this world model, the LM needs a lot of examples; essentially it has to remember all the different contexts in which the word “fire” appears and figure out when it’s appropriate to use this word in a sentence [...]
In other words, language models need so much more language data than humans because they have no symbol grounding, and they have no symbol grounding because they lack a world model. This hypothesis predicts that the amount of text data required will shrink once multimodal models form world models and associate words with sensory data (e.g. from being trained on video).