This is very interesting: thanks for plotting it.
However, something is likely to happen that might perturb this extrapolation. Companies building large foundation models will probably soon start building multimodal models (indeed, GPT-4 is already multimodal, since it understands images as well as text). This will happen for at least three interrelated reasons:
1. Multimodal models are inherently more useful, since they also understand some combination of images, video, music… as well as text, and the relationships between them.
2. It is going to be challenging to find orders of magnitude more high-quality text data than already exists on the Internet, but there are huge amounts of video and image data (YouTube, TV and cinema, Google Street View, satellite imagery, everything any Tesla's cameras have ever uploaded, …), and the models of reality needed to understand and predict text, images, and video appear to overlap and interact significantly and usefully.
3. Video is likely to give the models a better understanding of the commonsense aspects of physical reality that matter to humans (and humanoid robots): humans are heavily visual, and so is much of the society we've built.
The question then is: does a thousand tokens' worth of text, video, or image data teach the model the same net amount? It seems plausible that video or image data might require more input to learn the same amount (depending on the details of compression and tokenization), in which case training compute requirements would increase, which could throw the trend lines off. Even if not, the set of skills the model is learning will be larger, and while some of those skills overlap across modalities, others don't, which could also alter the trend lines.
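To make the token-accounting concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption rather than anything from the post or from a specific lab: ViT-style 16x16 patches over 224x224 frames, one sampled frame per second of video, roughly 1.3 BPE tokens per English word, and the common C ≈ 6·N·D approximation for training compute.

```python
# Back-of-the-envelope sketch (all constants are illustrative assumptions, not
# measurements): compare how many training tokens a fixed "amount of content"
# yields as text vs. video, and what that implies for training compute under
# the common approximation C ≈ 6 * N * D (N = parameters, D = tokens).

TEXT_TOKENS_PER_WORD = 1.3   # rough BPE average for English (assumed)
IMAGE_PATCH = 16             # ViT-style 16x16 pixel patches (assumed)
IMAGE_SIDE = 224             # a 224x224 frame (assumed)
VIDEO_FPS_SAMPLED = 1        # sample 1 frame per second of video (assumed)

def text_tokens(words: int) -> float:
    return words * TEXT_TOKENS_PER_WORD

def image_tokens() -> int:
    # (224 // 16) ** 2 = 196 tokens per frame under these assumptions
    return (IMAGE_SIDE // IMAGE_PATCH) ** 2

def video_tokens(seconds: float) -> float:
    return seconds * VIDEO_FPS_SAMPLED * image_tokens()

def training_compute(n_params: float, n_tokens: float) -> float:
    # Standard rough estimate of training FLOPs: 6 * N * D
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    # "One minute of content" in each modality, under the assumptions above.
    tokens_text = text_tokens(150)   # ~150 spoken-pace words per minute
    tokens_video = video_tokens(60)  # one minute of sampled video frames
    print(f"text : {tokens_text:,.0f} tokens")
    print(f"video: {tokens_video:,.0f} tokens "
          f"({tokens_video / tokens_text:.0f}x more)")

    # For a fixed model size, compute scales linearly with token count.
    n = 70e9  # hypothetical 70B-parameter model
    ratio = training_compute(n, tokens_video) / training_compute(n, tokens_text)
    print(f"compute ratio: {ratio:.0f}x")
```

Under these crude assumptions, a minute of video yields tens of times more tokens than a minute's worth of spoken-pace text, so even if each token taught the model a similar amount, the compute needed per "unit of content" could shift, which is one concrete way multimodal training could bend the trend lines.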