Gemini will bring the next big timeline update
There is a genre of LLM critique that criticises LLMs for being, well, LLMs.
Yann LeCun, for example, points to GPT-4's inability to visually imagine the rotation of interlocking gears as evidence of how far away AGI is, rather than as evidence that GPT-4 has not yet been trained on video data.
There are now many models that “understand” images, videos or even more modalities. However, they are not trained end-to-end on these modalities. Instead, they use an intermediary model like CLIP, which translates the input into the language domain. This is a rather big limitation, because CLIP can only represent concepts in images that are commonly described in image captions.
Why do I consider this a big limitation? Currently it looks like intelligence emerges from learning to solve a huge number of tiny problems. Language seems to contain a lot of useful tiny problems. Additionally, it is the interface to our kind of intelligence, which allows us to assess and use the intelligence extracted from huge amounts of text.
This means that adding a modality with a CLIP-like embedding and then doing some fine-tuning does not add any intelligence to the system. It only adds eyes or ears or gears.
Training end-to-end on multi-modal data should allow the model to extract new problem-solving circuits from the new modalities. The resulting model would not just have eyes, but visual understanding.
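As a rough illustration of the difference between the two setups, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the three scalar weights `w_enc`, `adapter` and `w_lm` are hypothetical stand-ins for a pretrained image encoder, a projection into the language model, and the language model itself, and no real system is this simple. The only point is which weights receive the error signal.

```python
# Toy model (hypothetical): prediction = w_lm * adapter * w_enc * x
# "w_enc"   stands in for a pretrained image encoder,
# "adapter" for the projection into the language model,
# "w_lm"    for the language model itself.

def gradients(params, x, target):
    """Gradients of the squared error with respect to each scalar weight."""
    w_enc, adapter, w_lm = params["w_enc"], params["adapter"], params["w_lm"]
    prediction = w_lm * adapter * w_enc * x
    d_pred = 2.0 * (prediction - target)
    return {
        "w_enc": d_pred * w_lm * adapter * x,
        "adapter": d_pred * w_lm * w_enc * x,
        "w_lm": d_pred * adapter * w_enc * x,
    }


def train_step(params, x, target, trainable, lr=0.01):
    """One gradient-descent step that only updates the weights in `trainable`."""
    grads = gradients(params, x, target)
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }


initial = {"w_enc": 1.0, "adapter": 1.0, "w_lm": 1.0}

# Add-on setup: the encoder is frozen, only adapter and language model are tuned.
add_on = train_step(initial, x=2.0, target=1.0, trainable={"adapter", "w_lm"})

# End-to-end setup: the error signal also reshapes the encoder.
end_to_end = train_step(
    initial, x=2.0, target=1.0, trainable={"w_enc", "adapter", "w_lm"}
)

print("add-on encoder unchanged:", add_on["w_enc"] == initial["w_enc"])
print("end-to-end encoder changed:", end_to_end["w_enc"] != initial["w_enc"])
```

In the add-on case the encoder can only pass along whatever it already represented before it was attached. In the end-to-end case the downstream task can change what the encoder extracts in the first place.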
Last year, DeepMind did a mildly convincing proof-of-concept with Gato, a small transformer trained on text, images, computer games and robotics. Now it seems they will try to scale Gato up to Gemini, leapfrogging GPT-4 in the process.
GPT-4 itself has image processing capabilities that are not yet available to the general public. But we don’t yet know whether these are an add-on or the result of integrated image modelling.
To me it seems very likely that a world where the current AI boom fizzles is a world where multi-modality does not bring much benefit, or we cannot figure out how to do it right, or possibly the compute requirements of doing it right are still prohibitive.
I think Gemini will give us a good chunk of information about whether that is the world we are living in.