Gemini will bring the next big timeline update
There is a genre of LLM critique that criticises LLMs for being, well, LLMs.
Yann LeCun, for example, points to GPT-4’s inability to visually imagine the rotation of interlocking gears as a fact that shows how far away AGI is, rather than as a fact that shows GPT-4 simply has not been trained on video data yet.
There are many models now that “understand” images or videos or even more modalities. However, they are not trained end-to-end on these multiple modalities. Instead they use an intermediary model like CLIP that translates into the language domain. This is a rather big limitation, because CLIP can only represent concepts in images that are commonly described in image captions.
Why do I consider this a big limitation? Currently it looks like intelligence emerges from learning to solve a huge number of tiny problems. Language seems to contain a lot of useful tiny problems. Additionally it is the interface to our kind of intelligence, which allows us to assess and use the intelligence extracted from huge amounts of text.
This means that adding a modality with a CLIP-like embedding and then doing some fine-tuning does not add any intelligence to the system. It only adds eyes or ears or gears.
Training end-to-end on multi-modal data should allow the model to extract new problem solving circuits from the new modalities. The resulting model would not just have eyes, but visual understanding.
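To make the contrast concrete, here is a minimal sketch of the two approaches. This is illustrative PyTorch of my own, not the actual architecture of GPT-4, Gato, or Gemini; the module names, shapes, and the assumption that the vision encoder returns patch features are all placeholders.

```python
import torch
import torch.nn as nn


class BoltedOnVision(nn.Module):
    """Adapter approach: freeze a CLIP-style image encoder and train only a
    small projection that maps its features into the LLM's embedding space.
    The visual representations stay whatever caption-style pre-training made them."""

    def __init__(self, vision_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False                       # the "eyes" are frozen
        self.projection = nn.Linear(vision_dim, llm_dim)  # trainable glue layer

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images -> (batch, patches, vision_dim) -> (batch, patches, llm_dim)
        with torch.no_grad():
            feats = self.vision_encoder(images)
        return self.projection(feats)                     # pseudo-tokens for the LLM


class EndToEndMultimodal(nn.Module):
    """End-to-end approach: same components, but gradients from the
    language-modelling loss flow back into the vision encoder, so the visual
    representations themselves can be reshaped by the training objective."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # trainable
        self.projection = nn.Linear(vision_dim, llm_dim)
        self.llm = llm                                    # trainable

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.projection(self.vision_encoder(images))
        # Prepend image tokens to the text sequence; the LLM attends over both.
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))
```

The only difference in code is where the gradient stops, and the claim here is that this difference decides whether the model merely gets eyes or gets visual understanding.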
DeepMind did a mildly convincing proof-of-concept last year with Gato, a small transformer trained on text, images, computer games and robotics. Now it seems they will try to scale Gato up to Gemini, leapfrogging GPT-4 in the process.
GPT-4 itself has image processing capabilities that are not yet available to the general public. But whether these are an add-on or the result of integrated image modelling, we don’t know yet.
To me it seems very likely that a world where the current AI boom fizzles is a world where multi-modality does not bring much benefit, where we cannot figure out how to do it right, or where the compute requirements of doing it right are still prohibitive.
I think Gemini will give us a good chunk of information about whether that is the world we are living in.
I agree that Gemini will give us an update on timelines. But even if it’s not particularly impressive, there’s another route to LLM improvements that should be mentioned in any discussion on LLM timelines.
The capabilities of LLMs can be easily and dramatically improved, at least in some domains, by using scaffolding scripts that prompt the LLM to do internal reasoning and call external tools, as in HuggingGPT. These include creating sensory simulations with generative networks, then interpreting those simulations to access modality-specific knowledge. SmartGPT and Tree of Thoughts show massive improvements in logical reasoning using simple prompt arrangements. Whether or not these expand into full language-model-based cognitive architectures (LMCAs), LLMs don’t need to have sensory knowledge embedded to use it. Given the ease of fine-tuning, adding this knowledge in an automated way seems within reach as well.
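For concreteness, here is a minimal sketch of the scaffolding pattern being described, in the spirit of HuggingGPT-style tool use but not any system’s actual implementation. `llm_complete`, the tool registry, and the `TOOL: <name> | <input>` convention are hypothetical placeholders.

```python
from typing import Callable, Dict


def calculator(expression: str) -> str:
    """Toy external tool: evaluate an arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}


def llm_complete(prompt: str) -> str:
    """Placeholder for a real chat-completion call to whatever LLM API is used."""
    raise NotImplementedError


def scaffolded_answer(question: str, max_steps: int = 5) -> str:
    """Ask the LLM to reason step by step and optionally request a tool call.
    The scaffold script, not the model, runs the tool and feeds the result back."""
    transcript = (
        f"Question: {question}\n"
        "Think step by step. To use a tool, reply exactly 'TOOL: <name> | <input>'.\n"
    )
    reply = ""
    for _ in range(max_steps):
        reply = llm_complete(transcript)
        if reply.startswith("TOOL:"):
            name, tool_input = (s.strip() for s in reply[len("TOOL:"):].split("|", 1))
            result = TOOLS[name](tool_input)
            transcript += f"{reply}\nTool result: {result}\n"
        else:
            return reply                      # the model gave a final answer
    return reply
```

Tree of Thoughts-style methods replace this single loop with branching over candidate reasoning steps and scoring them, but the basic pattern is the same: a plain script around the model adds the capability.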
latent capacity overhang
Yes. That’s why we should include these likely improvements in our timelines.
It seems like multi-modality will also result in AIs that are much less interpretable than pure LLMs.
This is not obvious to me. It seems somewhat likely that the multimodality actually induces more explicit representations and uses of human-level abstract concepts, e.g. a Jennifer Aniston neuron in a human brain is multimodal.
Relevant: Goh et al. finding multimodal neurons (ones responding to the same subject in photographs, drawings, and images of their name) in the CLIP image model, including ones for Spiderman, USA, Donald Trump, Catholicism, teenage, anime, birthdays, Minecraft, Nike, and others.