OK, I was thinking about this a bit and finally got some time to write it down. I realized that it is quite hard to make predictions about the first version of GATO, as it depends on what the team would prioritize in development. Therefore I'll try to predict some attributes/features of a GATO-like model that should be available in the next two years, while expecting that many will appear sooner—it is just difficult to say which ones. I'm not a professional ML researcher, so I might get some factual things wrong; I would be happy for people with more insight to correct me.
First, the prediction regarding the size of the model: I would expect a GATO-like architecture to have greater commercial success/usefulness than e.g. GPT-3, so the investment should also be higher. Furthermore, I would guess there will be several significant improvements to training infrastructure, e.g. from companies such as Cerebras or Graphcore. Therefore I estimate the model will use somewhere between 10-100x more compute than GPT-3. This might result in the model having more parameters, a larger context window, or most likely both. I predict the most likely context window size to be ~40,000 tokens (roughly 20x GPT-3's 2048) and a parameter count of ~1T (roughly 6x GPT-3's 175B). Regarding the context window—I think there will be some algorithmic improvements, so it won't work the same way as before (see below).
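As a rough sanity check on these numbers, here is a back-of-the-envelope calculation using the standard approximation that training compute ≈ 6 × parameters × tokens, together with GPT-3's published figures (175B params, ~300B tokens). The token count for the hypothetical model is purely my own assumption:

```python
# Back-of-the-envelope check: is ~1T params consistent with 10-100x GPT-3's compute?
# Standard approximation: training FLOPs ~= 6 * N (params) * D (tokens).

gpt3_params = 175e9                          # GPT-3: 175B parameters
gpt3_tokens = 300e9                          # GPT-3: ~300B training tokens
gpt3_flops = 6 * gpt3_params * gpt3_tokens   # ~= 3.15e23 FLOPs

# Hypothetical GATO-like model: 1T params, trained on (say) 3x the tokens.
# The 3x token count is an illustrative assumption, not a prediction.
new_params = 1e12
new_tokens = 900e9
new_flops = 6 * new_params * new_tokens      # ~= 5.4e24 FLOPs

print(f"compute ratio vs GPT-3: {new_flops / gpt3_flops:.0f}x")  # ~17x, inside 10-100x
```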
Since GATO is multimodal, I would expect the scaling laws to change a bit due to transfer learning. E.g., since the model won't need to extract all information about the shapes of objects from text alone, but can instead just look at images, it should be much easier for it to answer questions such as "Can scissors be inserted into a glass bottle?", and so it should require significantly less data. The scaling laws would thus also need to be multi-dimensional, to answer what ratio of audio/text/image/video is optimal for the best results. For example, to improve the language-model part of GATO, we may counterintuitively need to train on more images instead of more text.
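To make the "multi-dimensional scaling laws" idea concrete, here is one hypothetical functional form: a Chinchilla-style loss with an extra cross-modal transfer coefficient γ, which is entirely my own invention for illustration:

```latex
% Hypothetical multimodal scaling law (Chinchilla-style form with an assumed
% cross-modal transfer coefficient gamma; E, A, B, alpha, beta are fitted constants).
% L_text: language loss; N: parameters; D_text, D_img: training tokens per modality.
L_{\mathrm{text}}(N, D_{\mathrm{text}}, D_{\mathrm{img}})
  \;=\; E \;+\; \frac{A}{N^{\alpha}}
      \;+\; \frac{B}{\bigl(D_{\mathrm{text}} + \gamma\, D_{\mathrm{img}}\bigr)^{\beta}}
```

If γ > 0, image tokens partially substitute for text tokens, which is exactly the counterintuitive "train on more images to improve the language model" effect.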
I predict that GATO will be trained on text, images, audio and video. I believe they will also attempt image/audio/video generation, which they didn't do in the current version, essentially by predicting the next image/video token rather than by using diffusion models. The context window seems too small for video; however, I believe there are two reasons why this won't be a huge problem. First, by using something like a Perceiver, where more processing happens on recently seen tokens and only a small amount of computation is spent on older, far-away tokens, the context window could be increased significantly (or some other kind of memory could be added).
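A minimal sketch of what I mean, in PyTorch: a small, fixed set of latent vectors cross-attends to the arbitrarily long older history, so distant tokens cost a constant amount per step. All dimensions and the module structure are illustrative assumptions, not the actual Perceiver or GATO architecture:

```python
import torch
import torch.nn as nn

class LatentMemory(nn.Module):
    """Perceiver-style compression: a fixed set of learned latent vectors
    cross-attends to the (arbitrarily long) older token history, so per-step
    compute stays constant. All sizes here are illustrative assumptions."""
    def __init__(self, d_model=512, n_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, old_tokens):            # old_tokens: (batch, seq_len, d_model)
        batch = old_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(q, old_tokens, old_tokens)
        return compressed                     # (batch, n_latents, d_model)

# Recent tokens would get full self-attention; the distant past survives only
# as these 64 compressed latents prepended to the context.
memory = LatentMemory()
old = torch.randn(2, 10_000, 512)             # a long, cheaply summarized history
print(memory(old).shape)                      # torch.Size([2, 64, 512])
```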
Second, I don't think the model needs to see the whole image/video. When humans look at something, only a very small part of the image is sharp and the rest is blurry. Similarly, I think Gato will get image information from just a small number of tokens that describe a small rectangle of the picture/video sharply, plus some small number of tokens describing the blurred rest. There would further be action tokens describing the "eye movement" when the focus shifts to a different part of the image. In this way I think GATO will be able to watch/generate videos or read/write long books.
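A toy sketch of this foveated encoding (all sizes are my own assumptions; a real system would map these pixel arrays to discrete codes with something like a VQ tokenizer):

```python
import numpy as np

def foveated_tokens(image, fovea_xy, fovea_size=32, periphery_size=16):
    """Foveation sketch: a sharp crop around the gaze point plus a heavily
    downsampled version of the whole frame. Sizes are illustrative; the point
    is that the token count is orders of magnitude below the full frame's."""
    x, y = fovea_xy
    h = fovea_size // 2
    fovea = image[y - h:y + h, x - h:x + h]     # sharp 32x32 crop at the gaze point
    step = max(image.shape[0] // periphery_size, 1)
    periphery = image[::step, ::step]            # blurry global gist of the frame
    return fovea, periphery

frame = np.random.rand(256, 256, 3)
sharp, blurry = foveated_tokens(frame, fovea_xy=(128, 128))
print(sharp.shape, blurry.shape)                 # (32, 32, 3) (16, 16, 3)
```

An "eye movement" would then just be a discrete action token that selects the next fovea_xy.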
Furthermore, I think that in general, RL could be used to make the model "think slow" when predicting tokens. For example, instead of the task of predicting the next token, GATO could be trained on the RL task "find a set of actions by which you discover what the next token is". So to predict the next image token, next word, or next action in a game, it would first look around the image/video/book and only emit the right token after collecting the relevant information. Possibly it could also emit tokens summarizing the information it has seen so far, distilled from a large amount of data it might need in the future. Of course, it would probably still be trained on Atari games (likely now with actual RL) or in some coding environment with predefined inputs/outputs, but I think these will be much less significant compared to the "information-finding RL". A smaller feature might be that GATO could emit commands modifying its own context window, deleting tokens from it or adding tokens to it.
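Here is a toy version of what such an information-finding task could look like. The environment, action space and reward are all my own construction for illustration, not anything from the GATO paper:

```python
import random

# Toy "information-finding RL" task: the next token is determined by a hidden
# clue the agent must look up before answering. Purely a hypothetical sketch.
class LookupEnv:
    def __init__(self, n_pages=8):
        self.pages = [random.randint(0, 9) for _ in range(n_pages)]
        self.clue_page = random.randrange(n_pages)
        self.answer = self.pages[self.clue_page]   # "next token" = the clue value

    def look(self, page):                          # an information-gathering action
        return self.pages[page]

def episode(env, policy_lookups):
    # The agent spends some actions looking around, then emits its token guess.
    seen = {p: env.look(p) for p in policy_lookups}
    guess = seen.get(env.clue_page, random.randint(0, 9))
    return 1.0 if guess == env.answer else 0.0     # reward: correct prediction

env = LookupEnv()
print(episode(env, policy_lookups=range(8)))       # look everywhere -> reward 1.0
```

An RL policy trained on rewards like this would learn to look in the right places before committing to a token, which is the "think slow" behavior I have in mind.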
So, some capabilities I would predict in 2 years: generation of images, video, audio and text, even quite long ones, e.g. the size of a book, or a 5-minute video. Instead of "Let's think step by step" we would have "Let's sketch the solution", to draw diagrams etc. Gato would be able to reasonably operate a computer using text commands—e.g. search the internet, use Paint, an IDE/debugger and so on. It would be much better at coding by combining various prompting methods, and would perform in the top 10% of competitive programmers (compared to AlphaCode being around the 50th percentile). It would solve some IMO problems (probably not enough for a gold medal, but maybe a bronze by ignoring combinatorics). It would act like a smart research assistant—e.g. finding relevant papers, pointing out their strengths/weaknesses, suggesting improvements/ideas. It would learn to play entirely new (Atari) games about as fast as humans—though this would probably require RL instead of just being a prediction model. It would complete an IQ test and get an above-average result.
Capabilities I don't expect: generating novel jokes; outperforming the best humans at long-term planning, research, math or coding. Image/video generation wouldn't match reality. Similarly, AI-generated books wouldn't sell well. Empathy—it wouldn't make a very good friend. It will be slow and expensive, so real-time robotics will probably still be a challenge. Also, it won't be a very reliable doctor, despite being connected to the internet.
Oh, and besides IQ tests, I predict it would also be able to pass most current CAPTCHA-like tests (though humans would still be better at some).
I should also make a prediction for the nearer version of GATO, to actually answer the questions from the post. So if a new version of GATO appears in the next 4 months, I predict:
80% confidence interval: Gato will have 50B-200B params. The context window will be 2-4x larger (similar to GPT-3's).
50%: No major algorithmic improvements, RL or memory. Maybe use of a Perceiver. Likely some new tokenizers. The improvements would come more from new data and scale.
80%: More text, images, video and audio. More games and new kinds of data, e.g. special prompting to do something in a game, draw a picture, or perform some action.
75%: Visible transfer learning. Gato trained on more tasks and pre-trained on video would perform better in most (but not all) games, compared to a model of similar size trained just on the particular task. The language-model part would be able to describe the shapes of objects better after being trained together with images/video/audio.
70%: Chain-of-thought reasoning would perform better compared to an LLM of similar size. The improvement won't be huge, though, and I wouldn't expect it to gain surprisingly sophisticated new LLM capabilities.
80%: It won't be able to play new Atari games at a human level, but there would be visible progress—the actions would be less random and more directed towards the goal of the game. With sophisticated prompting, e.g. "First describe what the goal of this game is, how to play it, and what the best strategy is", significant improvements would be seen, but still sub-human.
Some of my updates:
At least one version with several trillion parameters, and a context window at least 100k tokens long (with embeddings etc., seemingly 1 million). Otherwise, I am quite surprised that I mostly still agree with my predictions regarding multimodal/RL capabilities. I think robotics could still see some latency challenges, but there should anyway be significant progress on tasks not requiring fast reactions—e.g. picking up things, cleaning a room, etc. Things like SuperAGI might become practically useful, and controlling a computer with text/voice would seem easy.