Yeah, makes sense. I guess maybe part of what’s going on is that forecasting the next Gato in particular is less exciting/interesting than forecasting stuff like AGI, even though in order to forecast the latter it’s valuable to build up skill forecasting things like the former.
Anyhow here are some ass-number predictions I’ll make to answer some of your questions, thanks for putting them up:
75% confidence interval: Gato II will be between 5B and 50B parameters. Context window will be between 2x and 4x as long.
75% credence: No significant new algorithmic improvements, just some minor things that make it slightly better, or maybe nothing at all.
90% credence: They’ll train it on a bigger suite of tasks than Gato. E.g. more games, more diverse kinds of text-based tasks, etc.
70% credence: There will be some transfer learning, in the following sense: it will be clear from the paper that Gato II would probably outperform on task X a hypothetical variant of itself that had trained only on task X and on nothing else (holding fixed the amount of task X training). For most but not all tasks studied. (A sketch of what this comparison amounts to follows these predictions.)
60% credence: Assuming they test Gato II’s ability to improve via chain-of-thought, the gains from doing so will be greater than the gains for a similarly sized language model.
90% credence: Gato II will still not be able to play new Atari games as well as humans without being trained on them, i.e. it'll be sub-human-level on games it hasn't been trained on.
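To make the transfer-learning prediction concrete, here is a minimal sketch in Python of the comparison it refers to. Nothing here comes from the Gato paper or DeepMind's codebase; transfer_gap, train_model, and evaluate are hypothetical names, and the only substantive point is that the amount of task-X training is held fixed between the two runs.

```python
# Minimal sketch (not DeepMind's code) of the comparison behind the 70% prediction:
# a model trained on the whole task suite vs. a variant trained only on task X,
# with the task-X training budget held fixed. train_model and evaluate are
# hypothetical placeholders supplied by the caller.
from typing import Any, Callable, Dict, List


def transfer_gap(
    task_x: str,
    all_tasks: List[str],
    task_x_budget: int,
    train_model: Callable[[List[str], Dict[str, int]], Any],
    evaluate: Callable[[Any, str], float],
) -> float:
    """Return score(multi-task model) - score(task-X-only model) on task X.

    A positive gap on most (but not all) tasks studied is what the
    prediction above counts as "transfer learning".
    """
    # Multi-task run: sees every task, with the task-X budget pinned;
    # budgets for the other tasks are whatever the training setup uses.
    multi_task = train_model(all_tasks, {task_x: task_x_budget})
    # Single-task run: task X alone, with the identical task-X budget.
    single_task = train_model([task_x], {task_x: task_x_budget})
    return evaluate(multi_task, task_x) - evaluate(single_task, task_x)
```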
75% credence: No significant new algorithmic improvements, just some minor things that make it slightly better, or maybe nothing at all.
“Nothing at all” would really surprise me. I would expect the programmers working on it to do at least some work on the algorithm. It might be minor things, but I would expect them to decide to invest more compute in training only once they have at least some improvement to show.
75% confidence interval: Gato II will be between 5B and 50B parameters. Context window will be between 2x and 4x as long.
It seems like the main reason Gato is currently small is that they want it to be able to interact with robots at real-world speed. Even with that goal, it would be possible to train one 100B-parameter model and then downscale it for the applications that need to be that fast.
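To illustrate one way "downscale" could work, here is a minimal knowledge-distillation sketch in PyTorch: a small, fast student is trained to match a frozen large teacher's output distribution. The architectures, data, and hyperparameters are toy placeholders; nothing says this is how DeepMind would actually do it.

```python
# Toy knowledge-distillation sketch: one way a big model could be "downscaled"
# into a smaller, lower-latency one for real-time control. All shapes, models,
# and data here are placeholders, not Gato's actual setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 32))  # stand-in for the large model
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))      # smaller, faster model
teacher.eval()

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # softmax temperature: softens the teacher's distribution

for step in range(1000):
    x = torch.randn(256, 128)  # placeholder inputs; real ones would be tokenised observations
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    # KL divergence between the softened teacher and student output distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```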