Sorry this post hasn’t gotten much engagement! It would get more engagement from me if it were less effort to engage; you could make it less effort to engage by putting up specific questions to forecast on. For example, look at the performance metrics for Gato and ask what percentage better Gato2 will perform on average across all those metrics. Or pick out a few specific metrics of interest. Maybe also ask about the language abilities of Gato2 in some way. And how many parameters it’ll have. Etc.
I see. I will update the post with some questions. I find it quite difficult, though, to forecast by what percentage the performance metrics will improve, compared to just predicting capabilities, since the datasets are probably not that well known.
Yeah, makes sense. I guess maybe part of what’s going on is that forecasting the next Gato in particular is less exciting/interesting than forecasting stuff like AGI, even though in order to forecast the latter it’s valuable to build up skill forecasting things like the former.
Anyhow, here are some ass-number predictions I’ll make to answer some of your questions; thanks for putting them up:
75% confidence interval: Gato II will be between 5B and 50B parameters. Context window will be between 2x and 4x as long.
75% credence: No significant new algorithmic improvements, just some minor things that make it slightly better, or maybe nothing at all.
90% credence: They’ll train it on a bigger suite of tasks than Gato. E.g. more games, more diverse kinds of text-based tasks, etc.
70% credence: There will be some transfer learning, in the following sense: it will be clear from the paper that Gato II would probably outperform, on task X, a hypothetical variant of itself trained only on task X and on nothing else (holding fixed the amount of task X training), for most but not all tasks studied. (A rough formalization of this criterion is sketched after these predictions.)
60% credence: Assuming they test Gato II’s ability to improve via chain-of-thought, the gains from doing so will be greater than the gains for a similarly sized language model.
90% credence: Gato II will still not be able to play new Atari games (ones it hasn’t been trained on) as well as humans can, i.e. it’ll be sub-human-level on such games.
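To make the transfer-learning criterion above concrete, here is a minimal sketch of how one might check it. This is purely my own formalization, not anything from the Gato paper; the score function, model names, and task names are hypothetical stand-ins.

```python
# Hypothetical formalization of the transfer criterion above; nothing here
# corresponds to a real DeepMind API or evaluation harness.

def shows_transfer(tasks, score, multitask_model, single_task_models):
    """Count the tasks on which the multi-task model beats a single-task
    variant trained on the same amount of task-specific data, and report
    whether that holds for most (but not necessarily all) tasks."""
    wins = sum(
        score(multitask_model, task) > score(single_task_models[task], task)
        for task in tasks
    )
    return wins > len(tasks) / 2, wins, len(tasks)


# Toy usage with made-up scores, just to show the shape of the comparison.
if __name__ == "__main__":
    toy_scores = {
        ("gato2", "atari_pong"): 0.9, ("pong_only", "atari_pong"): 0.8,
        ("gato2", "stack_blocks"): 0.7, ("stack_only", "stack_blocks"): 0.75,
    }
    tasks = ["atari_pong", "stack_blocks"]
    single = {"atari_pong": "pong_only", "stack_blocks": "stack_only"}
    result = shows_transfer(
        tasks,
        score=lambda model, task: toy_scores[(model, task)],
        multitask_model="gato2",
        single_task_models=single,
    )
    print(result)  # (False, 1, 2) -> transfer on only 1 of the 2 toy tasks
```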
75% credence: No significant new algorithmic improvements, just some minor things that make it slightly better, or maybe nothing at all.
“Nothing at all” would feel really surprising to me. I would expect the programmers working on it to do some work on the algorithm. It might be minor things, but I would expect that they would only decide to invest more compute into training once they have at least some improvement to show.
75% confidence interval: Gato II will be between 5B and 50B parameters. Context window will be between 2x and 4x as long.
It seems like the main reason Gato is currently small is that they want it to be able to interact with robots at real-world speed. Even with that goal, it would be possible to train one 100B-parameter model and then downscale it for applications that need to be that fast.
I’d be interested to forecast on whether Gato2 will see bigger gains from chain-of-thought style stuff than a similarly-sized LLM.
I weakly predict yes.