If someone succeeds in getting, say, a ~13B-parameter model to match the performance (on high-level tasks) of a previous-gen model 10x its size, using a 10x smaller FLOPs budget during training, isn’t that a pretty big win for Eliezer? That seems to be roughly what is happening: this list mostly has larger models at the top, but not uniformly so. A rough sketch of the arithmetic behind that claim is below.
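For concreteness, here is a back-of-the-envelope version of that claim, using the common C ≈ 6·N·D training-compute approximation (N = parameters, D = training tokens). The parameter and token counts are made up for illustration, not specs of any actual model:

```python
# Back-of-the-envelope training-FLOPs comparison, using the common
# C ~= 6 * N * D approximation (N = parameters, D = training tokens).
# All numbers below are illustrative, not actual model specs.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6 * n_params * n_tokens

old_gen = train_flops(n_params=130e9, n_tokens=300e9)  # hypothetical 130B model
new_gen = train_flops(n_params=13e9,  n_tokens=300e9)  # hypothetical 13B model

print(f"previous-gen: {old_gen:.2e} FLOPs")
print(f"new-gen:      {new_gen:.2e} FLOPs")
print(f"ratio:        {old_gen / new_gen:.0f}x")  # 10x, from 10x fewer params alone
```

Under this approximation, shrinking the parameter count 10x at a fixed token budget gives the 10x training-FLOPs saving for free; the hard part, and the algorithmic win, is keeping task performance equal while doing so.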
I’d say it was more like this: there was a large minimum amount of compute needed before anything worked at all, but most of the innovation in LLMs comes from algorithmic improvements, which were equally necessary to make them work.
Hobbyists and startups can train their own models from scratch without massive capital investment, though not the very largest models, and not entirely for free. This does depend on massive capital expenditures by hardware manufacturers to keep improving the underlying compute technology, but massive capital investment in silicon manufacturing is nothing new, even if AI has accelerated and directed it somewhat over the last 15 years.
And I don’t think it would have surprised Eliezer (or anyone else in 2008) that throwing more compute at some problems yields gradually increasing performance. For example, in 2008 you could have made a massive capital investment to build the largest supercomputer in the world, and gotten the best chess engine by letting the SotA algorithms search one or two levels deeper in the chess game tree. Or you could have spent that money paying researchers to keep looking for algorithmic improvements and optimizations.
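To make the chess point concrete: search cost grows exponentially with depth, so brute compute buys only a ply or two. A quick sketch, assuming an effective branching factor of about 6 for an alpha-beta engine (chess has a raw branching factor of roughly 35; good pruning roughly square-roots it; both figures are standard ballparks, not measurements of any particular engine):

```python
# Rough cost of searching deeper in chess, assuming an effective
# branching factor of ~6 for an alpha-beta searcher. (Raw branching
# factor is ~35; alpha-beta pruning roughly square-roots it.)
# Both numbers are ballpark assumptions for illustration.

EFFECTIVE_BRANCHING = 6

def relative_cost(extra_plies: int) -> int:
    """Compute multiplier needed to search `extra_plies` deeper."""
    return EFFECTIVE_BRANCHING ** extra_plies

for plies in (1, 2, 3):
    print(f"+{plies} ply: ~{relative_cost(plies)}x the compute")
# +1 ply: ~6x, +2 plies: ~36x -- each extra level of depth costs a
# multiplicative factor, which is why a supercomputer only buys one
# or two levels over a desktop running the same algorithm.
```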
Coming in late, but the surprising thing for Yudkowsky’s models is that compute turned out to be far more important than he realized: even on the models most favorable to Yudkowsky, the compute/algorithms split is usually about 50/50, meaning compute increases are not negligible and algorithms aren’t totally dominant.

Even granting the assumption that algorithms will increasingly be the bottleneck and compute less important, Yudkowsky badly overrated the power of algorithms/thinking hard relative to simply getting more resources and scaling up.