Implementing old methods more vigorously is more or less exactly what got modern deep learning started
is just straightforwardly true.
Everyone dates the start of the deep learning… thing to 2012’s AlexNet. AlexNet has convolutions, and ReLU, and backprop, but didn’t invent any of them. Here’s what Wikipedia says is important about AlexNet:
AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.[3] The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner up. The original paper’s primary result was that the depth of the model was essential for its high performance, which was computationally expensive, but made feasible due to the utilization of graphics processing units (GPUs) during training.[2]
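To make the “didn’t invent any of them” point concrete, here’s a minimal sketch (in modern PyTorch, so obviously not the 2012 code) showing that all three ingredients are off-the-shelf primitives:

```python
# Minimal sketch, not AlexNet itself: convolution, ReLU, and backprop are all
# standard library calls here, just as they were pre-existing ideas in 2012.
import torch
import torch.nn as nn

tiny_convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutions (1980s, LeCun-era)
    nn.ReLU(),                                   # ReLU nonlinearity (predates AlexNet)
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),
)

x = torch.randn(8, 3, 32, 32)                    # toy batch of 32x32 RGB images
targets = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(tiny_convnet(x), targets)
loss.backward()                                  # backprop (1980s-era algorithm)
```

What won ImageNet was essentially this, stacked much deeper and trained on GPUs for days, which is the “more vigorously” part.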
So… I think that what I’m saying about how DL started is the boring consensus. Of course, new algorithms did come along, and I agree that they are important. But still: if there’s something important that has worked without big compute, what is it?
(I do agree that in a counterfactual world I’d probably prefer to get Attention is All You Need.)
And yeah, I accidentally posted this a month ago for 30 min when it was in draft, so you might have seen it before.
Implementing old methods more vigorously is more or less exactly what got modern deep learning started
is just straightforwardly true.
I would say that the “vigor” was almost entirely bottlenecked on researcher effort and serial thinking time, rather than compute resources. A bunch of concrete examples to demonstrate what I mean:
There are some product rollouts (e.g. multimodal and 32k GPT-4) and probably some frontier capabilities experiments which are currently bottlenecked on H100 capacity. But this is a very recent phenomenon, and it has more to do with the sudden massive demand for inference than with anything about training. In the meantime, there are plenty of other ways that OpenAI and others are pushing the capabilities frontier besides just piling more layers onto more GPUs.
If you sent back the recipe for training the smallest GPT model that works at all (GPT-J 6B, maybe?), people in 2008 could probably cobble together existing GPUs into a supercomputer, or, failing that, have the foundries of 2008 fabricate ASICs, and have it working in <1 year.
OTOH, if researchers in 2008 had access to the computing resources of today, I suspect it would take them many years to get to GPT-3. Maybe not the full 14 years, since many of their smaller-scale training runs would now finish much quicker, and they could try larger things out much faster. But the time required to (a) think up which experiments to try, (b) implement the code to try them, and (c) analyze the results dominates the time spent waiting around for a training run to finish.
More generally, it’s not obvious how to “just scale things up” with more compute, and figuring out the exact way of scaling things is itself an algorithmic innovation (there’s a toy sketch of this further down). You can’t just literally throw GPUs at the DL methods of 10-15 years ago and have them work.
“Researcher time” isn’t exactly the same thing as algorithmic innovation, but I think that has been the actual bottleneck on the rate of capabilities advancement during most of the last 15 years (again, with very recent exceptions). That seems much more like a win for Yudkowsky than Hanson, even if the innovations themselves are “boring” or somewhat obvious in retrospect.
In worlds where Hanson was actually right, I would expect that the SoTA models of today are all controlled by whichever cash-rich organizations (including non-tech companies and governments) can literally dump the most money into GPUs. Instead, we’re in a world where any tech company which has the researcher expertise can usually scrounge up enough cash to get ~SoTA. Compute isn’t a negligible line item in the budget even for the biggest players, but it’s not (and has never been) the primary bottleneck on progress.
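To illustrate the “figuring out the exact way of scaling things is itself an algorithmic innovation” point: even the question of how to split a fixed training budget between parameters and data was an open research question until the Chinchilla scaling-law work. Here is a toy sketch using the rough published heuristics (C ≈ 6ND training FLOPs, and roughly 20 tokens per parameter); the numbers are illustrative, not anything from the comments above:

```python
# Toy sketch of a compute-allocation decision, assuming Chinchilla-style heuristics:
# training FLOPs C ~= 6 * N * D, and a compute-optimal ratio of D ~= 20 * N.
# The point is that this split was itself a research result, not something you
# get for free by buying more GPUs.
def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return an illustrative (n_params, n_tokens) allocation for a fixed budget."""
    n_params = (compute_flops / (6 * 20)) ** 0.5   # from C = 6 * N * (20 * N)
    n_tokens = 20 * n_params
    return n_params, n_tokens

params, tokens = compute_optimal_split(1e23)       # an illustrative large budget
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e9:.0f}B tokens")
```

GPT-3, for instance, was trained with far fewer tokens per parameter than this heuristic suggests, which is exactly the kind of non-obvious knob that took researcher time rather than GPUs to find.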
But—regardless of Yudkowsky’s current position—it still remains that you’d have been extremely surprised by the last decade’s use of compute if you had believed him, and much less surprised if you had believed Hanson.
I don’t think 2008!Yudkowsky’s world model is “extremely surprised” by the observation that, if you have an AI algorithm that works at all, you can make it work ~X% better using a factor of Y more computing power.
For one, that just doesn’t seem like it would be very surprising to anyone in 2008, though maybe I’m misremembering the recent past or suffering from hindsight bias here. For two, Eliezer himself has always been pretty consistent that, even if algorithmic innovation is likely to be more important, one of the first things that a superintelligence would do is turn as much of the Earth as possible into computronium as fast as possible. That doesn’t seem like someone who would be surprised that more compute is more effective.