Slowing compute growth could lead to a greater focus on efficiency. Easy-to-find efficiency gains will be found anyway, but harder-to-find efficiency gains don’t currently seem to me to be getting much effort, relative to ways of deriving some benefit from rapidly increasing amounts of compute.
If models on the capabilities frontier are currently not very efficient, because their creators are focused on getting any benefit at all from the most compute that is practically available to them now, then restricting compute could trigger an existing “efficiency overhang”. If (some of) the efficient techniques found are also scalable (some and maybe most won’t be, to be sure), then when larger amounts of compute do later become available, we could end up with greater capabilities at the point a given amount of compute becomes available, relative to the world where available compute kept going up too smoothly to incentivize a focus on efficiency.
This seems reasonably likely to me. You seem to consider this negligibly likely. Why?
I briefly discuss my skepticism in footnote 12. I struggle to tell a story in which labs would only pursue algorithmic improvements if they couldn’t scale training compute. But I’m pretty unconfident, and contrary opinions from people at major labs would change my mind.
I certainly don’t think labs will only try to improve algorithms if they can’t scale compute! Rather, I think that the algorithmic improvements that will be found by researchers trying to figure out how to improve performance given twice as much compute as the last run won’t be the same ones found by researchers trying to improve performance given no increase in compute.
One would actually expect the low-hanging fruit in the compute-no-longer-growing regime to be specifically the techniques that don’t scale, since, after all, scaling well is an existing constraint that the compute-no-longer-growing regime removes. I’m not talking about those. I’m saying it seems reasonably likely to me that the current techniques producing state-of-the-art results are very inefficient, and that a newfound focus on “how much can you do with N FLOPs, because that’s all you’re going to get for the foreseeable future” might yield fundamentally more efficient techniques that turn out to scale better too.
It’s certainly possible that with a compute limit, labs will just keep doing the same “boring” stuff they already “know” they can fit into that limit… it just seems to me that people in AI safety advocating for compute limits are overconfident that this is what would happen. It seems to me that the strongest plausible version of this possibility should be addressed by anyone arguing in favor of compute limits. I currently weakly expect that compute limits would make things worse because of these considerations.
Thanks. Idk. I’m interested in evidence. I’d be surprised by the conjunction of (1) you’re more likely to get techniques that scale better by looking for “fundamentally more efficient techniques that turn out to scale better too”, and (2) labs aren’t currently trying that.
Some points which I think support the plausibility of this scenario:
(1) EY’s ideas about a “simple core of intelligence”, how chimp brains don’t seem to have major architectural differences from human brains, etc.
(2) RWKV vs. Transformers. Why haven’t Transformers been outright replaced by RWKV at this point? It looks to me like potentially huge efficiency gains are being basically ignored because lab researchers can get away with it. Granted, AFAIK it affects the efficiency of inference rather than training, and maybe it wouldn’t work at the 100B+ scale, but it certainly looks like enough evidence to do the experiment. (A rough sketch of the inference-cost difference follows after this list.)
(3) Why didn’t researchers jump straight to the end on smaller and smaller floating-point (or fixed-point) precision? Okay, sure, “the hardware didn’t support it” can explain some of it, but if you’re serious about maximizing efficiency you could still run smaller-scale experiments to show it appears to work, and get support into the next generation of hardware (or, at some point, even custom hardware if the gains are huge enough). (The basic memory arithmetic is sketched after this list.)
(4) I have a few more ideas for huge efficiency gains that I don’t want to state publicly. Probably most of them wouldn’t work. But the thing about huge efficiency gains is that if they do work, doing the experiments to find that out is (relatively) cheap, because of the huge efficiency gains. I’m not saying anyone should update on my claim to have such ideas, but if you understand modern ML, you can try to answer the question “what would you try if you wanted to drastically improve efficiency” and update on the answers you come up with. And there are probably better ideas than those, and almost certainly more such ideas. I end up mostly thinking lab researchers aren’t trying because it’s just not what they’re being paid to do, and/or it isn’t what interests them. Of course they are trying to improve efficiency, but they’re looking for smaller improvements that are more likely to pan out, not massive improvements any given one of which probably won’t work.
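On (2), here is a minimal back-of-envelope sketch of the inference-cost difference. Everything in it is an assumption picked purely for illustration (a GPT-3-scale model shape, fp16 values, and a hand-wavy “about five state vectors per layer” for the RWKV side); the only point is the scaling: a Transformer’s KV cache grows linearly with context length, while an RWKV-style recurrent state stays constant.

```python
# Back-of-envelope comparison of per-token inference memory for a decoder-only
# Transformer (KV cache) vs. an RWKV-style recurrent model (fixed-size state).
# Illustrative only: the model shape and the state-size constant are assumptions.

def kv_cache_bytes(n_layers: int, d_model: int, context_len: int, bytes_per_value: int = 2) -> int:
    """KV cache: one key vector and one value vector per layer per past token (fp16 = 2 bytes)."""
    return 2 * n_layers * d_model * context_len * bytes_per_value

def recurrent_state_bytes(n_layers: int, d_model: int, bytes_per_value: int = 2) -> int:
    """RWKV-style state: a handful of width-d_model vectors per layer, independent of context length."""
    state_vectors_per_layer = 5  # rough stand-in; the exact count depends on the RWKV variant
    return n_layers * state_vectors_per_layer * d_model * bytes_per_value

if __name__ == "__main__":
    n_layers, d_model = 96, 12288  # GPT-3-scale shape, purely for illustration
    for ctx in (2_048, 32_768, 128_000):
        kv = kv_cache_bytes(n_layers, d_model, ctx) / 1e9
        state = recurrent_state_bytes(n_layers, d_model) / 1e9
        print(f"context {ctx:>7,}: KV cache ~{kv:7.1f} GB, recurrent state ~{state:.3f} GB")
```

At these (assumed) sizes the KV cache goes from roughly 10 GB at a 2k context to hundreds of GB at a 128k context, while the recurrent state stays around a hundredth of a GB; whether the quality holds up at frontier scale is exactly the experiment I’m saying looks worth running.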
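Similarly, the precision point in (3) is just arithmetic on the weight-storage side; whether training works and quality survives at a given precision is the empirical question the smaller-scale experiments would be for. A minimal sketch, with a GPT-3-sized parameter count assumed for illustration:

```python
# Rough weight-storage footprint at different numeric precisions.
# Pure arithmetic; the 175B parameter count is an illustrative assumption.

PRECISION_BITS = {"fp32": 32, "fp16/bf16": 16, "int8": 8, "int4": 4}

def weight_memory_gb(n_params: float, bits: int) -> float:
    """GB needed just to hold the weights at the given precision."""
    return n_params * bits / 8 / 1e9

if __name__ == "__main__":
    n_params = 175e9  # GPT-3-sized, purely for illustration
    for name, bits in PRECISION_BITS.items():
        print(f"{name:>10}: ~{weight_memory_gb(n_params, bits):6.1f} GB of weights")
```

That’s 700 GB at fp32 down to under 90 GB at int4 for the same parameter count, which is the kind of factor I mean when I say the gains look big enough to justify pushing for hardware support early.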
Anyway, I think a world in which you could even run GPT-4-quality inference (let alone training) on a current smartphone looks like a world where AI is soon going to determine the future more than humans do, if that hasn’t already happened at that point. I’m far from certain this is where compute limits (moderate ones, not crushingly tight ones that would restrict or ban a lot of already-deployed hardware) would lead, but it doesn’t seem to me that people advocating for compute limits have really considered this possibility, even if only to say why they find it very unlikely. (Well, I guess if you only care about buying a moderate amount of time, compute limits would probably do that even in this scenario, since researchers can’t pivot on a dime to improving efficiency, and we’re specifically talking about higher-hanging efficiency gains here.)