It’s certainly true that having a lot of hardware is super-useful. One can try more things, one can pull more resources towards things deemed more important, one can do longer runs if a training scheme does not saturate, but keeps improving.
:-) And yes, I don’t think a laptop with a 4090 is existentially dangerous (yet), and even a single installation with 8 H100s is probably not enough (at the current and near-future state of algorithmic art) :-)
But take a configuration worth a few million dollars, and one starts having some chances...
Of course, if a place with more hardware decides to adopt a non-standard scheme invented by a relatively hardware-poor place, the place with more hardware would win. But a non-standard scheme might be non-public, and even if it is public, people often have strong opinions about what to try and what not to try, and those opinions might interfere with a timely attempt.
I think that non-standard architectural and algorithmic breakthroughs can easily make smaller players competitive, especially as the inertia of adhering to “what has been proven before” will inhibit the largest players.
Do these exist?
Yes, of course, we are seeing a rich stream of promising new things, ranging from evolutionary schemes (many of which tend towards open-endedness and therefore might be particularly unsafe, while being very promising), to various derivatives of Mamba, to potentially more interpretable architectures (like Kolmogorov-Arnold networks, or the recent Memory Mosaics, an academic collaboration with Meta which has not yet been a consumer of significant compute), to GFlowNet motifs from Bengio’s group, and so on.
These things are mostly coming from places which seem to have “medium compute” (although we don’t have exact knowledge of their compute): Schmidhuber’s group, Sakana AI, Zyphra AI, Liquid AI, and so on. And I doubt that Microsoft or Google have a program dedicated to “trying everything that looks promising”, even though it is true that they have the manpower and hardware to do just that. But would they choose to do that?
OK, so we are likely to have that (I don’t think he is over-optimistic here), and the models are already quite capable of discussing AI research papers and exhibiting good comprehension of them (that’s one of my main use cases for LLMs: to help me understand an AI research paper better and faster). And they will get better at that as well.
This really does not sound like AGI to me (or at least it highly depends on what “a coding project” means here).
If it’s an open-ended AI project, it sounds like “foom before AGI”, with AGI-strength appearing at some point on the trajectory as a side-effect.
The key here is that when people discuss “foom”, they usually tend to focus on a (rather strong) argument that AGI is likely to be sufficient for “foom”. But AGI is not necessary for “foom”: one can have “foom” fully in progress before full AGI is achieved (“the road to superintelligence goes not via human equivalence, but around it”).
And I doubt that Microsoft or Google have a program dedicated to “trying everything that looks promising”, even though it is true that they have the manpower and hardware to do just that. But would they choose to do that?
Actually I’m under the impression that a lot of what they do is just sharing papers in a company Slack and reproducing stuff at scale. Now of course they might intuitively rule out certain approaches that they think are dead ends but that turn out to be promising, but I wouldn’t underestimate their agility at adopting new approaches if something unexpected is found.[1] My mental model is entirely informed by Dwarkesh’s interview with DeepMind researchers. They talk about ruthless efficiency in trying out new ideas and seeing what works. They also talk about how having more compute would make them X times better researchers.
Yes, of course, we are seeing a rich stream of promising new things, ranging from evolutionary schemes (many of which tend towards open-endedness and therefore might be particularly unsafe, while being very promising), to various derivatives of Mamba, to potentially more interpretable architectures (like Kolmogorov-Arnold networks, or the recent Memory Mosaics, an academic collaboration with Meta which has not yet been a consumer of significant compute), to GFlowNet motifs from Bengio’s group, and so on.
I think these are all at best marginal improvements and will be dwarfed by more compute[2], or at least will only beat SOTA after being given more compute. I think the space for algorithmic improvement at a given amount of compute saturates quickly. Also, if anything, the average smaller place will over-index on techniques that crank out a little extra performance on smaller models but fail at scale.
Of course, if a place with more hardware decides to adopt a non-standard scheme invented by a relatively hardware-poor place, the place with more hardware would win.
My mental model of the hardware-poor is that they want to publicize their results as fast as they can so they get more clout, VC funding, or just end up essentially acquired by big tech. Academic recognition in the form of citations drives the researchers. Getting rich drives the founders.
The key here is that when people discuss “foom”, they usually tend to focus on a (rather strong) argument that AGI is likely to be sufficient for “foom”. But AGI is not necessary for “foom”: one can have “foom” fully in progress before full AGI is achieved (“the road to superintelligence goes not via human equivalence, but around it”).
Yes, I agree there is a small possibility, but I find this almost a “Pascal’s mugging”. I think there is a stickiness of the AlphaGo model of things that’s informing some choices which are objectively bad in a world where the AlphaGo model doesn’t hold. The fear response to the low-odds world is not appropriate for the high-odds world.
[1] I think the time it takes to deploy a model after training is making people think these labs are slower than they actually are.
[2] As an example, most improvements in Llama-3 came from just training the models on more data (with more compute). Sora looks worse than SOTA approaches until you throw more compute at it.
And I doubt that Microsoft or Google have a program dedicated to “trying everything that looks promising”, even though it is true that they have the manpower and hardware to do just that. But would they choose to do that?
Actually I’m under the impression that a lot of what they do is just sharing papers in a company Slack and reproducing stuff at scale.
I’d love to have a better feel for how many of the promising things they try to reproduce at scale...
Unfortunately, I don’t have enough inside access for that...
My mental model of the hardware-poor is that they want to publicize their results as fast as they can so they get more clout, VC funding, or just end up essentially acquired by big tech. Academic recognition in the form of citations drives the researchers. Getting rich drives the founders.
There are all kinds of people. I think Schmidhuber’s group might be happy to deliberately create an uncontrollable foom, if they can (they have Saudi funding, so I have no idea how much hardware they actually have, or what options for more hardware they have contingent on preliminary results). Some other people just don’t think their methods are strong enough to be that unsafe. Some people do care about safety (but still want to go ahead; some of those say “this is potentially risky, but in the future, and not right now”, and they might be right or wrong). Some people feel their approach does increase safety (they might be right or wrong). A number of people are ideological (they feel that their preferred approach is not getting a fair shake from the research community, and they want to make a strong attempt to show that the community is wrong and myopic)...
I think most places tend to publish some of their results for the reasons you’ve stated, but they are also likely to hold some of the stronger things back (at least for a while); after all, if one is after VC funding, one needs to show those VCs that there is some secret sauce which remains proprietary...
Unfortunately, I don’t have enough inside access for that...
Yeah, with you there. I am just speculating based on what I’ve heard online and through the grapevine, so take my model of their internal workings with a grain of salt. With that said I feel pretty confident in it.
if one is after VC funding, one needs to show those VCs that there is some secret sauce which remains proprietary
IMO software/algorithmic moat is pretty impossible to keep. Researchers tend to be smart enough to figure it out independently, even if a lab manages to stop any researcher from leaving and diffusing the knowledge. Some parallels:
The India trade done by Jane Street. They were making billions of dollars contingent on the fact that no one else knew about this trade, but eventually their alpha got diffused as well.
TikTok’s content algorithm, which the Chinese government doesn’t want to export, only took a couple of months for Meta/Google to replicate.
if one is after VC funding, one needs to show those VCs that there is some secret sauce which remains proprietary
IMO software/algorithmic moat is pretty impossible to keep.
Indeed.
That is, unless the situation is highly non-stationary (i.e., algorithms and methods keep being modified rapidly without stopping; of course, a foom would be one such situation, but I can imagine a more pedestrian “rapid-fire” evolution of methods which goes at a good clip, but does not accelerate beyond reason).