As far as I can tell, this is what is going on: they do not have any such thing, because GB and DM do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OA do.
GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc’s group occasionally surprises you. They’ll dabble in something like GShard, but mostly because they expect to be able to deploy it, or something like it, to production in Google Translate.
DM (particularly Hassabis, I’m not sure about Legg’s current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then DMLab is the game plan. They have been biting off and chewing pieces of the brain for a decade, and it’ll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM ‘bet the company’ on any moonshot; Google’s cashflow isn’t going anywhere, and slow and steady wins the race.
OA, lacking anything like DM’s long-term funding from Google or its enormous headcount, is making a startup-like bet that they know an important truth which is a secret: “the scaling hypothesis is true” and so simple DRL algorithms like PPO on top of large simple architectures like RNNs or Transformers can emerge and meta-learn their way to powerful capabilities, enabling further funding for still more compute & scaling, in a virtuous cycle. And if OA is wrong to trust in the God of Straight Lines On Graphs, well, they never could compete with DM directly using DM’s favored approach, and were always going to be an also-ran footnote.
While all of this hypothetically can be replicated relatively easily (never underestimate the amount of tweaking and special sauce it takes) by competitors if they wished (the necessary compute budgets are still trivial compared to Big Science or other investments like AlphaGo or AlphaStar or Waymo, after all), said competitors lack the very most important thing, which no amount of money or GPUs can ever cure: the courage of their convictions. They are too hidebound and deeply philosophically wrong to ever admit fault and try to overtake OA until it’s too late. This might seem absurd, but look at the repeated criticism of OA every time they release a new example of the scaling hypothesis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3… (When faced with the choice between having to admit all their fancy hard work is a dead end, swallow the bitter lesson, and start budgeting tens of millions for compute, or instead writing a tweet explaining how, “actually, GPT-3 shows that scaling is a dead end and it’s just imitation intelligence”—most people will get busy on the tweet!)
What I’ll be watching for is whether orgs beyond ‘the usual suspects’ (MS ZeRO, Nvidia, Salesforce, Allen, DM/GB, Connor/EleutherAI, FAIR) start participating or if they continue to dismiss scaling.
Feels worth pasting in this other comment of yours from last week, which dovetails well with this:
DL so far has been easy to predict—if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3. Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?
Personally, these two comments have kicked me into thinking about theories of AI in the same context as also-ran theories of physics like vortex atoms or the Great Debate. It really is striking how long one person with a major prior success to their name can push for a theory when the evidence is being stacked against it.
A bit closer to home than DM and GB, it also feels like a lot of AI safety people have missed the mark. It’s hard for me to criticise too loudly because, well, ‘AI anxiety’ doesn’t show up in my diary until June 3rd (and that’s with a link to your May newsletter). But a lot of AI safety work increasingly looks like it’d help make a hypothetical kind of AI safe, rather than helping with the prosaic ones we’re actually building.
I’m committing something like the peso problem here in that lots of safety work was—is—influenced by worries about the worst-case world, where something self-improving bootstraps itself out of something entirely innocuous. In that sense we’re kind of fortunate that we’ve ended up with a bloody language model fire-alarm of all things, but I can’t claim that helps me sleep at night.
I’m imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year’s state-of-the-art AI designs. It’s interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown that at DL once it began to win the race.
On the other hand, maybe committing to 2010’s AI paradigm would have made them a laughingstock by 2015, and killed the field. Maybe the org would have too much inertia to pivot, and it would have taken away the oxygen for anyone else to do DL-compatible AI safety work. Maybe it would have stated its problems less clearly, inviting more philosophical confusion and even more hangers-on answering the wrong questions.
Or, worst, maybe it would have made a juicy target for a hostile takeover. Compare what happened to nanotechnology research (and nanotech safety research) when too much money got in too early—savvy academics and industry representatives exiled Drexler from the field he founded so that they could spend the federal dollars on regular materials science and call it nanotechnology.
One thing they could have achieved was dataset and leaderboard creation (MS-COCO, GLUE, and ImageNet, for example). These have tended to focus and help research, and to remain useful for a long time, as long as they are chosen wisely.
Predicting and extrapolating human preferences is a task which is part of nearly every AI Alignment strategy. Yet we have few datasets for it; the only ones I found are https://github.com/iterative/aita_dataset and https://www.moralmachine.net/
So this hypothetical ML Engineering approach to alignment might have achieved some simple wins like that.
EDIT: Something like this was just released: Aligning AI With Shared Human Values.
a lot of AI safety work increasingly looks like it’d help make a hypothetical kind of AI safe
I think there are many reasons a researcher might still prioritize non-prosaic AI safety work. Off the top of my head:
You think prosaic AI safety is so doomed that you’re optimizing for worlds in which AGI takes a long time, even if you think it’s probably soon.
There’s a skillset gap or other such cost, such that reorienting would decrease your productivity by some factor (say, 0.6) for an extended period of time. The switch only becomes worth it in expectation once you’ve become sufficiently confident AGI will be prosaic (see the sketch after this list).
Disagreement about prosaic AGI probabilities.
Lack of clear opportunities to contribute to prosaic AGI safety / shovel-ready projects (the severity of this depends on how agentic the researcher is).
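On the second point, here is a toy way to make the switch calculus concrete. Only the 0.6 productivity factor comes from the list above; the rest of the model (work that pays off only in prosaic or only in non-prosaic worlds) is an illustrative assumption, not anyone's actual position:

```python
# Toy breakeven model for switching from non-prosaic to prosaic AI safety work.
# Assumptions (not from the thread): your current work only pays off in
# non-prosaic worlds, the new work only pays off in prosaic worlds, and
# switching costs you the stated productivity factor for the relevant period.
productivity_after_switch = 0.6  # factor mentioned in the list above

def expected_value(p_prosaic: float, switch: bool) -> float:
    if switch:
        return p_prosaic * productivity_after_switch
    return (1 - p_prosaic) * 1.0

# Breakeven where p * 0.6 = (1 - p), i.e. p = 1 / 1.6 = 0.625.
for p in (0.5, 0.625, 0.8):
    print(p, expected_value(p, switch=True) > expected_value(p, switch=False))
# 0.5 -> False, 0.625 -> False (exact breakeven), 0.8 -> True
```

Under this crude model you would not switch until you were over roughly 62% confident that AGI will be prosaic, which matches the intuition in the list.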
Entirely seriously: I can never decide whether the drunkard’s search is a parable about the wisdom of looking under the streetlight, or the wisdom of hunting around in the dark.
I think the drunkard’s search is about the wisdom of improving your tools. Sure, spend some time out looking, but let’s spend a lot of time making better streetlights and flashlights, etc.
In the Gwern quote, what does “Even the dates are more or less correct!” refer to? Which dates were predicted for what?
Look at, for example, Moravec. His extrapolation assumes that supercomputers will not be made available for AI work until AI work has already been proven successful (correct), and that AI will have to wait for hardware to become so powerful that even a grad student can afford it for ~$1k (also correct; see AlexNet). Extrapolating from ~1998, he estimates:
At the present rate, computers suitable for humanlike robots will appear in the 2020s.
Guess what year today is.
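For context, here is a rough reconstruction of that style of extrapolation. The 1998 starting point, the doubling time, and the ~100 million MIPS target are illustrative ballpark figures in the spirit of Moravec's essay, not his exact numbers:

```python
import math

# Rough reconstruction of a Moravec-style extrapolation. All figures here are
# illustrative assumptions: ~1,000 MIPS per $1,000 of hardware around 1998,
# price-performance doubling roughly every 18 months, and ~100 million MIPS
# as the "humanlike" target.
start_year = 1998
start_mips_per_1k_dollars = 1_000
target_mips = 100_000_000
doubling_time_years = 1.5

doublings = math.log2(target_mips / start_mips_per_1k_dollars)  # ~16.6 doublings
arrival_year = start_year + doublings * doubling_time_years
print(round(arrival_year))  # ~2023, i.e. "the 2020s"
```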
Last year it only took Google Brain half a year to make a Transformer 8x larger than GPT-2 (the T5). And they concluded that model size is a key component of progress. So I won’t be surprised if they release something with a trillion parameters this year.
Thinking about this a bit more, do you have any insight on Tesla? I can believe that it’s outside DM and GB’s culture to run with the scaling hypothesis, but watching Karpathy’s presentations (which I think is the only public information on their AI program?) I get the sense they’re well beyond $10m/run by now. Considering that self-driving is still not there—and once upon a time I’d have expected driving to be easier than Harry Potter parodies—it suggests that language is special in some way. Information density? Rich, diff’able reward signal?
Self-driving is very unforgiving of mistakes. Text generation, on the other hand, doesn’t have similar failure conditions, and bad content can easily be fixed.
Tesla publishes nothing and I only know a little from Karpathy’s occasional talks, which are as much about PR (to keep Tesla owners happy and investing in FSD, presumably) & recruiting as anything else. But their approach seems heavily focused on supervised learning in CNNs and active learning using their fleet to collect new images, and to have nothing to do with AGI plans. They don’t seem to even be using DRL much. It is extremely unlikely that Tesla is going to be relevant to AGI or progress in the field in general given their secrecy and domain-specific work. (I’m not sure how well they’re doing even at self-driving cars—I keep reading about people dying when their Tesla runs into a stationary object on a highway in the middle of the day, which you’d think they’d’ve solved by now...)
I’m pretty sure I remember hearing they use unsupervised learning to form their 3D model of their local environment, and that’s the most important part, no?
Curious if you have updated on this at all, given AI Day announcements?
They still running into stationary objects? The hardware is cool, sure, but unclear how much good it’s doing them...
I believe that is referring to the baseline driver assistance system, and not the advanced “full self driving” one (that has to be paid for separately). Though it’s hard to tell that level of detail from a mainstream media report.
hey man wanna watch this language model drive my car
I just realized with a start that this is _absolutely_ going to happen. We are going to, in the not-too-distant future, see a GPT-x (or similar) ported to a Tesla and driving it.
It frustrates me that there are not enough people IRL I can excitedly talk about how big of a deal this is.
Can you explain why GPT-x would be well-suited to that modality?
Presumably, because with a big-enough X, we can generate text descriptions of scenes from cameras and feed them in to get driving output more easily than the seemingly fairly slow process to directly train a self-driving system that is safe. And if GPT-X is effectively magic, that’s enough.
I’m not sure I buy it, though. I think that once people agree that scaling just works, we’ll end up scaling the NNs used for self driving instead, and just feed them much more training data.
There might be some architectures that are more scalable than others. As far as I understand, the present models for self-driving mostly have a lot of hardcoded elements. That might make them more complicated to scale.
Agreed, but I suspect that replacing those hard-coded elements will get easier over time as well.
Andrej Karpathy talks about exactly that in a recent presentation: https://youtu.be/hx7BXih7zx8?t=1118
My hypothesis: Language models work by being huge. Tesla can’t use huge models because they are limited by the size of the computers on their cars. They could make bigger computers, but then that would cost too much per car and drain the battery too much (e.g. a 10x bigger computer would cut dozens of miles off the range and also add $9,000 to the car price, at least.)
[EDIT: oops, I thought you were talking about the direct power consumption of the computation, not the extra hardware weight. My bad.]
It’s not about the power consumption.
The air conditioner in your car uses 3 kW, and GPT-3 takes 0.4 kWh for 100 pages of output—thus a dedicated computer with the air conditioner’s power budget could produce ~700 pages per hour, going substantially faster than AI Dungeon (literally and metaphorically). So a model as large as GPT-3 could run on the electricity of a car.
The hardware would be more expensive, of course. But that’s different.
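To spell out the arithmetic, taking the two figures above (3 kW for the air conditioner, ~0.4 kWh per 100 pages of GPT-3 output) at face value:

```python
# Back-of-envelope check using only the figures quoted above.
ac_power_kw = 3.0               # stated air-conditioner draw, kW
kwh_per_100_pages = 0.4         # stated GPT-3 inference cost, kWh per 100 pages

pages_per_kwh = 100 / kwh_per_100_pages       # 250 pages per kWh
pages_per_hour = ac_power_kw * pages_per_kwh  # pages/hour on the AC's power budget

print(pages_per_hour)  # 750.0, the same ballpark as the ~700 pages/hour above
```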
Huh, thanks—I hadn’t run the numbers myself, so this is a good wake-up call for me. I was going off what Elon said. (He said multiple times that power efficiency was an important design constraint on their hardware because otherwise it would reduce the range of the car too much.) So now I’m just confused. Maybe Elon had the hardware weight in mind, but still...
Maybe the real problem is just that it would add too much to the price of the car?
Yes. GPU/ASICs in a car will have to sit idle almost all the time, so the costs of running a big model on them will be much higher than in the cloud.
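A toy illustration of that utilization point; every number here is a hypothetical placeholder, not a Tesla or cloud figure:

```python
# Hypothetical numbers purely to illustrate the utilization argument.
hardware_cost = 5_000      # assumed cost of an in-car accelerator able to run a big model
lifetime_years = 5         # assumed useful life of the hardware
hours_per_year = 365 * 24

car_utilization = 0.04     # ~1 hour of driving per day (assumed)
cloud_utilization = 0.60   # typical-ish datacenter utilization (assumed)

def cost_per_busy_hour(utilization: float) -> float:
    busy_hours = lifetime_years * hours_per_year * utilization
    return hardware_cost / busy_hours

print(round(cost_per_busy_hour(car_utilization), 2))    # ~2.85 dollars per hour of actual use
print(round(cost_per_busy_hour(cloud_utilization), 2))  # ~0.19 dollars per hour of actual use
```

The same hardware dollar buys roughly 15x fewer useful compute-hours when it rides around in a mostly parked car.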
Re hardware limit: flagging the implicit assumption here that network speeds are spotty/unreliable enough that you can’t or are unwilling to safely do hybrid on-device/cloud processing for the important parts of self-driving cars.
(FWIW I think the assumption is probably correct).
After 2 years, any updates on your opinion of DM, GB and FAIR’s scaling stance? Would you consider any of them fully “scale-pilled”?
Both DM/GB have moved enormously towards scaling since May 2020, and there are a number of enthusiastic scaling proponents inside both in addition to the obvious output of things like Chinchilla or PaLM. (Good for them, not that I really expected otherwise given that stuff just kept happening and happening after GPT-3.) This happened fairly quickly for DM (given when Gopher was apparently started), and maybe somewhat slower for GB despite Dean’s history & enthusiasm. (I still think MoEs were a distraction.) I don’t know enough about the internal dynamics to say if they are fully scale-pilled, but scaling has worked so well, even in crazy applications like dropping language models into robotics planning (SayCan), that critics are in pell-mell retreat and people are getting away with publishing manifestos like “reward is enough” or openly saying on Twitter “scaling is all you need”. I expect that top-down organizational constraints are probably now a bigger deal: I’m far from the first person to note that DM/GB seem unable to ship (publicly visible) products and researchers keep fleeing for startups where they can be more like OA in actually shipping.
FAIR puzzles me because FAIR researchers are certainly not stupid or blind, FB continues to make large investments in hardware like their new GPU cluster, and the most interesting FAIR research is strongly scaling-flavored, like their unsupervised work on audio/video, so you’d think they’d’ve caught up. But FB is also experiencing heavy weather, Zuckerberg seems to be aiming it all at ‘metaverse’ applications (which lead away from DRL), and FAIR has recently been somehow broken up & everyone reorganized (?). Meanwhile, of course, Yann LeCun continues saying things like ‘general intelligence doesn’t exist’, scoffing at scaling, and proposing elaborately engineered modular neuroscience-based AGI paradigms. So I guess it looks like they’re grudgingly backing their way into scaling work simply because they are forced to if they want any results worth publishing or systems which can meet Zuckerberg’s Five Year Plans, but one could not call FAIR scaling-pilled. Scaling enthusiasts there probably feel chilled about proposing any explicit scaling research or mentioning the reasons it is important, which will shut down anything daring.
If you extrapolated those straight lines further, doesn’t it mean that even small businesses will be able to afford training their own quadrillion-parameter models just a few years after Google?
What makes you think there will be small businesses at that point, or that anyone would care what these hypothetical small businesses may or may not be doing?
So the God of Straight Lines dissolves into a puff of smoke at just the right time to bring about AI doom? Seems awfully convenient.
Thanks for this, I’ll be sharing it on /r/slatestarcodex and Hacker News (rationalist discords too if it comes up).
I’m not sure it’s good for this comment to get a lot of attention? OpenAI is more altruism-oriented than a typical AI research group, and this is essentially a persuasive essay for why other groups should compete with them.
‘Why the hell has our competitor got this transformative capability that we don’t?’ is not a hard thought to have, especially among tech executives. I would be very surprised if there wasn’t a running battle over long-term perspectives on AI in the C-suite of both Google Brain and DeepMind.
If you do want to think along these lines though, the bigger question for me is why OpenAI released the API now, and gave concrete warning of the transformative capabilities they intend to deploy in six? twelve? months’ time. ‘Why the hell has our competitor got this transformative capability that we don’t?’ is not a hard thought now, but that’s largely because the API was a piece of compelling evidence thrust in all of our faces.
Maybe they didn’t expect it to latch onto the dev-community consciousness like it has, or for it to be quite as compelling a piece of evidence as it’s turned out to be. Maybe it just seemed like a cool thing to do and in line with their culture. Maybe it’s an investor demo for how things will be monetised in future, which will enable the $10bn punt they need to keep abreast of Google.
I think the fact that it’s not a hard thought to have is not too much evidence about whether other orgs will change approach. It takes a lot to turn the ship.
Consider how easy it would be to have the thought, “Electric cars are the future, we should switch to making electric cars,” at any time in the last 15 years. And yet, look at how slow traditional automakers have been to switch.
Indeed. No one seriously doubted that the future was not gas, but it was always at a sufficiently safe remove that they didn’t have to do anything themselves beyond a minor side R&D program, because there was no fire alarm. (“God, grant me [electrification] and [the scaling hypothesis] - but not yet!”)
It has already got some spread. Michael Nielsen shared it on Twitter (126 likes and 29 RTs as at writing).
Is it more than 30% likely that in the short term (say, 5 years), Google isn’t wrong? If you applied massive scale to the AI algorithms of 1997, you would get better performance, but would your result be economically useful? Is it possible we’re in a similar situation today, where the real-world applications of AI are already good enough and additional performance is worth less than the money spent on extra compute? (Self-driving cars are perhaps the closest example: clearly they would be economically valuable, but what if the compute to train them cost 20 billion US dollars? Your competitors will catch up eventually; could you make enough profit in the interim to pay for that compute?)
I’d say it’s at least 30% likely that’s the case! But if you believe that, you’d be pants-on-head loony not to drop a billion on the ‘residual’ 70% chance that you’ll be first to market on a world-changing trillion-dollar technology. VCs would sacrifice their firstborn for that kind of deal.
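A crude expected-value sketch of that bet, using the 30%/70% split and the $1B and trillion-dollar figures from the two comments above, plus an assumed capture fraction:

```python
# Crude EV calculation for the "drop a billion on the residual 70%" argument.
p_scaling_pays_off = 0.7    # the 70% from the exchange above
bet_cost = 1e9              # the hypothetical $1B outlay
market_value = 1e12         # "world-changing trillion-dollar technology"
capture_fraction = 0.05     # assumed share a first mover might capture

expected_payoff = p_scaling_pays_off * market_value * capture_fraction
print(expected_payoff / bet_cost)  # 35.0 -- 35x the outlay in expectation
```

Even with a much smaller assumed capture fraction, the expected payoff comfortably exceeds the outlay, which is the point being made here.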