This is not particularly unexpected if you believed in the scaling hypothesis.
Cicero is not particularly unexpected to me, but my expectations here are not driven by the scaling hypothesis. The result achieved here was not achieved by adding more layers to a single AI engine, it was achieved by human designers who assembled several specialised AI engines by hand.
So I do not view this result as one that adds particularly strong evidence to the scaling hypothesis. I could equally well make the case that it adds more evidence to the alternative hypothesis, put forward by people like Gary Marcus, that scaling as the sole technique has run out of steam, and that the prevailing ML research paradigm needs to shift to a more hybrid approach of combining models. (The prevailing applied AI paradigm has of course always been that you usually need to combine models.)
Another way to explain my lack of surprise would be to say that Cicero is just a super-human board game playing engine that has been equipped with a voice synthesizer. But I might be downplaying the achievement here.
this is among the worser things you could be researching [...] There are… uh, not many realistic, beneficial applications for this work.
I have not read any of the authors’ or Meta’s messaging around this, so I am not sure if they make that point, but the sub-components of Cicero that somewhat competently and ‘honestly’ explain its currently intended moves seem to have beneficial applications too, if they were combined with an engine other than a game engine that absolutely wants to win and that can change its mind later about which moves to play. This is a dual-use technology with both good and bad possible uses.
That being said, I agree that this is yet another regulatory wake-up call, if we needed one. As a group, AI researchers will not conveniently regulate themselves: they will move forward in creating more advanced dual-use technology, while openly acknowledging (see annex A.3 of the paper) that this technology might be used for both good and bad purposes downstream. So it is up to the rest of the world to make sure that these downstream uses are regulated.
Cicero is not particularly unexpected to me, but my expectations here are not driven by the scaling hypothesis. The result achieved here was not achieved by adding more layers to a single AI engine, it was achieved by human designers who assembled several specialised AI engines by hand. So I do not view this result as one that adds particularly strong evidence to the scaling hypothesis. I could equally well make the case that it adds more evidence to the alternative hypothesis.
Well, first I would point out that I predicted it would happen soon, while literally writing “The Scaling Hypothesis”, and the actual researchers involved did not, predicting it would take at least a decade; if it is not predicted by the scaling hypothesis, but is in fact predicted better by neurosymbolic views denying scaling, it’s curious that I was the one who said that and not Gary Marcus.
Second, it’s hard to see how a success using models which would have been among the largest NNs ever trained just 3 years ago, finetuned using 128-256 GPUs on a dataset which would be among the largest compute-clusters & datasets ever used in DL also up until a few years ago, requiring 80 CPU-cores & 8 GPUs at runtime (which likewise would have been exceptional until a few years ago), is really a vindication for hand-engineered neurosymbolic approaches emphasizing the importance of symbolic reasoning and complicated modular approaches, as opposed to approaches emphasizing the importance of increasing data & compute for solving previously completely intractable problems. Nor do the results indicate that the modules are really all that critical. I do not think CICERO is going to need even an AlphaZero level of breakthrough to remove various parts, when eg. the elaborate filter system representing much of the ‘neurosymbolic’ part of the system is only removing 15.5% of all messages.
and that the prevailing ML research paradigm needs to shift to a more hybrid approach of combining models. (The prevailing applied AI paradigm has of course always been that you usually need to combine models.)
I regard this sort of argument as illegitimate because it is a fully-general counterargument which allows strategy-stealing and explains everything while predicting nothing: no matter how successful scaling is, it will always be possible (short of a system being strongly superhuman) to take a system and improve it in some way by injecting hand-engineering or borrowing other systems and lashing them together. This is fine for purely pragmatic purposes, but the problem is when one then claims it is a victory for ‘neurosymbolic programming’ or whatever catchphrase is in vogue. Somehow, the scaled-up system is never a true scotsman of scaling, but the modifications or hybrids are always true scotsmen of neurosymbolics… Those systems will then be surpassed a few years later by larger (and often simpler) systems, thereby refuting the previous argument—only for the same thing to happen again, ratcheting upwards. The question is not whether some hand-engineering can temporarily leapfrog the Bitter Lesson (it is quite explicitly acknowledged by Sutton and scaling proponents that you can gain constant factors by doing stuff which is not scaling), but whether it will progress faster. CICERO would only be a victory for neurosymbolic approaches if it had used nothing you couldn’t’ve done in, say, 2015.
This seems to be what you are doing here: you handwave away the use of BART and extremely CPU/GPU-intensive search as not a victory for scaling (why? it was not a foregone conclusion that a BART would be any good, nor that the search could be made to work, without far more extensive engineering of theory of mind and symbolic reasoning), and then only count the lashing-together of a few scaled-up components as a victory for neurosymbolic approaches instead.
Thanks, that does a lot to clarify your viewpoints. Your reply calls for some further remarks.
I’ll start off by saying that I value your technology tracking writing highly because you are one of those blogging technology trackers who is able to look beyond the press releases and beyond the hype. But I have the same high opinion of the writings of Gary Marcus.
This seems to be what you are doing here: you handwave away the use of BART and extremely CPU/GPU-intensive search as not a victory for scaling
For the record: I am not trying to handwave the progress-via-hybrid-approaches hypothesis of Marcus into correctness. The observations I am making here are much more in the ‘explains everything while predicting nothing’ department.
I am observing that both your progress-via-scaling hypothesis and the progress-via-hybrid-approaches hypothesis of Marcus can be made to explain the underlying Cicero facts here. I do not see this case as a clear victory for either one of these hypotheses. What we have here is an AI design that cleverly combines multiple components while also being impressive in the scaling department.
Technology tracking is difficult, especially about the future.
The following observation may get to the core of how I may be perceiving the elephant differently. I interpret an innovation like GANs not as a triumph of scaling, but as a triumph of cleverly putting two components together. I see GANs as an innovation that directly contradicts the message of the Bitter Lesson paradigm, one that is much more in the spirit of what Marcus proposes.
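To make concrete what I mean by ‘cleverly putting two components together’, here is a minimal toy sketch of the GAN idea (illustrative PyTorch with made-up layer sizes and a synthetic ‘real’ dataset, not any particular published GAN): a generator and a discriminator are trained against each other, and neither component is interesting on its own.

```python
# Toy sketch of the two-component GAN setup: all sizes and data are illustrative.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=128):
    # stand-in "real" data: points on a noisy circle
    theta = torch.rand(n, 1) * 6.2831853
    return torch.cat([theta.cos(), theta.sin()], dim=1) + 0.05 * torch.randn(n, 2)

for step in range(1000):
    x_real = real_batch()
    z = torch.randn(x_real.size(0), latent_dim)
    x_fake = G(z)

    # component 1, the discriminator: learn to tell real from generated samples
    d_loss = bce(D(x_real), torch.ones(x_real.size(0), 1)) + \
             bce(D(x_fake.detach()), torch.zeros(x_real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # component 2, the generator: learn to fool the discriminator
    g_loss = bce(D(x_fake), torch.ones(x_real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```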
Here is what I find particularly interesting in Marcus. In pieces like The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence, Marcus is advancing the hypothesis that the academic Bitter-Lesson AI field is in a technology overhang: these people could make a lot of progress on their benchmarks very quickly, faster than mere neural net scaling will allow, if they were to ignore the Bitter Lesson paradigm and embrace a hybrid approach where the toolbox is much bigger than general-purpose learning, ever-larger training sets, and more and more compute. Sounds somewhat plausible to me.
If you put a medium or high probability on this overhang hypothesis of Marcus, then you are in a world where very rapid AI progress might happen, levels of AI progress much faster than those predicted by the progress curves produced by Bitter Lesson AI research.
You seem to be advancing an alternative hypothesis, one where advances made by clever hybrid approaches will always be replicated a few years later by using a Bitter Lesson style monolithic deep neural net trained with a massive dataset. This would conveniently restore the validity of extrapolating Bitter Lesson driven progress curves, because you can use them as an upper bound. We’ll see.
I am currently not primarily in the business of technology tracking; I am an AI safety researcher working on safety solutions and regulation. With that hat on, I will say the following.
Bitter-lesson style systems consisting of a single deep neural net, especially if these systems are also model-free RL agents, have huge disadvantages in the robustness, testability, and interpretability departments. These disadvantages are endlessly talked about on this web site of course. By contrast, systems built out of separate components with legible interfaces between them are usually much more robust, interpretable and testable. This is much less often mentioned here.
In safety engineering for any high-risk application, I would usually prefer to work with an AI system built out of many legible sub-components, not with some deep neural net that happens to perform equally well or better on an in-training-distribution benchmark. So I would like to see more academic AI research that ignores the Bitter Lesson paradigm, and the paradigm that all AI research must be ML research. I am pleased to say that a lot of academic and applied AI researchers, at least in the part of the world where I live, never got on board with these paradigms in the first place. To find their work, you have to look beyond conferences like NeurIPS.
I interpret an innovation like GANs not as a triumph of scaling, but as a triumph of cleverly putting two components together. I see GANs as an innovation that directly contradicts the message of the Bitter Lesson paradigm, one that is much more in the spirit of what Marcus proposes.
That’s an interesting example to pick: watching GAN research firsthand as it developed from scratch in 2014 to bust ~2020 played a considerable role in my thinking about the Bitter Lesson & Scaling Hypothesis & ‘blessings of scale’, so I regard GANs differently than you do, and have opinions on the subject.
First, GANs may not be new. Even if you do not entirely buy Schmidhuber’s claim that his predictability-minimization arch 30 years ago or whatever is identical to GANs (and better where not identical), there’s also that 2009 blog post or whatever by someone reinventing it. And given the distribution of multiple-discoveries, that implies there’s probably another 2 or 3 reinventions out there somewhere. If what matters about GANs is the ‘innovation’ of the idea, and not scaling, why were all the inventions so sterile and yielded so little until so late? (And why have such big innovations been so thoroughly superseded in their stronghold of image generation by approaches which look nothing at all like GANs and appear to owe little to them, like diffusion models?)
Second, GANs in their successful period were clearly a triumph of compute. All the biggest successes of GANs use a lot of compute historically: BigGAN is trained on literally a TPUv3-512 pod. None of the interesting success stories of GANs look like ‘train on a CPU with levels of compute available to a GOFAI researcher 20 years ago because scaling doesn’t matter’. Cases like StyleGAN where you seem to get good results from ‘just’ 4 GPU-months of compute (itself a pretty big amount) turn out to scale infamously badly to more complex datasets, and to be the product of literally hundreds or thousands of fullscale runs, and deadends. (The Karras group sometimes reports ‘total compute used during R&D’ in their papers, and it’s an impressively large multiple.) For all the efficiency of StyleGAN on faces, no one is ever going to make it work well on LAION-4b etc. (At Tensorfork, our StyleGAN2-ext worked mostly by throwing away as much of the original StyleGAN2 hand engineering as possible, which held it back, in favor of increasing n/param/compute in tandem. The later distilled ImageNet StyleGAN approach relies on simplifying the problem drastically by throwing out hard datapoints, and still only delivers bad results on ImageNet.) And that’s the most impressive ‘innovation’ in GAN architecture. This is further buttressed by the critical meta-science observations in GAN research that when you compared GAN archs on a fair basis, in a single codebase with equal hyperparam tuning etc, fully reproducibly, you typically found… arch made little difference to the top scores, and mostly just changed the variance of runs. (This parallels other field-survey replication efforts like in embedding research: results get better over time, which researchers claim reflect the sophistication of their architectures… and the gains disappear when you control for compute/n/param. The results were better—but just because later researchers used moar dakka, and either didn’t realize that all the fruits of their hard work were actually due to harder-working GPUs or quietly omitted that observation, in classic academic fashion. I feel like every time I get to ask a researcher who did something important in private how it really happened, I find out that the story told in the paper is basically a fairy tale for small children and feel like the angry math student who complained about Gauss, “he makes [his mathematics] like a fox, wiping out the traces in the sand with his tail.”)
Third, to further emphasize the triumph of GANs due to compute, their fall appears to be due to compute too. The obsolescence of GANs is, as far as I can tell, due solely to historical happenstance. BigGAN never hit a ceiling; no one calculated scaling laws on BigGAN FIDs and showed it was doomed; BigGAN doesn’t get fatally less stable with scale; Brock demonstrated it gets more stable and scales to at least n = 0.3b without a problem. What happened was simply that the tiny handful of people who could and would do serious GAN scaling happened to leave the field (eg Brock) or run into non-fundamental problems which killed their efforts (eg Tensorfork), while the people who did continue DL scaling happened to not be GAN people but specialized in alternatives like autoregressive, VAE, and then diffusion models. So they scaled up all of those successfully, while no one tried with GANs, and now the ‘fact’ that ‘GANs don’t work’ is just a thing That Is Known: “It is well known GANs are unstable and cannot be trained at scale, which is why we use diffusion models instead...” (Saw a version of that in a paper literally yesterday.) There is never any relevant evidence given for that, just bare assertion or irrelevant cites. I have an essay I should finish explaining why GANs should probably be tried again.
Fourth, we can observe that this is not unique to GANs. Right now, it seems like pretty much any image arch you might care to try works. NeRF? A deep VAE? An autoregressive Transformer on pixels? An autoregressive on VAE tokens? A diffusion on pixels? A diffusion on latents? An autoencoder? Yeah sure, they all work, even (perhaps especially) the ones which don’t work on small datasets/compute. When I look at Imagen samples vs Parti samples, say, I can’t tell which one is the diffusion model and which one is the autoregressive Transformer. Conceptually, they have about as much in common as a fungus with a cockroach, but… they both work pretty darn well anyway. What do they share in common, besides ‘working’? Compute, data, and parameters. Lots and lots and lots of those. (Similarly for video generation.) I predict that if GANs get any real scaling-up effort put into improving them and making them run at similar scale, we will find that GANs work well too. (And will sample a heckuva lot faster than diffusion or AR models...)
‘Innovative’ archs are just not that important. An emphasis on arch, much less apologizing for ‘neurosymbolic’ approaches, struggles to explain this history. Meanwhile, an emphasis on scaling can cleanly explain why GANs succeeded at Goodfellow’s reinvention, why GANs fell out of fashion, why their rivals succeeded, and may yet predict why they get revived.
You seem to be advancing an alternative hypothesis, one where advances made by clever hybrid approaches will always be replicated a few years later by using a Bitter Lesson style monolithic deep neural net trained with a massive dataset. This would conveniently restore the validity of extrapolating Bitter Lesson driven progress curves, because you can use them as an upper bound. We’ll see.
We don’t need to see. Just look at the past, which has already happened.
Marcus is advancing the hypothesis that the academic Bitter-Lesson AI field is in a technology overhang: these people could make a lot of progress on their benchmarks very quickly, faster than mere neural net scaling will allow
Yeah, he’s wrong about that because it’d only be true if academic markets were maximizing for long-term scaling (they don’t) rather than greedily myopically optimizing for elaborate architectures that grind out an improvement right now. The low-hanging fruit has already been plucked and everyone is jostling to grab a slightly higher-hanging fruit they can see, while the long-term scaling work gets ignored. This is why various efforts to take a breakthrough of scaling, like AlphaGo or GPT-3, and hand-engineer it, yield moderate results but nothing mindblowing like the original.
If you put a medium or high probability on this overhang hypothesis of Marcus, then you are in a world where very rapid AI progress might happen, levels of AI progress much faster than those predicted by the progress curves produced by Bitter Lesson AI research.
No. You’d predict that there’d potentially be more short-term progress (as researchers pluck the fruits exposed by the latest scaling bout) but then less long-term (as scaling stops, because everyone is off plucking but with ever diminishing returns).
Bitter-lesson style systems consisting of a single deep neural net, especially if these systems are also model-free RL agents, have huge disadvantages in the robustness, testability, and interpretability departments. These disadvantages are endlessly talked about on this web site of course. By contrast, systems built out of separate components with legible interfaces between them are usually much more robust, interpretable and testable. This is much less often mentioned here.
Here too I disagree. What we see with scaled up systems is increasing interpretability of components and fewer issues with things like polysemanticity or reliance on brittle shortcuts, increasing generalizability and solving of edge cases (almost by definition), increasing capabilities allowing for meaningful tests at all, and reasons to think that things like adversarial robustness will come ‘for free’ with scale (see isoperimetry). Just like with the Bitter Lesson for capability, at any fixed level of scale or capability, you can always bolt on more shackles and gadgets and gewgaws to get a ‘safer’ system, but you then run an ever-increasing risk of obsolescence and futility, because a later scaled-up system may render your work completely irrelevant both in terms of economic deployment and in terms of what techniques you need to understand it and what safety properties you can attain. (Any system so simple it can be interpreted in a ‘symbolic’ classical way may inherently be too simple to solve any real problems—if the solutions to those problems were that simple, why weren’t they created symbolically before bringing DL into the picture...? ‘Bias/variance’ applies to safety just as much as anything else: a system which is too small and too dumb to genuinely understand things can be a lot more dangerous than a scaled-up system which does understand them but is shaky on how much it cares. Or more pointedly: hybrid systems which do not solve the problem, which is ‘all of them’ in the long run, cannot have any safety properties since they do not work.)
Related to this, from the blog post What does Meta AI’s Diplomacy-winning Cicero Mean for AI?:
The same day that Cicero was announced, there was a friendly debate at the AACL conference on the topic “Is there more to NLP [natural language processing] than Deep Learning,” with four distinguished researchers trained some decades ago arguing the affirmative and four brilliant young researchers more recently trained arguing the negative. Cicero is perhaps a reminder that there is indeed a lot more to natural language processing than deep learning.
I am originally a CS researcher trained several decades ago, actually in the middle of an AI winter. That might explain our different viewpoints here. I also have a background in industrial research and applied AI, which has given me a lot of insight into the vast array of problems that academic research refuses to solve for you. More long-form thoughts about this are in my Demanding and Designing Aligned Cognitive Architectures.
From where I am standing, the scaling hype is wasting a lot of the minds of the younger generation on the problem of improving ML benchmark scores under the unrealistic assumption that ML will have infinite clean training data. This situation does not fill me with as much existential dread as it does some other people on this forum, but anyway.
Related to our discussion earlier, I see that Marcus and Davis just published a blog post: What does Meta AI’s Diplomacy-winning Cicero Mean for AI?. In it, they argue, as you and I both would expect, that Cicero is a neurosymbolic system, and that its design achieves its results by several clever things beyond using more compute and more data alone. I expect you would disagree with their analysis.
Thanks for the very detailed description of your view on GAN history and sociology—very interesting.
You focus on the history of benchmark progress after DL based GANs were introduced as a new method for driving that progress. The point I was trying to make is about a different moment in history: I am perceiving that the original introduction of DL based GANs was a clear discontinuity.
First, GANs may not be new.
If you search wide enough for similar things, then no idea that works is really new. Neural nets were also not new when the deep learning revolution started.
I think your main thesis here is that academic researcher creativity and cleverness, their ability to come up with unexpected architecture improvements, has nothing to do with driving the pace of AI progress forward:
This parallels other field-survey replication efforts like in embedding research: results get better over time, which researchers claim reflect the sophistication of their architectures… and the gains disappear when you control for compute/n/param.
Sorry, but you cannot use a simple control-for-compute/n/param statistics approach to determine the truth of any hypothesis of how clever researchers really were in coming up with innovations to keep an observed scaling curve going. For all you know, these curves are what they are because everybody has been deeply clever at the architecture evolution/revolution level, or at the hyperparameter tuning level. But maybe I am mainly skeptical of your statistical conclusions here because you are leaving things out of the short description of the statistical analysis you refer to. So if you can give me a pointer to a more detailed statistical writeup, one that tries to control for cleverness too, please do.
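To spell out the kind of analysis I am skeptical of, here is a toy sketch of what I understand a control-for-compute/n/param regression to be (entirely my own construction on simulated numbers, not the survey analysis you refer to):

```python
# Toy illustration: regress benchmark scores on log-compute, log-params and an
# architecture indicator, and ask whether the architecture coefficient survives
# once compute/params are controlled for. All numbers are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 200
arch_b = rng.integers(0, 2, n)                        # 0 = "old" arch, 1 = "new" arch
log_compute = rng.uniform(15, 20, n) + 3.0 * arch_b   # newer archs happen to use more compute
log_params = 0.5 * log_compute + rng.normal(0, 0.5, n)

# Simulated world in which only compute and params matter, not the architecture:
score = 2.0 * log_compute + 1.0 * log_params + rng.normal(0, 1.0, n)

# A naive comparison makes the "new" architecture look much better ...
print("naive score gap:", score[arch_b == 1].mean() - score[arch_b == 0].mean())

# ... but the architecture coefficient vanishes once compute/params are controlled for.
X = np.column_stack([np.ones(n), log_compute, log_params, arch_b])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
print("architecture coefficient after controlling:", coef[3])
```

My worry is that nothing in such a regression distinguishes ‘architecture did not matter’ from ‘clever architecture work is what kept the compute scaling productive in the first place’.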
That being said, like you I perceive, in a more anecdotal form, that true architectural innovation is absent from a lot of academic ML work, or at least the academic ML work appearing in the so-called ‘top’ AI conferences that this forum often talks about. I mostly attribute that to such academic ML only focusing on a very limited set of big data / Bitter Lesson inspired benchmarks, benchmarks which are not all that relevant to many types of AI improvements one would like to see in the real world. In industry, where one often needs to solve real-world problems beyond those which are fashionable in academia, I have seen a lot more creativity in architectural innovations than in the typical ML benchmark improvement paper. I see a lot of that industry-type creativity in the Cicero paper too.
You mention that your compute-and-data-is-all-that-drives-progress opinion has been informed by looking at things like GANs for image generation and embedding research.
The progress in these sub-fields differs from the type of AI technology progress that I would like to see much more of, as an AI safety and alignment researcher. This also implies that I have a different opinion on what drives or should drive AI technology progress.
One benchmark that interests me is an AI out-of-distribution robustness benchmark where the model training happens on sample data drawn from a first distribution, and the model evaluation happens on sample data drawn from a different second distribution, only connected to the first by having the two processes that generate them share some deeper patterns like the laws of physics, or broad parameters of human morality.
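As a toy sketch of the shape of such a benchmark (my own illustration, with a made-up ‘shared law’ standing in for something like the laws of physics):

```python
# Toy out-of-distribution benchmark: train and test sets come from different
# distributions that share a deeper generating law (here, the same underlying
# function evaluated over disjoint input ranges). Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def law(x):
    # the shared "deeper pattern", e.g. a fixed physical law
    return np.sin(x) + 0.1 * x

# First distribution: training inputs drawn from [0, 4]
x_train = rng.uniform(0, 4, 500)
y_train = law(x_train) + rng.normal(0, 0.05, x_train.shape)

# Second distribution: evaluation inputs drawn from [6, 10]
x_test = rng.uniform(6, 10, 500)
y_test = law(x_test) + rng.normal(0, 0.05, x_test.shape)

# A model fitted only to the first distribution (a degree-5 polynomial, standing
# in for an arbitrary learner with a null prior) ...
model = np.polynomial.Polynomial.fit(x_train, y_train, deg=5)

# ... is scored on the second distribution; extrapolation usually fails badly
# unless the learner encodes something about the shared law.
mse_in = np.mean((model(x_train) - y_train) ** 2)
mse_ood = np.mean((model(x_test) - y_test) ** 2)
print(f"in-distribution MSE: {mse_in:.4f}, out-of-distribution MSE: {mse_ood:.4f}")
```

The interesting question for such a benchmark is what it takes for a learner to close the gap between those two numbers.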
This kind of out-of-distribution robustness problem is one of the themes of Marcus too, for the physics part at least. One of the key arguments for the hybrid/neurosymbolic approach is that you will need to (symbolically) encode some priors about these deeper patterns into the AI, if you ever want it to perform well on such out-of-distribution benchmarks.
Another argument for the neurosymbolic approach is that you often simply do not have enough training data to get your model robust enough if you start from a null prior, so you will need to compensate for this by adding some priors. Having deeply polluted training data also means you will need to add priors, or do lots of other tricks, to get the model you really want. There is an intriguing possibility that DNN based transfer learning might contribute to the type of benchmarks I am interested in. This branch of research is usually framed in a way where people do not picture the second small training data set being used in the transfer learning run as a prior, but on a deeper level it is definitely a prior.
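As a minimal sketch of the framing I have in mind (illustrative PyTorch with made-up shapes, not any specific published transfer learning recipe): the pretrained backbone is kept frozen, so it functions as a fixed prior, and the second small dataset only adjusts a small head on top of it.

```python
# Toy transfer-learning-as-prior sketch; all shapes and data are illustrative.
import torch
import torch.nn as nn

# stand-in "pretrained" backbone (in practice: a large model trained elsewhere)
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 2)  # new task head

# freeze the backbone: the pretraining acts like a fixed prior over features
for p in backbone.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# the "second small training data set": a few labelled examples for the new task
x_small = torch.randn(64, 32)
y_small = torch.randint(0, 2, (64,))

for _ in range(100):
    logits = head(backbone(x_small))
    loss = loss_fn(logits, y_small)
    opt.zero_grad(); loss.backward(); opt.step()
```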
You have been arguing that neuro+scaling is all we need to drive AI progress, that there is no room for the neuro+symbolic+scaling approach. This argument rests on a hidden assumption that many academic AI researchers also like to make: the assumption that for all AI application domains that you are interested in, you will never run out of clean training data.
Doing academic AI research under the assumption that you always have infinite clean training data would be fine if such research were confined to one small and humble sub-branch of academic AI. The problem is that the actual branch of AI making this assumption is far from small and humble. It in fact claims, via writings like the Bitter Lesson, to be the sum total of what respectable academic AI research should be all about. It is also the sub-branch that gets almost all the hype and the press.
The assumption of infinitely available clean training data is of course true for games that can be learned by self-play. It is less true for many other things that we would like AI to be better at. The ‘top’ academic ML conferences are slowly waking up to this, but much too slowly as far as I am concerned.