Related to our discussion earlier, I see that Marcus and Davis just published a blog post: What does Meta AI’s Diplomacy-winning Cicero Mean for AI? In it, they argue, as you and I would both expect, that Cicero is a neurosymbolic system, and that its design achieves its results through several clever design choices beyond using more compute and more data alone. I expect you would disagree with their analysis.
Thanks for the very detailed description of your view on GAN history
and sociology—very interesting.
You focus on the history of benchmark progress after DL-based GANs were introduced as a new method for driving that progress. The point I was trying to make is about a different moment in history: my perception is that the original introduction of DL-based GANs was a clear discontinuity.
You write: “First, GANs may not be new.”
If you search wide enough for similar things, then no idea that works is really
new. Neural nets were also not new when the deep learning revolution
started.
I think your main thesis here is that the creativity and cleverness of academic researchers, their ability to come up with unexpected architecture improvements, has nothing to do with driving the pace of AI progress forward:
“This parallels other field-survey replication efforts like in embedding research: results get better over time, which researchers claim reflect the sophistication of their architectures… and the gains disappear when you control for compute/n/param.”
Sorry, but you cannot use a simple control-for-compute/n/param statistical approach to determine the truth of any hypothesis about how clever researchers really were in coming up with the innovations that keep an observed scaling curve going. For all you know, these curves are what they are precisely because everybody has been deeply clever at the architecture evolution/revolution level, or at the hyperparameter tuning level. But maybe I am mainly skeptical of your statistical conclusions here because you are leaving things out of the short description of the statistical analysis you refer to. So if you can give me a pointer to a more detailed statistical writeup, one that tries to control for cleverness too, please do.
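To make my skepticism concrete, here is a sketch of what I am guessing your analysis looks like, since your short description leaves the details out. Everything in it (variable names, the synthetic data, the regression setup) is a hypothetical placeholder of mine, not something taken from your writeup.

```python
# My guess at what a "control for compute/n/param" analysis does: regress
# benchmark score on log-compute, log-dataset-size and log-parameter-count,
# and ask whether a publication-year term still explains anything once those
# covariates are in the model. All data below is synthetic placeholder data.

import numpy as np

rng = np.random.default_rng(0)
n_papers = 100

log_compute = rng.normal(20, 2, n_papers)    # log FLOPs used
log_data = rng.normal(15, 1.5, n_papers)     # log dataset size
log_params = rng.normal(17, 1.5, n_papers)   # log parameter count
year = rng.uniform(2014, 2022, n_papers)     # publication year

# Synthetic scores driven only by compute/n/param, plus noise, so by
# construction the year term below will come out near zero.
score = (0.5 * log_compute + 0.3 * log_data + 0.2 * log_params
         + rng.normal(0, 1, n_papers))

X_full = np.column_stack([np.ones(n_papers), log_compute, log_data,
                          log_params, year])
coef, *_ = np.linalg.lstsq(X_full, score, rcond=None)
print("year coefficient after controlling for compute/n/param:", coef[-1])

# The claimed finding, as I understand it, is that this year coefficient
# (a stand-in for 'researcher cleverness over time') goes to ~0. My point is
# that such a regression cannot distinguish 'no cleverness' from 'cleverness
# is what keeps the compute/n/param scaling curve on trend in the first place'.
```

If your actual analysis differs materially from this sketch, that would be exactly the kind of detail I would like the pointer for.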
That being said, like you I perceive, in a more anecdotal way, that true architectural innovation is absent from a lot of academic ML work, or at least from the academic ML work appearing in the so-called ‘top’ AI conferences that this forum often talks about. I mostly attribute that to such academic ML focusing only on a very limited set of big-data / Bitter Lesson inspired benchmarks, benchmarks which are not all that relevant to many types of AI improvements one would like to see in the real world. In industry, where one often needs to solve real-world problems beyond those which are fashionable in academia, I have seen a lot more creativity in architectural innovation than in the typical ML benchmark-improvement paper. I see a lot of that industry-type creativity in the Cicero paper too.
You mention that your compute-and-data-is-all-that-drives-progress opinion has
been informed by looking at things like GANs for image generation and
embedding research.
The progress in these sub-fields differs from the type of AI technology progress that I, as an AI safety and alignment researcher, would like to see much more of. This also implies that I have a different opinion on what drives, or should drive, AI technology progress.
One benchmark that interests me is an AI out-of-distribution robustness benchmark where model training happens on sample data drawn from a first distribution, and model evaluation happens on sample data drawn from a different, second distribution, connected to the first only by the fact that the two processes generating them share some deeper patterns, like the laws of physics or broad parameters of human morality.
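To make the setup concrete, here is a minimal toy sketch of such a benchmark, entirely my own illustration with made-up numbers: the training and evaluation inputs come from disjoint ranges, and only the underlying generating law is shared between them.

```python
# Toy out-of-distribution robustness benchmark: train and test data come from
# *different* input distributions, connected only by a shared deeper pattern,
# here a simple "law of physics" for projectile height.

import numpy as np

def physical_law(t, v0=20.0, g=9.81):
    """Shared deeper pattern: height of a projectile at time t."""
    return v0 * t - 0.5 * g * t ** 2

rng = np.random.default_rng(0)

# Training distribution: times sampled from [0, 1] seconds.
t_train = rng.uniform(0.0, 1.0, size=200)
y_train = physical_law(t_train) + rng.normal(0.0, 0.1, size=200)

# Evaluation distribution: times sampled from [2, 3] seconds, disjoint from
# the training range but generated by the same underlying law.
t_test = rng.uniform(2.0, 3.0, size=200)
y_test = physical_law(t_test) + rng.normal(0.0, 0.1, size=200)

# A purely curve-fitting "model" with no physics prior: a degree-5 polynomial.
coeffs = np.polyfit(t_train, y_train, deg=5)
pred_train = np.polyval(coeffs, t_train)
pred_test = np.polyval(coeffs, t_test)

print("in-distribution MSE:     ", np.mean((pred_train - y_train) ** 2))
print("out-of-distribution MSE: ", np.mean((pred_test - y_test) ** 2))
# The OOD error is typically orders of magnitude worse: the fit interpolates
# the training range well but extrapolates poorly without a prior encoding
# the underlying law.
```

A model with a prior encoding the right functional form (here, a second-degree polynomial in t) would extrapolate fine; the point of such a benchmark is to reward exactly that kind of prior.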
This kind of out-of-distribution robustness problem is one of Marcus’s recurring themes too, at least for the physics part. One of the key arguments for the hybrid/neurosymbolic approach is that you will need to (symbolically) encode some priors about these deeper patterns into the AI if you ever want it to perform well on such out-of-distribution benchmarks.
Another argument for the neurosymbolic approach is that you often simply do not have enough training data to get your model robust enough if you start from a null prior, so you will need to compensate for this by adding some priors. Having deeply polluted training data also means you will need to add priors, or do lots of other tricks, to get the model you really want. There is an intriguing possibility that DNN-based transfer learning might contribute to the type of benchmarks I am interested in. This branch of research is usually framed in a way where people do not picture the second, small training data set used in the transfer learning run as a prior, but on a deeper level it definitely is one.
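To illustrate what I mean by that framing, here is a minimal transfer-learning sketch, with toy dimensions, data, and names that are entirely my own rather than taken from any particular paper: the pretrained weights plus the small second dataset together constrain what the final model can be, which is what I mean by calling them a prior.

```python
# Minimal transfer-learning sketch: a frozen pretrained backbone plus a small
# second dataset used to train a new task head. Toy sizes and random data are
# placeholders for illustration only.

import torch
import torch.nn as nn

# Stand-in for a network pretrained on a large first dataset.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU())
# Pretend these weights were learned elsewhere; here we simply freeze them.
for p in backbone.parameters():
    p.requires_grad = False

# New task head, trained only on the small second dataset.
head = nn.Linear(64, 1)

# Hypothetical small second dataset (50 examples).
x_small = torch.randn(50, 32)
y_small = torch.randn(50, 1)

opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(head(backbone(x_small)), y_small)
    loss.backward()
    opt.step()

# Viewed this way, the frozen pretrained features constrain (i.e. act as a
# prior on) the hypotheses the 50 small-dataset examples get to select between.
```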
You have been arguing that neuro+scaling is all we need to drive AI progress, that there is no room for a neuro+symbolic+scaling approach. This argument rests on a hidden assumption that many academic AI researchers also like to make: the assumption that, for every AI application domain you are interested in, you will never run out of clean training data.
Doing academic AI research under the assumption that you always have infinite clean training data would be fine if such research were confined to one small and humble sub-branch of academic AI. The problem is that the branch of AI actually making this assumption is far from small and humble. It in fact claims, via writings like the Bitter Lesson, to be the sum total of what respectable academic AI research should be all about. It is also the sub-branch that gets almost all the hype and the press.
The assumption of infinitely available clean training data is of course true for games that can be learned by self-play. It is less true for many other things that we would like AI to be better at. The ‘top’ academic ML conferences are slowly waking up to this, but much too slowly as far as I am concerned.