This seems like a fairly important paper by DeepMind on generalization (and the lack of it in current transformer models): https://arxiv.org/abs/2311.00871
Here’s an excerpt on transformers potentially not being able to generalize beyond their training data:
Our contributions are as follows:
We pretrain transformer models for in-context learning using a mixture of multiple distinct function classes and characterize the model selection behavior exhibited.
We study the in-context learning behavior of the pretrained transformer model on functions that are “out-of-distribution” from the function classes in the pretraining data.
In the regimes studied, we find strong evidence that the model can perform model selection among pretrained function classes during in-context learning at little extra statistical cost, but limited evidence that the models’ in-context learning behavior is capable of generalizing beyond their pretraining data.
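To make the setup described in the excerpt concrete, here is a minimal sketch (Python/NumPy, and definitely not the paper’s actual code) of what “pretraining data drawn from a mixture of multiple distinct function classes” and an “out-of-distribution” evaluation function might look like. The specific classes (dense and sparse linear functions) and the convex-combination OOD family are illustrative assumptions, not a claim about the paper’s exact choices.

```python
# Illustrative sketch (not the paper's code) of "pretraining on a mixture of
# function classes for in-context learning": each training example is a prompt
# of (x, f(x)) pairs, where f is drawn from one of several pretraining function
# classes. Class choices and parameters here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # dimension of the inputs x (illustrative)

def sample_dense_linear():
    w = rng.normal(size=DIM)
    return lambda x: x @ w

def sample_sparse_linear(k=2):
    w = np.zeros(DIM)
    w[rng.choice(DIM, size=k, replace=False)] = rng.normal(size=k)
    return lambda x: x @ w

# The "pretraining mixture": prompts come from one of these function classes.
PRETRAIN_MIXTURE = [sample_dense_linear, sample_sparse_linear]

def make_icl_prompt(sample_fn, n_points=16):
    """One in-context-learning sequence: (x_i, y_i) pairs the model conditions on."""
    f = sample_fn()
    xs = rng.normal(size=(n_points, DIM))
    ys = np.array([f(x) for x in xs])
    return xs, ys

# Pretraining data: prompts drawn from the mixture of function classes.
pretrain_batch = [
    make_icl_prompt(PRETRAIN_MIXTURE[rng.integers(len(PRETRAIN_MIXTURE))])
    for _ in range(4)
]

# "Out-of-distribution" evaluation: a function outside every pretraining class,
# e.g. a convex combination of a dense and a sparse linear function.
def sample_ood(alpha=0.5):
    f_dense, f_sparse = sample_dense_linear(), sample_sparse_linear()
    return lambda x: alpha * f_dense(x) + (1 - alpha) * f_sparse(x)

ood_prompt = make_icl_prompt(sample_ood)
```

The point of the analogy: the transformer is trained to predict the y values from the in-context (x, y) pairs, so the question the paper asks is what it does when the prompt comes from a function family it never saw during pretraining.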
i predict this kind of view of the non-magicalness of (2023-era) LMs will become more and more accepted, and this has implications for what kinds of alignment experiments are actually valuable (see my comment on the reversal curse paper). not an argument for long (50+ year) timelines, but it is an argument for medium (10-year) timelines rather than 5-year timelines.
also, this quote from the abstract is great:
Together our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.
i used to call this something like “tackling the OOD generalization problem by simply making the distribution so wide that it encompasses anything you might want to use it on”
I’d say my major takeaways, assuming this research scales (it was only done on GPT-2, and we already knew it couldn’t generalize), are:
Gary Marcus was right about LLMs mostly not reasoning outside the training distribution, and this updates me more towards “LLMs probably aren’t going to be godlike, or nearly as impactful as LW says they will be.”
Be more skeptical of AI progress leading to big things: in general, unless reality can simply be memorized, scaling probably won’t be enough to automate the economy. More generally, this updates me towards longer timelines, with longer tails on those timelines.
Be slightly more pessimistic on AI safety, since LLMs have a bunch of nice properties and future AI will probably have fewer of them, though alignment optimism mostly doesn’t depend on LLMs.
AI governance gets a lucky break, since it only has to regulate misuse, and even though that threat model isn’t particularly likely to be realized, it’s still nice that we don’t have to deal with the disruptive effects of AI now.
I am sharing this since I think it will change your view on how much to update on this paper (I should have shared this initially). Here’s what the paper author said on X:
Clarifying two things:
Model is simple transformer for science, not a language model (or large by standards today)
The model can learn new tasks (via in-context learning), but can’t generalize to new task families
I would be thrilled if this work was important for understanding AI safety and fairness, but it is the start of a scientific direction, not ready for policy conclusions. Understanding what task families a true LLM is capable of would be fascinating and more relevant to policy!
So, with that, I said:
I hastily thought the paper was using language models, so I think it’s important to share this. A follow-up paper using a couple of ‘true’ LLMs at different model scales would be great. Is it just interpolation? How far can the models extrapolate?
To which @Jozdien replied:
In retrospect, I probably should have updated much less than I did; I thought that it was actually testing a real LLM, which makes me less confident in the paper.
Should have responded long ago, but responding now.
Title: Is the alignment community over-updating on how scale impacts generalization?
So, apparently, there’s a rebuttal to the recent Google generalization paper (also worth pointing out that it wasn’t done with language models, just sinusoidal functions):
But then, the paper author responds:
This line of research makes me question one thing: “Is the alignment community over-updating on how scale impacts generalization?”
It remains to be seen how well models will generalize outside of their training distribution (interpolation vs extrapolation).
In other words, when people say that GPT-4 (and other LLMs) can generalize, I think they need to be more careful about what they really mean. Is it doing interpolation or extrapolation? Yes, GPT-4 can do things like write a completely new poem, but poems and related material were in its training distribution! So you can say it is generalizing, but that’s a much weaker form of generalization than what people usually imply by the word. A stronger form of generalization would be an AI that can do completely new tasks that are actually outside of its training distribution.
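For readers who want the interpolation/extrapolation distinction pinned down outside the LLM context, here is a toy example that assumes nothing about language models: a curve fit does well on inputs inside the range it was trained on and badly outside it, even though the underlying task is unchanged.

```python
# Toy illustration of interpolation vs. extrapolation (not about LLMs specifically):
# fit a polynomial to samples of sin(x) on [0, 2*pi], then evaluate it inside and
# outside that training range.
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.uniform(0.0, 2 * np.pi, size=200)
train_y = np.sin(train_x)

coeffs = np.polyfit(train_x, train_y, deg=9)   # fit a degree-9 polynomial
model = np.poly1d(coeffs)

inside = np.linspace(0.5, 5.5, 50)             # inside the training range
outside = np.linspace(8.0, 12.0, 50)           # outside the training range

print("interpolation MSE:", np.mean((model(inside) - np.sin(inside)) ** 2))
print("extrapolation MSE:", np.mean((model(outside) - np.sin(outside)) ** 2))
# Typically the interpolation error is tiny while the extrapolation error blows up,
# even though the "task" (predict sin(x)) is the same in both cases.
```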
Now, at this point, you might say, “yes, but we know that LLMs learn functions and algorithms to do tasks, and as you scale up and compress more and more data, you will uncover more meta-algorithms that can do this kind of extrapolation/tasks outside of the training distribution.”
Well, two things:
It remains to be seen when or if this will happen in the current paradigm (no matter how much you scale up).
It’s not clear to me how well things like induction heads continue to work on things that are outside of their training distribution. If they don’t adapt well, then it may be the same thing for other algorithms. What this would mean in practice, I’m not sure. I’ve been looking at relevant papers, but haven’t found an answer yet.
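As a point of reference for what “induction heads” refers to here, below is a hand-written cartoon of the prefix-matching-and-copying pattern they are commonly described as implementing (“find an earlier occurrence of the current token and copy what followed it”). This is only an illustration of the mechanism, not evidence either way about how real induction heads behave out of distribution.

```python
# A cartoon of the pattern induction heads are thought to implement:
# to predict the next token, look for the most recent earlier occurrence of the
# current token and copy whatever followed it ("[A][B] ... [A] -> [B]").
# This is a hand-written stand-in, not a neural network.
from typing import Optional, Sequence

def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Return the token that followed the most recent earlier occurrence
    of the final token, or None if the final token never appeared before."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> "cat"
print(induction_predict(["a", "b", "c", "d"]))                # -> None (no earlier match)
```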
This brings me to another point: it also remains to be seen how much this will matter in practice, given that models are trained on so much data and things like online learning are coming. Scaffolding, specialized AI models, and new innovations might make such a limitation, if there is one, not a big deal.
Also, perhaps most of the important capabilities come from interpolation. Perhaps intelligence is largely just interpolation? You just need to interpolate and push the boundaries of capability one step at a time, iteratively, like a scientist conducting experiments would. You just need to integrate knowledge as you interact with the world.
But what of brilliant insights from our greatest minds? Is it just recursive interpolation+small_external_interactions? Is there something else they are doing to get brilliant insights? Would AGI still ultimately be limited in the same way (even if it can run many of these genius patterns in parallel)?
Or perhaps, as @Nora Belrose mentioned to me: “Perhaps we should queer the interpolation-extrapolation distinction.”
Some evidence that this is not so fundamental, and that we should expect a phase transition (or many) to more generalizing in-context learning as we increase the (log) number of pretraining tasks.
My hot take is that this paper’s prominence is a consequence of importance hacking (I’m not accusing the authors in particular). Zero or near-zero relevance to LLMs.
Authors get a yellow card for abusing the word ‘model’ twice in the title alone.