I contest that there’s very little reason to expect “undesired, covert, and consistent-across-situations inner goals” to crop up in [LLMs as trained today] to begin with
As someone who considers deceptive alignment a concern: fully agree. (With the caveat, of course, that it’s because I don’t expect LLMs to scale to AGI.)
I think there’s a lot of talking-past-each-other in alignment in general, and what precisely people mean by “problem X will appear if we continue advancing/scaling” is one recurring source of it.
Like, of course a new problem won’t appear if we just keep doing the exact same thing that we’ve already been doing. Except “the exact same thing” is actually some equivalence class of approaches/architectures/training processes, but which equivalence class people mean can differ.
For example:
Person A, who’s worried about deceptive alignment, may treat “LLMs scaled arbitrarily far” as a proven-safe equivalence class of architectures. So when they say they’re worried about capability advancement bringing in new problems, what they mean is “if we move beyond the LLM paradigm, deceptive alignment may appear”.
Person B, hearing the first one, might model them as instead defining “LLMs trained with N amount of compute” as the proven-safe architecture class, and so interpret their words as “if we keep scaling LLMs beyond N, they may suddenly develop this new problem”. Which, on B’s model of how LLMs work, may seem utterly ridiculous.
And the tricky thing is that Person A likely equates the “proven-safe” architecture class with the “doesn’t scale to AGI” architecture class – so they actually expect the major AI labs to venture outside that class, the moment they realize its limitations. Person B, conversely, might disagree, might think those classes are different, and that safely limited models can scale to AGI/interesting capabilities. (As a Person A in this situation, I think Person B’s model of cognition is confused; but that’s a different topic.)
These are all important disconnects to watch out for.
(Uh, caveat: I think some people actually are worried about scaled-up LLMs exhibiting deceptive alignment, even without architectural changes. But I would delineate that cluster of views from the one I put myself in and outline above. And, likewise, I expect that the other people I would tentatively classify as belonging to this cluster – Eliezer, Nate, John – mostly aren’t worried about just-scaling-the-LLMs leading to deceptive alignment.)
I think it’s important to put more effort into tracking such definitional issues, though. People end up overstating things because they round off their interlocutors’ viewpoint to their own. For instance, if person C asks “is it safe to scale generative language pre-training and ChatGPT-style DPO arbitrarily far?”, and person D rounds this off to “is it safe to make transformer-based LLMs as powerful as possible?” and answers “no, because instrumental convergence and compression priors”, that answer is probably just false for the original meaning of the question.
If this repeatedly happens to the point of generating a consensus for the false claim, then that can push the alignment community severely off track.
LLMs will soon scale beyond the available natural text data, and generating synthetic data is arguably a change of architecture, potentially a completely different source of capabilities. So “scaling LLMs much further without a change of architecture” is an expectation about something counterfactual. It makes sense as a matter of theory, but it’s not relevant for forecasting.
Edit 15 Dec: No longer endorsed based on scaling laws for training on repeated data.
Bold claim. Want to make any concrete predictions so that I can register my different beliefs?
I’ve now changed my mind based on Muennighoff et al. (2023), “Scaling Data-Constrained Language Models”.
The main result is that up to 4 repetitions are about as good as unique data, and up to about 16 repetitions still give meaningful improvement. Let’s take 50T tokens as an estimate for available text data (as an anchor, the filtered and deduplicated CommonCrawl dataset RedPajama-Data-v2 has 30T tokens). Repeated 4 times, that can make good use of 1e28 FLOPs (with a dense transformer); repeated 16 times, suboptimal but still meaningful use of 2e29 FLOPs. So the data supply is close to, but not below, what can be put to use within a few years. Thanks for pushing back on the original claim.
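As a quick sanity check, those FLOPs figures can be reproduced with the Chinchilla rule of thumb C ≈ 6·N·D at the compute-optimal ratio D ≈ 20·N (so C ≈ 0.3·D²); the 50T-token base and the 4×/16× repetition factors are the assumptions from the comment above:

```python
# Back-of-envelope check of the compute figures above, using the
# Chinchilla rule of thumb C ~ 6*N*D with the compute-optimal ratio
# D ~ 20*N (i.e. C ~ 0.3*D**2). The 50T-token base and the 4x/16x
# repetition factors are assumptions taken from the comment.

def chinchilla_compute(tokens: float) -> float:
    """Training FLOPs for a compute-optimal dense transformer."""
    params = tokens / 20        # compute-optimal parameter count
    return 6 * params * tokens  # C ~ 6*N*D

base_tokens = 50e12  # ~50T tokens of unique text (assumed)

for reps in (4, 16):
    flops = chinchilla_compute(base_tokens * reps)
    print(f"{reps:2d} repetitions: ~{flops:.1e} FLOPs")
```

This lands at roughly 1.2e28 and 1.9e29 FLOPs, matching the 1e28 and 2e29 figures quoted above.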
Three points: how much compute is going into a training run, how much natural text data it wants, and how much data is available. For training compute, there are claims of multi-billion dollar runs being plausible and possibly planned in 2-5 years. Eyeballing various trends and GPU shipping numbers and revenues, it looks like about 3 OOMs of compute scaling is possible before industrial capacity constrains the trend and the scaling slows down. This assumes that there are no overly dramatic profits from AI (which might lead to finding ways of scaling supply chains faster than usual), and no overly dramatic lack of new capabilities with further scaling (which would slow down investment in scaling). That gives about 1e28-1e29 FLOPs at the slowdown in 4-6 years.
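The cost side of that estimate can be sketched too. All the inputs below (run cost, price per accelerator-hour, effective throughput) are illustrative round numbers of mine, not figures from the comment:

```python
# Rough cost-side sketch of what a multi-billion-dollar run buys.
# Every input here is an assumed round number for illustration only.

run_cost_usd = 5e9       # a "multi-billion dollar" training run (assumed)
usd_per_gpu_hour = 2.0   # assumed price per accelerator-hour
flops_per_sec = 4e14     # assumed effective throughput (peak * utilization)

gpu_hours = run_cost_usd / usd_per_gpu_hour
total_flops = gpu_hours * 3600 * flops_per_sec
print(f"~{total_flops:.0e} FLOPs")
```

Under these assumptions this comes out to a few times 1e27 FLOPs, within an order of magnitude of the 1e28-1e29 range, so the figures are at least mutually consistent.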
At 1e28 FLOPs, Chinchilla scaling asks for 200T-250T tokens. Various sparsity techniques increase effective compute, asking for even more tokens (when optimizing loss given fixed hardware compute).
Edit 15 Dec: I no longer endorse this point, based on scaling laws for training on repeated data.
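The 200T-250T figure can be rechecked by inverting the same Chinchilla rule of thumb; the exact tokens-per-parameter ratio varies between analyses, so the 20 below is an assumed round number:

```python
import math

# Invert the Chinchilla rule of thumb C ~ 6*N*D with D ~ r*N
# (so C ~ (6/r)*D**2) to get token demand at a given compute budget.
# r = 20 is an assumed round number; the exact ratio varies by analysis.

def chinchilla_tokens(flops: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens for a given FLOPs budget."""
    # C = 6 * (D / r) * D  =>  D = sqrt(r * C / 6)
    return math.sqrt(tokens_per_param * flops / 6)

print(f"tokens wanted at 1e28 FLOPs: ~{chinchilla_tokens(1e28):.1e}")
```

That gives roughly 1.8e14, i.e. ~180T tokens, in the same ballpark as the 200T-250T quoted (the exact figure depends on the assumed ratio).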
On the outside, there are 20M-150M accessible books, some text from video, and 1T web pages of extremely dubious uniqueness and quality. That might give about 100T tokens, if LLMs are used to curate? There’s some discussion (incl. comments) here; this is the figure I’m most uncertain about. In practice, absent good synthetic data, I expect multimodality to fill the gap, but that’s not going to be as useful as good text for improving chatbot competence. (Possibly the issue with the original claim in the grandparent is what I meant by “soon”.)
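To make the ~100T guess concrete, here’s one way the arithmetic could go; every number below is an illustrative assumption (the comment itself flags this as its most uncertain figure):

```python
# One illustrative reconstruction of the "~100T tokens" supply guess.
# Every number below is assumed, not sourced.

books = 100e6            # within the 20M-150M range above
tokens_per_book = 70e3   # assumed average tokens per book

web_pages = 1e12         # "1T web pages" from the comment
keep_fraction = 0.1      # assumed fraction surviving LLM curation
tokens_per_page = 1e3    # assumed average tokens per kept page

book_tokens = books * tokens_per_book
web_tokens = web_pages * keep_fraction * tokens_per_page
total = book_tokens + web_tokens
print(f"books: ~{book_tokens:.0e}, curated web: ~{web_tokens:.0e}, "
      f"total: ~{total:.0e} tokens")
```

Under these assumptions the curated web dominates and the total lands near 1e14, i.e. ~100T tokens; different curation assumptions move this figure a lot.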