Thank you for the insightful comments!! I’ve added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz’s):
I agree that extracting short-term modules from long-term systems is more likely than not to be extremely hard. (Also that we will have a better sense of the difficulty in the nearish future as more researchers work on this sort of task for current systems.)
I agree that the CEO point might be the weakest in the article. It seems very difficult to find high-quality evidence about the impact of intelligence on long-term strategic planning in complex systems, and this is a major source of my uncertainty about whether our thesis is true. Note that even if making CEOs smarter would improve their performance, it may still be the case that any intelligence boost is fully substitutable by augmentation with advanced short-term AI systems.
From published results I’ve seen (e.g. comparison of LSTMs vs Transformers in figure 7 of Kaplan et al., effects of architecture tweaks in other papers such as this one), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I’m very curious whether you’ve seen results that suggest otherwise (I wouldn’t be surprised if this were the case, the examples I’ve seen are very limited, and I’d love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no “floor” to hypothetically achievable scaling laws.
I agree that our argument should result in a quantitative adjustment to some folks’ estimated probability of catastrophe, rather than ruling out catastrophe entirely, and I agree that figuring out how to handle worst-case scenarios is very productive.
When you say “the AI systems charged with defending humans may instead join in to help disempower humanity”, are you supposing that these systems have long-term goals? (even more specifically, goals that lead them to cooperate with each other to disempower humanity?)
From published results I’ve seen (e.g. comparison of LSTMs vs Transformers in figure 7 of Kaplan et al., effects of architecture tweaks in other papers such as this one), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I’m very curious whether you’ve seen results that suggest otherwise (I wouldn’t be surprised if this were the case, the examples I’ve seen are very limited, and I’d love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no “floor” to hypothetically achievable scaling laws.
I usually think of the effects of R&D as multiplicative savings in compute, which sounds consistent with what you are saying.
For example, I think a conservative estimate might be that doubling R&D effort allows you to cut compute by a factor of 4. (The analogous estimate for semiconductor R&D is something like 30x cost reduction per 2x R&D increase.) These numbers are high enough to easily allow explosive growth until the returns start diminishing much faster.
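As a rough illustration of the two claims above (a minimal sketch, not taken from Kaplan et al.): if two architectures follow power laws with the same exponent and differ only in the prefactor, a fixed compute multiplier closes the gap at every scale, and a constant saving per doubling of R&D compounds geometrically. The functional form `a * compute**(-alpha)`, the constants, and all function names are assumptions chosen for illustration.

```python
import math


def loss(compute: float, a: float, alpha: float) -> float:
    """Assumed power-law scaling: loss = a * compute**(-alpha)."""
    return a * compute ** (-alpha)


def compute_multiplier(a_worse: float, a_better: float, alpha: float) -> float:
    """Constant factor of extra compute the worse architecture needs to match
    the better one, when both share the same exponent alpha."""
    return (a_worse / a_better) ** (1.0 / alpha)


def compute_savings(rd_multiplier: float, savings_per_doubling: float = 4.0) -> float:
    """Compute reduction from scaling R&D effort by rd_multiplier, assuming a
    constant savings factor per doubling (4x is the estimate in the text)."""
    return savings_per_doubling ** math.log2(rd_multiplier)


if __name__ == "__main__":
    alpha = 0.05  # illustrative exponent, not a measured value
    m = compute_multiplier(a_worse=1.1, a_better=1.0, alpha=alpha)
    for c in (1e3, 1e6, 1e9):
        # The same multiplier m closes the gap at every compute budget.
        print(c, loss(c * m, 1.1, alpha), loss(c, 1.0, alpha))
    # Savings compound: 2x R&D and 8x R&D under the 4x-per-doubling assumption.
    print(compute_savings(2.0), compute_savings(8.0))
```

Under these assumptions the multiplier does not depend on the compute budget, which is the sense in which a prefactor difference is "only" a multiplicative constant. A change in the exponent would not have this property.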
When you say “the AI systems charged with defending humans may instead join in to help disempower humanity”, are you supposing that these systems have long-term goals? (even more specifically, goals that lead them to cooperate with each other to disempower humanity?)
Yes. I mean that if we have alignment problems such that all the most effective AI systems have long-term goals, and if all of those systems can get what they want together (e.g. because they care about reward), then to predict the outcome we should care about what would happen in a conflict between (those AIs) vs (everyone else).
So I expect in practice we need to resolve alignment problems well enough that there are approximately competitive systems without malign long-term goals.
Would you agree that the current paradigm is almost in direct contradiction to long-term goals? At the moment, to a first approximation, the power of our systems is proportional to the logarithm of their number of parameters, and again to a first approximation, we need to take a gradient step per parameter in training. So what it means is that if we have 100 billion parameters, we need to make 100 billion iterations where we evaluate some objective/loss/reward value and adapt the system accordingly. This means that we had better find some loss function that we can evaluate on a relatively time-limited and bounded (input, output) pair rather than on a very long interaction.
Would you agree that the current paradigm is almost in direct contradiction to long-term goals?
I agree with something similar, but not this exact claim.
I think this provides a headwind that makes AIs worse at complex skills where performance can only be evaluated over long horizons. But it’s not a strong argument against pursuing long-horizon goals or any simple long-horizon behaviors. (Superhuman competence at long-horizon tasks doesn’t seem necessary for either of the mechanisms I’m suggesting.)
In particular, systems trained on lots of short-horizon datapoints can still learn a lot about how the world works at larger timescales. For example, existing LMs understand quite a bit about longer-horizon dynamics of the world despite being trained on next-token prediction. Such systems can make reasonable judgments about what actions would lead to effects in the longer run. As a result I’d expect smart systems can be quickly fine-tuned to pursue long-horizon goals (or might pursue them organically), even though they don’t have any complex cognitive abilities that don’t help improve loss on the short-horizon pre-training task.
Note that people concerned about AI safety often think about this concept under the same heading of horizon length. A relatively common view is that training cost scales roughly linearly with horizon length and so AI systems will be relatively bad at long-horizon tasks (and perhaps the timeline to transformative AI may be longer than you would think based on extrapolations from competent short-horizon behavior).
There are a few dissenting views: (i) almost all long-horizon tasks have rich feedback over short horizons if you know what to look for, so in practice things that feel like “long-horizon” behaviors aren’t really; (ii) although AI systems will be worse at long-horizon tasks, so are humans, and so it’s unlikely to be a major comparative advantage for AIs; (iii) most of the things we think of as sophisticated long-horizon behavior are just short-horizon cognitive behaviors (like carrying out reasoning or iterating on plans) applied to a question about long horizons.
(My take is that most planning and “3d chess” is basically short-horizon behavior applied to long-horizon questions, but there is an important and legitimate question about how much cognitive work like “forming new concepts” or “organizing information in your head” or “coming to deeply understand an area” effectively involves longer horizons.)
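To make the "training cost scales roughly linearly with horizon length" view concrete, here is a back-of-the-envelope sketch. It assumes every update needs one full episode of feedback and that each step of an episode costs the same; the function name, the unit cost, and the 1,000-step horizon are hypothetical.

```python
def training_cost(num_updates: int, horizon_steps: int, cost_per_step: float = 1.0) -> float:
    """Total cost if each update requires one full episode of horizon_steps steps."""
    return num_updates * horizon_steps * cost_per_step


if __name__ == "__main__":
    updates = 100_000_000_000  # the 100 billion updates from the example above
    short = training_cost(updates, horizon_steps=1)
    long = training_cost(updates, horizon_steps=1_000)  # hypothetical long horizon
    # Under this model the cost ratio equals the horizon ratio.
    print(long / short)
```

The dissenting views amount to saying that `horizon_steps` is smaller than it looks for most tasks, because useful feedback arrives well before the episode ends.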
Are you making a forecast about the inability of AIs in, say, 2026 to operate mostly autonomously for long periods in diverse environments, fulfilling goals? I’d potentially be interested to place bets with you if so.
My forecast would be that an AI that operates autonomously for long periods would be composed of pieces that make human-interpretable progress in the short term. For example, a self-driving car will eventually be able to drive from New York to Los Angeles, but I believe it would do so by decomposing the task into many small tasks of getting from point A to B. It would not learn to do so by being sent out into the world (or even a simulated world) and repeatedly playing a game where it gets a reward if it reaches Los Angeles and nothing if it doesn’t.
That sounds very different to me from “the current paradigm is almost in direct contradiction to long-term goals.” Maybe we agree after all. Possible remaining differences between us: Are you imagining 0 end-to-end training of the system, or just a small amount? For example, consider WebGPT and OpenAI’s more recent Minecraft agent. They were trained for many steps with unsupervised learning and then fine-tuned for a bit with end-to-end RL, if I recall correctly. Are you saying that insofar as AI operates autonomously for more than, say, 100,000 serial forward passes, it’ll involve 0 end-to-end training? If so then I’d disagree and say it’ll probably involve some.
Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
My high-level take on this disagreement:
If you train a system to accomplish tasks by taking a large number of small human-like steps, and then you finetune the system “a bit” based on actually achieving the goal, I think you will mostly get systems that pursue goals by taking human-like short-term steps but do so more competently. They will do things like avoiding strategies that don’t work in practice for them, and having better intuitions about how to select amongst many possible options; they won’t do things like carrying out long-term plans that aren’t made out of human-legible short-term steps.
I think that:
1. If these systems already understand a lot of human-illegible stuff, they might quickly start using it when fine-tuned and that may not always result in changes to legible plans.
2. If these systems start taking plans that achieve the goals in ways that humans don’t like, and then we fine-tune based on individual plan steps (while continuing to select for success), then you are likely to train the system to obfuscate its scheme.
3. If you train long enough end-to-end you will get novel behaviors that can be scarier, although I think the computational cost for doing so may be very large.
And overall I think there are enough threat models that we should be worried, and should try to develop machinery so that we don’t need to do the kind of training that could result in doom. But I also think the most likely scenario is more along the lines of what the OP is imagining, and we can stay significantly safer by e.g. having consensus at ML labs that #2 is likely to be scary and should be considered unacceptable. Ultimately what’s most important is probably understanding how to determine empirically which world you are in.
At the moment, to a first approximation, the power of our systems is proportional to the logarithm of their number of parameters, and again to a first approximation, we need to take a gradient step per parameter in training.
This is a bit of an unrelated aside, but I don’t think it’s so clear that “power” is logarithmic (or what power means).
One way we could try to measure this is via something like effective population. If N models with 2M parameters are as useful as kN models with M parameters, what is k? In cases where we can measure I think realistic values tend to be >4. That is, if you had a billion models with M parameters working together in a scientific community, I think you’d get more work out of 250 million models with 2M parameters, and so have greater efficiency per unit of compute.
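The effective-population comparison can be written out as simple arithmetic. This is only a sketch of the bookkeeping: the function names are made up, and `k = 5.0` is a hypothetical stand-in for "some value above 4", not a measured number.

```python
def effective_population(num_large_models: float, k: float) -> float:
    """Number of smaller models judged as useful as num_large_models models
    with twice the parameters, given the exchange rate k."""
    return k * num_large_models


def break_even_k(num_small_models: float, num_large_models: float) -> float:
    """Value of k at which the two populations are equally useful."""
    return num_small_models / num_large_models


if __name__ == "__main__":
    small, large = 1e9, 250e6  # the populations from the example above
    print(break_even_k(small, large))  # k at which the two are tied
    # With a hypothetical k above the break-even value, the larger models win.
    print(effective_population(large, k=5.0) > small)
```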
There’s still a question of how e.g. scientific output scales with population. One way you can measure it is by asking “If N people working for 2M years are as useful as kN people working for M years, what is k?” where I think that you also tend to get numbers in the ballpark of 4, though this is even harder to measure than the question about models. But I think most economists would guess this is more like root(N) than log(N).
That still leaves the question of how scientific output scales with time spent thinking. In this case it seems more like an arbitrary choice of units for measuring “scientific output.” E.g. I think there’s a real sense in which each improvement to semiconductors takes exponentially more effort than the one before. But the upshot of all of that is that if you spend 2x as many years, we expect to be able to build computers that are >10x more efficient. So it’s only really logarithmic if you measure “years of input” on a linear scale but “efficiency of output” on a logarithmic scale. Other domains beyond semiconductors grow less explosively quickly, but seem to have qualitatively similar behavior. See e.g. “Are Ideas Getting Harder to Find?”
Quick comment (not sure it’s related to any broader points): total compute for N models with 2M parameters is roughly 4NM^2 (since per Chinchilla, number of inference steps scales linearly with model size, and number of floating point operations also scales linearly, see also my calculations here). So an equal total compute cost would correspond to k=4.
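Spelled out, the arithmetic behind k=4 looks roughly like this. It is a sketch under the stated Chinchilla-style assumption that the number of steps D grows linearly with parameter count P, so per-model compute grows as P^2; the symbols P, D, c and c' are introduced here only for the derivation.

```latex
\begin{align*}
C(P) &\approx c \, P \, D(P), \qquad D(P) \propto P
      \;\Longrightarrow\; C(P) \approx c' P^2 \\
N \cdot C(2M) &\approx c' N (2M)^2 = 4 c' N M^2 \\
kN \cdot C(M) &\approx c' k N M^2 \\
4 c' N M^2 = c' k N M^2 &\;\Longrightarrow\; k = 4
\end{align*}
```

So k=4 is the break-even point in compute: if N models with 2M parameters are worth more than 4N models with M parameters, the larger models are the better use of a fixed budget.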
What I was thinking when I said “power” is that in most BIG-Bench scaling plots, if you put some measure of performance (e.g. accuracy) on the y axis, then it seems to scale in some linear or polynomial way in the log of parameters, and indeed I believe the graphs in that paper usually have log parameters on the x axis. It does seem that when we start to saturate performance (error tends to zero), the power laws kick in, and it’s more like inverse polynomial in the total number of parameters than their log.
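The two regimes described here can be contrasted with a toy model. Both functional forms and all constants below are assumptions for illustration, not fits to BIG-Bench data.

```python
import math


def accuracy_log_linear(params: float, slope: float, intercept: float) -> float:
    """Accuracy modeled as linear in log10(params), clipped to [0, 1]."""
    return min(1.0, max(0.0, intercept + slope * math.log10(params)))


def error_power_law(params: float, scale: float, exponent: float) -> float:
    """Error modeled as inverse polynomial in params: scale * params**(-exponent)."""
    return scale * params ** (-exponent)


if __name__ == "__main__":
    for p in (1e8, 1e9, 1e10, 1e11):
        # Log-linear regime: each 10x in parameters adds a fixed amount of accuracy.
        # Power-law regime: each 10x in parameters multiplies error by a fixed factor.
        print(p, accuracy_log_linear(p, slope=0.1, intercept=-0.6),
              error_power_law(p, scale=1.0, exponent=0.5))
```

In the first model a 10x increase in parameters buys an additive gain; in the second it buys a multiplicative reduction in error, which is why the choice of axis scale matters for what "logarithmic" means.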