Shouting Boo just delays it a little and makes it more likely to be good instead of bad. (Currently it is quite likely to be bad.)
I wouldn’t be nearly as confident as a lot of LWers here, and in particular I suspect this depends on some details and assumptions that aren’t made explicit here.
Well yeah, it depends on details and assumptions I didn’t make explicit—I wrote only four sentences!
If you have counterarguments to any of my claims I’d be interested to hear them, just in case they are new to me.
My biggest counterargument to the case that AI progress should be slowed down comes from an observation made by porby that AI systems fundamentally lack a property we theorize about them, one that is the foundational assumption behind AI risk:
Instrumental convergence, and its corollaries like power-seeking.
The important point is that current AI systems, and the most plausible future ones, have no incentive to learn instrumental goals. The type of AI that has enough space and few enough constraints to learn instrumental goals, like RL with a sufficiently unconstrained action space, is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.
Thus, instrumental convergence for AI systems is fundamentally wrong. Given that it is the foundational assumption behind why superhuman AI systems would pose any risk we couldn’t handle, many other arguments, for why we might want to slow down AI, for why the alignment problem is hard, and much of the rest of the discussion in the AI governance and technical safety spaces, especially on LW, become unsound: they are reasoning from an uncertain foundation, and at worst from a false premise to many false conclusions, like the argument that we should reduce AI progress.
Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach the AI topic, from alignment to safety and much more.
To be clear, the fact that the only flaw I could find in AI risk arguments is that they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong locally valid reasoning on LW, rather than motivated reasoning or other biases that transform true statements into false ones.
One particular consequence of the insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid incentivizing instrumental convergence; in particular, LLMs can be extremely capable without becoming arbitrarily capable or developing instrumental world models when given more resources.
I learned about the observation from this post below:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
Porby talks about why AI isn’t incentivized to learn instrumental goals; given how much this assumption gets used in AI discourse, sometimes implicitly, I think it’s of great importance that instrumental convergence is likely wrong.
I have other disagreements, but this is my deepest disagreement with your model (and with other models on which AI is especially dangerous).
EDIT: A new post on instrumental convergence came out, and it showed that many of the inferences made from it weren’t just unsound but invalid; in particular, Nick Bostrom’s Superintelligence was wildly invalid in applying instrumental convergence to reach strong conclusions on AI risk.
I’m glad I asked, that was helpful! I agree that instrumental convergence is a huge crux; if I were convinced that e.g. it wasn’t going to happen until 15 years from now, and/or that the kinds of systems that might instrumentally converge were always going to be less economically/militarily/etc. competitive than other kinds of systems, that would indeed be a huge revolution in my thought and would completely change the way I think about AI and AI risks, and I’d become much more optimistic.
I’ll go read the post you linked.
I’d especially read footnote 3, because it gave me a very important observation about why instrumental convergence is actually bad for capabilities, or at least not obviously good for capabilities or incentivized, especially with a lot of space to roam:

This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.

Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
I don’t quite get this. Sure, current models don’t have instrumental convergence, because they’re not general and don’t have all-encompassing world models that include themselves as objects in the world. But people are still working to build AGI. I wouldn’t have a problem with making ever smarter protein folders, or chip designers, or chess players. Such specialised AI will keep doing one and only one thing. I’m not entirely sure about ever smarter LLMs, as it seems like they’d get human-ish eventually; but since the goal of an LLM is to imitate humans, I also think they wouldn’t get, by definition, qualitatively superhuman in their output (though they could be quantitatively superhuman in the sheer speed at which they work). But I could see LLM-simulated personas being instrumentally convergent at some point.
However, if someone succeeds at building AGI, and depending on what its architecture is, that doesn’t need to be true any more. People dream of AGI because they want it to automate work or to take over technological development, but by definition, that sort of usefulness belongs to something that can plan and pursue goals in the world, which means it has the potential to be instrumentally convergent. If the idea is “then let’s just not build AGI”, I 100% agree, but I don’t think all of the AI industry right now does.
The point I’m trying to make is that the types of AI that are best for capabilities, including some of the more general capabilities like, say, automating alignment research, also don’t have much space for instrumental convergence. That matters because it makes it very easy to get alignment research essentially for free, as well as safe AI by default, without disturbing capabilities research: the most unconstrained power-seeking AIs are very incapable, so in practice the most capable AIs, the ones that could solve the full problem of alignment and safety, are safe by default, because instrumental convergence currently harms capabilities.
In essence, the AI systems that are both capable enough to do alignment and safety research on future AI systems and instrumentally convergent form a much smaller subset of capable AIs, and enough space for extreme instrumental convergence harms capabilities today, so it isn’t incentivized.
This matters because it’s much, much easier to bootstrap alignment and safety, and it means that OpenAI/Anthropic’s plans of automating alignment research have a good chance of working.
It’s not that we cannot lose or go extinct, but that losing isn’t the default anymore, which in particular means that a lot of changes to how we do alignment research are necessary, as a first step. But the instrumental convergence assumption runs so deep that even if it is only wrong up until a much later point in AI capability increases, that matters a lot more than you think.
EDIT: A footnote in porby’s post actually expresses it a bit more cleanly than I did, so here goes:

This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.

Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
The fact that instrumental goals with very few constraints are actually useless compared to non-instrumentally-convergent models is really helpful, as it means that a capable system is inherently easy to align and safe by default, or equivalently, that there is a strong anti-correlation between capabilities and instrumentally convergent goals.
I don’t understand why it helps that much if instrumental convergence isn’t expected. All it takes is one actor to deliberately make a bad agentic AI and you have all the problems, but with no “free energy” having been taken up beforehand by slightly bad, less powerful AIs, as there would have been if instrumental convergence happened. Slow takeoff seems to me to make much more of a difference.
I actually don’t think the distinction between slow and fast takeoff matters too much here, at least compared to what the lack of instrumental convergence offers us. The important part is that AI misuse is a real problem, but it is a much more solvable one, because misuse isn’t as convergent as hypothesized instrumental convergence would be. It matters, but it’s a problem that calls for drastically different methods, and the lack of instrumental convergence still reduces the danger expected from AI.