I don’t quite get this. Sure, current models don’t show instrumental convergence, because they’re not general and don’t have all-encompassing world models that include themselves as objects in the world. But people are still working on building AGI. I wouldn’t have a problem with ever smarter protein folders, or chip designers, or chess players; such specialised AI will keep doing one and only one thing. I’m not entirely sure about ever smarter LLMs, since it seems like they’d get human-ish eventually; but since the goal of an LLM is to imitate humans, I also think they wouldn’t, by definition, become qualitatively superhuman in their output (though they could be quantitatively superhuman in the sheer speed at which they can work). Still, I could see LLM-simulated personas becoming instrumentally convergent at some point.
However, if someone succeeds at building AGI, and depending on what its architecture is, that doesn’t need to be true any more. People dream of AGI because they want it to automate work or to take over technological development, but by definition, that sort of usefulness belongs to something that can plan and pursue goals in the world, which means it has the potential to be instrumentally convergent. If the idea is “then let’s just not build AGI”, I 100% agree, but I don’t think all of the AI industry right now does.
The point I’m trying to make is that the types of AI that are best for capabilities, including some of the more general capabilities like automating alignment research, also don’t leave much room for instrumental convergence. That matters because it makes it very easy to get alignment research essentially for free, and safe AI by default, without disturbing capabilities research: the most unconstrained power-seeking AIs are very incapable, so in practice the most capable AIs, the ones that could solve the full problem of alignment and safety, are safe by default, because instrumental convergence currently harms capabilities.
In essence, the AI systems that are both capable enough to do alignment and safety research on future AI systems and also instrumentally convergent are a much smaller subset of capable AIs, and leaving enough room for extreme instrumental convergence harms capabilities today, so it isn’t incentivized.
This matters because it’s much, much easier to bootstrap alignment and safety, and it means that OpenAI/Anthropic’s plans of automating alignment research have a good chance of working.
It’s not that we cannot lose or go extinct, but that losing isn’t the default anymore, which in particular means that a lot of changes to how we do alignment research are necessary, as a first step. And the instrumental convergence assumption runs so deep that even if it is only wrong up until a much later point in the growth of AI capabilities, that matters a lot more than you might think.
EDIT: A footnote in porby’s post actually expresses it a bit more cleanly than I did, so here goes:
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
The fact that instrumental goals with very few constraints are actually useless compared to non-instrumentally convergent models is really helpful, as it means that a capable system is inherently easy to align and safe by default, or equivalently that there is a strong anti-correlation between capabilities and instrumentally convergent goals.