The point I’m trying to make is that the types of AI that are best for capabilities, including more general capabilities like automating alignment research, also don’t leave much room for instrumental convergence. That matters because it makes it far easier to get alignment research essentially for free, as well as safe AI by default, without disturbing capabilities research: the most unconstrained power-seeking AIs are very incapable, so in practice the most capable AIs, the ones that could solve the full problem of alignment and safety, are safe by default, because instrumental convergence currently harms capabilities.
In essence, the AI systems that are both capable enough to do alignment and safety research on future AI systems and also instrumentally convergent form a much smaller subset of capable AIs, and leaving enough room for extreme instrumental convergence harms capabilities today, so it isn’t incentivized.
This matters because it makes it much, much easier to bootstrap alignment and safety, and it means that OpenAI’s and Anthropic’s plans to automate alignment research have a good chance of working.
It’s not that we cannot lose or go extinct, but that losing isn’t the default outcome anymore, which in particular means that a lot of changes to how we do alignment research are necessary, as a first step. And the instrumental convergence assumption runs so deep that even if it is only wrong up until a much later point in AI capability growth, that matters a lot more than you might think.
EDIT: A footnote in porby’s post actually expresses this a bit more cleanly than I did, so here it is:
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
The fact that instrumental goals with very few constraints are actually useless compared to non-instrumentally-convergent models is really helpful, as it means that a capable system is inherently easier to align and safe by default, or equivalently, that there is a strong anti-correlation between capabilities and instrumentally convergent goals.