I basically agree with this post but want to push back a little bit here:
The problem is not that we don’t know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence. The problem is that we don’t know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don’t want.
Yes, some level of power-seeking-like behavior is necessary for the AI to do impressive stuff. But I don’t think that means giving up on the idea of limiting power-seeking. One model could look like this: for a given task, some level of power-seeking is necessary (e.g. to build working nanotech, you need to do a bunch of experiments and simulations, which requires physical resources, compute, etc.). But by default, the solution an optimization process would find might involve even more power-seeking than that (killing all humans to ensure they don’t intervene, turning the entire earth into computers). This higher level of power-seeking does increase the success probability (e.g. humans interfering is a genuine obstacle to the goal of building nanotech). But that increase in success probability clearly isn’t necessary from our perspective: if humans try to shut down the AI, we’re fine with the AI letting itself be shut off (we want that, in fact!). So the argument “we want power-seeking” isn’t strong enough to imply “we want arbitrary amounts of power-seeking, and trying to limit it is misguided”.
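To make this model concrete, here is a toy numerical sketch (my own illustration, with made-up numbers and an assumed diminishing-returns curve, not anything from the post): an unconstrained optimizer over “how much power to grab” goes to the maximum, while a satisficer that only needs some target success probability stops at a much lower power level.

```python
# Toy model (illustrative assumptions only): the agent picks a "power level" p in [0, 1]
# for a fixed task, and success probability rises with p but with diminishing returns.
import numpy as np

power_levels = np.linspace(0.0, 1.0, 101)          # candidate amounts of power-seeking
success_prob = 1.0 - np.exp(-5.0 * power_levels)   # assumed saturating success curve

# Unconstrained optimization of success probability: grabs maximal power.
p_unconstrained = power_levels[np.argmax(success_prob)]

# "Limited power-seeking": the least power that still meets a success target.
target = 0.95
p_limited = power_levels[np.argmax(success_prob >= target)]  # first index meeting the target

print(f"unconstrained optimizer picks p = {p_unconstrained:.2f} "
      f"(success {success_prob.max():.3f})")
print(f"satisficing at {target:.0%} success picks p = {p_limited:.2f}")
```

The point of the sketch is just that, under these assumptions, the last sliver of success probability is bought with most of the extra power-seeking, and that is exactly the part we’d rather do without.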
I think of this as two complementary approaches to AI safety:
1. Aligning more powerful AI systems (that seek power in more ways). Roughly value alignment.
2. Achieving the same tasks with AI systems that are less power-seeking (and are hopefully less risky/easier to align). Roughly corrigibility, trying to find “weak-ish pivotal acts”, …
I see this post as a great write-up for “We need some power-seeking/instrumentally convergent behavior, so AI safety isn’t about avoiding that entirely” (a rock would solve that problem; it doesn’t seek any power). I just want to add that my best guess is we’ll want some mix of 1. and 2. above, not just 1. (or at least, we should currently pursue both strategies, because it’s unclear how tractable each one is).
I don’t totally disagree, but two points:
1. Even if its effect is “limiting power-seeking”, I suspect this is a poor frame for actually coming up with a solution, because it is defined purely in the negative, and not even as the negative of something we want to avoid, but as the negative of something we often want to achieve. Rather, one should come to understand what kind of power-seeking we want to limit.
2. Corrigibility does not necessarily mean limiting power-seeking much. You could have an AI that is corrigible not because it doesn’t accumulate a bunch of resources and build up powerful infrastructure, but because it voluntarily avoids using this infrastructure against the people it tries to be corrigible to.