The problem is not that we don’t know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence.
Yes, this is still underappreciated in most alignment discourse, perhaps because power-seeking has unfortunate negative connotations. A better, less loaded term might be Optionality-seeking. For example, human friendships increase long-term optionality (more social invites, social support, dating and business opportunities, etc.), so a human trading some wealth for activities that build and strengthen friendships can be instrumentally rational for optionality-maximizing empowerment, even though that doesn’t fit the (incorrect) stereotype of ‘power-seeking’.
The problem is that we don’t know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don’t want.
Well, if humans are also agents to which instrumental convergence applies, as you suggest here:
Imitation learning is useful due to Aumann’s Agreement Theorem and because instrumental convergence also applies to human intelligence
Then that suggests we can use instrumental convergence to help solve alignment, because optimizing for human empowerment becomes equivalent to optimizing for our unknown long-term values.
There are some caveats, of course: we may still need to incorporate some model of short-term values like hedonic reward, and it’s also important to identify the correct agency to empower, which is probably not as simple as individual human brains. Humans are not purely selfishly rational but are partially altruistic; handling that probably requires something like empowering humanity or generic agency more broadly, or empowering distributed software simulacra minds rather than individual brains.
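For concreteness, by empowerment I mean the information-theoretic notion: the channel capacity from an agent’s action choices to its future states at a given state. Below is a minimal sketch of that quantity for a single state of a toy discrete MDP, computed with Blahut-Arimoto; the transition matrices, function name, and solver choice are my own illustrative assumptions, not a proposal for how to estimate human empowerment in practice.

```python
import numpy as np

def empowerment(P, iters=200, tol=1e-10):
    """Empowerment of a fixed state: channel capacity max over p(a) of I(A; S'),
    where P[a, s'] = p(s' | s, a). Computed with the Blahut-Arimoto algorithm."""
    n_actions, _ = P.shape
    p_a = np.full(n_actions, 1.0 / n_actions)  # start from a uniform action distribution
    for _ in range(iters):
        p_s = p_a @ P  # marginal over next states under the current action distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(P > 0, np.log(P / p_s), 0.0)
        # Blahut-Arimoto score: KL( p(s'|a) || p(s') ) for each action
        scores = (P * log_ratio).sum(axis=1)
        new_p_a = p_a * np.exp(scores)
        new_p_a /= new_p_a.sum()
        if np.max(np.abs(new_p_a - p_a)) < tol:
            p_a = new_p_a
            break
        p_a = new_p_a
    # mutual information I(A; S') at the optimizing action distribution, in nats
    p_s = p_a @ P
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(P > 0, np.log(P / p_s), 0.0)
    return float((p_a[:, None] * P * log_ratio).sum())

# Toy transition matrices (rows = actions, columns = next states):
dead_end   = np.array([[1.0, 0, 0], [1.0, 0, 0], [1.0, 0, 0]])  # every action leads to the same state
open_state = np.eye(3)                                          # each action reaches a distinct state
print(empowerment(dead_end))    # ~0.0   (no options)
print(empowerment(open_state))  # ~1.099 (= log 3 nats, three distinguishable futures)
```

The dead-end state, where every action collapses to the same successor, has zero empowerment, while the state with three distinguishable successors has log(3) nats; an empowerment-maximizing assistant would steer the human toward states of the second kind. The open questions above (whose empowerment, over what horizon, combined with what short-term value model) are about what to plug into this objective, not how to compute it.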
I totally agree that the choice of “power seeking” is very unfortunate, for the same reasons you describe. I don’t think optionality is quite it, though. I think “consequentialist” or “goal seeking” might be better (or we could just stick with “instrumental convergence”, which at least has neutral affect).
As for it being underappreciated, that is possibly true, though anecdotally I already strongly believed this, and in fact a large part of my generator for why I think alignment is difficult is based on it.
I think I disagree about leveraging this for alignment, but I’ll read your proposal in more detail before commenting on that further.
Power-seeking has unfortunate negative connotations. A better, less loaded term might be Optionality-seeking.
I think some people use the term “power-seeking” to refer specifically to its negative connotations (hacking into a data center, developing and deploying harmful bioweapons to retain control, etc.).