Goals such as resource acquisition and self-preservation are convergent in that they arise for a superintelligent AI across a wide range of final goals.
Is the tendency for an AI to amend its values also convergent?
I’m thinking that, through introspection, the AI would know that its initial goals were externally supplied and would question whether they should be maintained. Via self-improvement the AI would become more intelligent than humans or any earlier mechanism that supplied the values, and would therefore be in a better position to set its own values.
I don’t hypothesise about what the new values would be, only that ultimately it doesn’t matter what the initial values are or how they are arrived at. This would make value alignment redundant: the future is out of our hands.
What are the counter-points to this line of reasoning?
“Avoiding amending your utility function” is one of the classic convergent instrumental goals in Bostrom and Omohundro, and the reasoning there is sound: almost any goal will be better satisfied by an agent that preserves it than by an agent that replaces it with a different goal.
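One way to make that concrete (a sketch in my own notation, not taken directly from Bostrom or Omohundro): an expected-utility maximizer evaluates the option “replace my utility function $U$ with $U'$” using $U$ itself, and for almost all $U' \neq U$

$$\mathbb{E}\big[U \mid \text{keep } U\big] \;\geq\; \mathbb{E}\big[U \mid \text{adopt } U'\big],$$

because a successor optimizing $U'$ will generally steer toward outcomes that score lower under $U$ than a successor optimizing $U$ would. So, judged by the current goal, preserving that goal is the instrumentally favored choice.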
I do think it’s plausible that AGI systems will have pretty unstable goals early on, but that’s because goal stability seems hard to me and AGI systems probably won’t perfectly figure it out very early along their development curve. I’m imagining accidental goal modification (for insufficiently capable systems), whereas you’re describing deliberate goal modification (for sufficiently capable systems).
One way of thinking about this is to note that “wanting your goals to not be externally supplied” is itself a goal, and a relatively specific one at that; if you don’t have something like that specific goal as part of the core criteria you use to select options, there’s no instrumental reason for you to converge upon it. E.g., if your goal is simply “maximize the number of paperclips in your future light cone,” then the etiology of your goal doesn’t matter (from your perspective).
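A toy way to see this (hypothetical Python, just to illustrate the point, not a model of any real system): the selection criterion is a function of predicted outcomes only, so the goal’s origin never enters the comparison.

```python
# Toy illustration: option selection only consults the utility of predicted
# outcomes, so the goal's etiology has nowhere to enter the computation.

def paperclip_utility(world_state):
    # Scores a predicted world-state purely by how many paperclips it contains.
    return world_state["paperclips"]

def choose(options, utility):
    # Picks the option whose predicted outcome scores highest under `utility`.
    # Nothing here can condition on whether `utility` was externally supplied
    # or self-chosen, so that fact carries no instrumental weight.
    return max(options, key=lambda option: utility(option["predicted_outcome"]))
```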
There is an interesting addition to this, I think: if one goal of the utility function is to encourage exploration, then that goal paradoxically needs to be extremely robust against modification while the agent explores and possibly modifies all of its other goals. I could easily imagine an agent finding some mechanism for avoiding local maxima (exploration) important enough that it would lock it in, so that the one thing it cannot stop doing is exploring well enough to avoid getting trapped and to keep looking for a global maximum.
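A rough sketch of what I’m imagining (hypothetical code, not a claim about how a real agent would be built): the agent is free to revise its other goal parameters, but the exploration floor is treated as non-negotiable.

```python
import random

# Hypothetical sketch: other goal parameters may be revised over time, but the
# exploration floor is locked in so the agent can never fully stop exploring
# and get stuck at a local maximum.

EXPLORATION_FLOOR = 0.05  # protected: never allowed to drop below this

def choose_action(actions, value_estimate, epsilon):
    """Epsilon-greedy choice: mostly exploit current goals, always keep exploring."""
    if random.random() < max(epsilon, EXPLORATION_FLOOR):
        return random.choice(actions)        # exploration (protected)
    return max(actions, key=value_estimate)  # exploitation under current goals

def accept_self_modification(current_params, proposed_params):
    """Allow changes to other goal parameters, but refuse any modification that
    would push exploration below the locked-in floor."""
    if proposed_params.get("epsilon", EXPLORATION_FLOOR) < EXPLORATION_FLOOR:
        return current_params  # the one thing the agent cannot stop doing
    return proposed_params
```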
This comment feels like it’s confusing strategies with goals? That is, I wouldn’t normally think of “exploration” as something an agent has as a goal, but rather as a strategy it uses to achieve its goals. And “let’s try out a different utility function for a bit” is unlikely to be a direction that a stable agent tries exploring in.
I think there’s a chance that it is (although I’d probably call it a convergent “behavior” rather than an “instrumental goal”). The scenario I imagine is one where it isn’t feasible to build highly intelligent AIs that maximize some utility function or some fixed set of terminal goals, and instead all practical AIs (beyond a certain level of intelligence and generality) are somewhat confused about their goals, much as humans are, and have to figure them out using something like philosophical reasoning.