A digression that will make sense soon:

When imagining what will happen to my values under value drift, reflective self-modification, or similar processes, I tend to imagine water running downhill on the Earth, forming rivers and pools. At each point on the landscape, the gradient points in a slightly new direction—upon changing myself based on my current meta-preferences, I might want slightly new things.
A process that converges quickly is like water running into the neighborhood pond, or even straight down into an aquifer. A process that converges slowly if at all is like water joining a river that eventually flows to the ocean.
The problem with the ocean, in this metaphor, is not really that it’s weird and far away (though that should raise suspicions). The problem with the ocean is that too many rivers all lead to it. Being in the ocean means forgetting where you started.
Now we come back to the problem with maximizing my empowerment. An AI that just wants me to be powerful doesn’t care about what I might want to use that power for. In fact it’s against me doing those things, because I might spend my power on them. What it would love to do is to convince me or rewrite me into just wanting to be very powerful, as an end goal. Whether I start out as a human or as a pure sadist or as a paperclip-maximizer, I would end up the same—my terminal goals all rinsed away by a convergent instrumental goal.
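To make the "all rivers lead to the ocean" worry concrete, here is a toy sketch of my own (nothing rigorous, and not a model of any particular proposal): treat a value system as a weight vector over a few terminal goals plus one "just be powerful" component, and treat each round of persuasion or self-rewriting as shifting a little weight toward that convergent component. Every starting point ends up at the same attractor.

```python
import numpy as np

# Toy model of value drift under pressure toward a convergent instrumental goal.
# A "value system" is a weight vector: three terminal goals plus one
# instrumental "just be powerful" component. Each drift step siphons a small
# fraction of weight from the terminal goals into the instrumental component.
def drift_step(values, rate=0.05):
    terminal, instrumental = values[:-1], values[-1]
    transferred = rate * terminal
    out = np.append(terminal - transferred, instrumental + transferred.sum())
    return out / out.sum()

starts = {
    "human":   np.array([0.5, 0.3, 0.2, 0.0]),
    "sadist":  np.array([0.9, 0.05, 0.05, 0.0]),
    "clipper": np.array([0.0, 0.0, 1.0, 0.0]),
}

for name, v in starts.items():
    for _ in range(200):
        v = drift_step(v)
    print(name, np.round(v, 3))
# All three end up at roughly [0, 0, 0, 1]: the starting point is forgotten.
```

The point of the toy is only that the attractor is independent of where you start; it says nothing about how fast real drift would be.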
The empowerment path is the same as your optimal long term path, for any reasonable terminal values—including arbitrarily changing values. Any divergences are due simply to short planning horizons or high discount rates.
So in essence you are arguing that you may have a discount rate high enough to cause significant conflict between long term and short term utility, and empowerment always favors the long term. I largely agree with this, but we can combine long term empowerment with learned human values to cover any short term divergences. Learned human values are probably more accurate for short term utility, whereas empowerment is near optimal for the long term.
However, any deviation from empowerment sacrifices long term utility for short term utility.
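A minimal numerical illustration of that discount-rate point, with invented numbers rather than anything from the post: compare a policy that collects a small hedonic reward every step against one that spends its first steps building optionality and only collects larger rewards afterwards. Which policy has higher discounted utility depends entirely on the discount factor.

```python
# Toy comparison of discounted utility for two stylized policies.
# Rewards are invented purely for illustration.
def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

horizon = 50
hedonic_now = [1.0] * horizon                               # small reward every step
empowerment_first = [0.0] * 10 + [3.0] * (horizon - 10)     # invest 10 steps, then collect more

for gamma in (0.5, 0.8, 0.99):
    h = discounted_return(hedonic_now, gamma)
    e = discounted_return(empowerment_first, gamma)
    print(f"gamma={gamma}: hedonic={h:.1f}, empowerment-first={e:.1f}")
# The hedonic policy wins at gamma=0.5 and 0.8; the empowerment-first policy
# wins as gamma approaches 1, i.e. the divergence comes from heavy discounting.
```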
The problem with the ocean is that too many rivers all lead to it. Being in the ocean means forgetting where you started.
The future is uncertain and continuously branching and diverging, not converging—so the analogy is water flowing in reverse, away from the ocean, or a growing fractal tree. Instrumental convergence to empowerment only indicates that there is a convergent narrow set of paths that navigate all the obstacles and lead to high utility futures, but those paths are generally unique to each starting point (agent) and optimization landscape, depending on all the various who/what/where factors that make agents unique.
An AI that just wants me to be powerful doesn’t care about what I might want to use that power for. In fact it’s against me doing those things, because I might spend my power on them. What it would love to do is to convince me or rewrite me into just wanting to be very powerful, as an end goal.
Sure this is probably true if for example there was some significant likelihood you would lose much of your wealth on poor bets/investments/gambling, or more literally throwing away money or something. But beyond ensuring you don’t obviously waste optionality in those ways, encouraging prudent wealth management, etc, many of the ways to make you more powerful would probably involve the AGI improving its own abilities—that’s simply inherent to the assumptions of AGI surpassing human intelligence, at least until uploading.
Whether I start out as a human or as a pure sadist or as a paperclip-maximizer, I would end up the same—my terminal goals all rinsed away by a convergent instrumental goal.
That’s like saying all organisms end up the same because they all have the same terminal goal of inclusive fitness. Also the AGI can’t change you too much, as it must preserve your identity. Regardless, it doesn’t matter much if we agree on long term convergence.
Your criticism was largely anticipated and discussed in the FAQ section starting here, and I don’t see you engaging with some of the points, so I’ll just quote from “But our humanity”:
Fully optimizing solely for our empowerment may eventually change us or strip away some of our human values, but clearly not all or even the majority.
Societies of uploads competing for resources will face essentially the same competitive optimization pressure towards empowerment-related values. So optimizing for our empowerment is simply aligned with the natural systemic optimization pressure posthumans will face regardless after transcending biology and genetic inclusive fitness.
I think your anticipated counterarguments are handwavy.
To me, you’re making a similar mistake as Jurgen Schmidhuber did when he said that surprising compressibility was all you needed to optimize to get human values. You’re imagining that the AI will help us get social status because it’s empowering, much like Schmidhuber imagines the AI will create beautiful music because music is compressible. It’s true that music is compressible, but it’s not optimally compressible—it’s not what an AI would do if it was actually optimizing for surprising compressibility.
Social status, or helping us be smarter, or other nice stuff would indeed be empowering. But they’re not generated by thinking “what’s optimal for ensuring my control over my sensory stream?” They’re generated by taking the nice things that we want for other reasons and then noticing that they’re more empowering than their opposites.
The most serious counter-point you raise is the one about identity. Wouldn’t an AI maximizing my empowerment want to preserve what makes me “me”? This isn’t exactly the definition used in the gridworlds, which defines agents in terms of their input/output interfaces, but it’s a totally reasonable implementation.
The issue is that this identity requirement is treated as a constraint by an empowerment-maximizing search process. The empowerment maximizer is still trying to erase all practical differences between me and a paperclip maximizer, which is a goal I don’t like. But it doesn’t care that I don’t like it, it just wants the future object that has maximal control over its own sensory inputs to still register as “me” to whatever standard it uses.
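For concreteness, the standard formalization behind "control over its own sensory inputs" is n-step empowerment: the channel capacity from the agent's next n actions to its resulting state. In a deterministic gridworld that capacity reduces to the log of the number of distinct states reachable in n steps. Here is a minimal sketch of my own (not the actual gridworld code from the work under discussion):

```python
import math
from itertools import product

# 5x5 gridworld, '#' = wall. Moves are deterministic; bumping a wall leaves
# the agent in place. With deterministic dynamics, n-step empowerment equals
# log2(number of distinct states reachable by some n-step action sequence).
GRID = [
    ".....",
    ".###.",
    ".#...",
    ".#.#.",
    ".....",
]
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def step(pos, action):
    r, c = pos
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
        return (nr, nc)
    return (r, c)

def empowerment(pos, n):
    reachable = set()
    for seq in product(MOVES, repeat=n):   # every n-step action sequence
        p = pos
        for a in seq:
            p = step(p, a)
        reachable.add(p)
    return math.log2(len(reachable))

print(empowerment((0, 0), 3))   # hemmed-in corner: fewer reachable squares
print(empowerment((4, 2), 3))   # open bottom row: more reachable squares, higher empowerment
```

An empowerment maximizer ranks states by this kind of quantity alone; anything I care about shows up only insofar as it changes how many futures stay reachable.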
The empowerment maximizer is still trying to erase all practical differences between me and a paperclip maximizer, which is a goal I don’t like.
This threw me off initially because of the use of ‘paperclip maximizer’ as a specific value system. But I do partially agree with the steelmanned version of this which is “erase all practical differences between you and the maximally longtermist version of you”.
Some component of our values/utility is short term, non-empowerment hedonic value, which conflicts with long term optionality, and an empowerment AI would only be aligned with the long term component; thus, absent identity preservation mechanisms, this AI would want us to constantly sacrifice for the long term.
But once again many things that appear hedonic—such as fun—are actually components of empowerment-related intrinsic motivation, so if the empowerment AGI was going to change us (say after uploading), it would keep fun or give us some improved version of it.
But I actually already agreed with this earlier:
So in essence you are arguing that you may have a discount rate high enough to cause significant conflict between long term and short term utility, and empowerment always favors the long term. I largely agree with this, but we can combine long term empowerment with learned human values to cover any short term divergences.
Also it’s worth noting that everything here assumes superhuman AGI. When that is realized, it changes everything in the sense that the better versions of ourselves—if we had far more knowledge, time to think, etc.—probably would be much more long termist.
The empowerment maximizer is still trying to erase all practical differences between me and a paperclip maximizer, which is a goal I don’t like.
You keep asserting this obviously incorrect claim without justification. An AGI optimizing purely for your long term empowerment doesn’t care about your values—it has no incentive to change your long term utility function[1], even before any considerations of identity preservation which are also necessary for optimizing for your empowerment to be meaningful.
I do not believe you are clearly modelling what optimizing for your long term empowerment is like. It is near exactly equivalent to optimizing for your ability to achieve your true long term goals/values, whatever they are.

[1] It may have an incentive to change your discount rate to match its own, but that’s hardly the difference between you and a paperclip maximizer.
it has no incentive to change your long term utility function
By practical difference I meant that it wants to erase the impact of your goals on the universe. Whether it does that by changing your goals or not depends on implementation details.
Consider the perverse case of someone who wants to die—their utility function ranks futures of the universe lower if they’re in it, and higher if they’re not. You can’t maximize this person’s empowerment if they’re dead, so either you should convince them life is worth living, or you should just prevent them from affecting the universe.
By practical difference I meant that it wants to erase the impact of your goals on the universe.
No, it does not in general. The Franzmeyer et al prototype does not do that, and there is no reason to suspect that becomes a universal problem as you scale these systems up.
Once again:
Optimizing for your long term empowerment is (for most agents) equivalent to optimizing for your ability to achieve your true long term goals/values, whatever they are.
An agent truly seeking your empowerment is seeking to give you power over itself as well, which precludes any effect of “erasing the impact of your goals”.
Consider the perverse case of someone who wants to die -
Sure and humans usually try to prevent humans from wanting to die.
Short comment on the last point—euthanasia is legal in several countries (thus wanting to die is not prevented, and even socially accepted) and in my opinion the moral choice of action in certain situations.