I was actually thinking about this issue recently, but I hadn’t heard the term “optimizer’s curse” yet. Regardless, there is a pretty straightforward solution related to empowerment.
The key issue is uncertainty in value estimates. But where does that uncertainty primarily come from? For most agents and most real-world complex decisions, the primary source of value uncertainty is model prediction uncertainty, which compounds over time. However, the fact that uncertainty grows farther out into the planning horizon also implies that uncertainty about specific future dates shrinks as time advances and those dates get closer to the present.
The problem with the optimizer’s curse flows from the broken assumption that a state’s utility should use the deterministic maximum utility of its successor states. Take something like your example, where we have 10 future states B0, B1, ..., B9 with noisy expected utility estimates. But we aren’t actually making that decision now; it is a future decision. Instead, say that in the current moment the agent is deciding between two current options, A0 and A1, neither of which has immediate utility, but A0 enables all of the B options while A1 enables only one: B9. Now suppose B9 happens to have the highest predicted expected utility, but only due to prediction noise. Computing the chained EU of the A states from the max discounted successor utility (as in Bellman recursion) results in the same utility for A0 and A1, which is clearly wrong.
One simple potential fix is to use a softmax (a smooth maximum) rather than a hard max, which then naturally favors A0 strongly, since it enables more future successor options, and would usually still favor A0 even if it only enabled the first nine B options.
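Here is a minimal sketch of that fix in Python (the specific numbers and the log-sum-exp form of the “softmax” backup are my own illustration, not anything from the original discussion). With a hard max backup, A0 and A1 both inherit B9’s noise-inflated estimate and look identical; a soft backup credits A0 for its nine other nearly-as-good options.

```python
import numpy as np

# Toy numbers for the A0/A1 example above (made up for illustration).
# Every B state has true utility around 1.0; the agent only sees noisy
# estimates, and B9 happens to have drawn the highest one purely by noise.
b_estimates = np.array([1.02, 0.95, 0.99, 1.01, 0.97, 1.03, 0.98, 1.00, 0.96, 1.10])

a0_successors = b_estimates        # A0 keeps all of B0..B9 reachable
a1_successors = b_estimates[[9]]   # A1 keeps only B9 reachable

def max_backup(estimates):
    # Standard Bellman-style backup: state value = max over successor estimates.
    return np.max(estimates)

def soft_backup(estimates, temperature=0.1):
    # Soft maximum (log-sum-exp) backup: close to the hard max, but gives
    # extra credit for having many nearly-as-good successor options.
    return temperature * np.logaddexp.reduce(np.asarray(estimates) / temperature)

print(max_backup(a0_successors), max_backup(a1_successors))    # 1.10 vs 1.10: identical
print(soft_backup(a0_successors), soft_backup(a1_successors))  # ~1.24 vs 1.10: A0 favored
```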
At a more abstract level, the solution is to recognize the value of future information and how it causes intermediate states with higher optionality to have more value: those states benefit the most from future information. In fact, at some point optionality/empowerment becomes the entirety of the corrected utility function, which is just another way of arriving at instrumental convergence to empowerment.
Interestingly, AIXI also makes the right decision here, even though it is an argmaxer, but only because it considers the full distribution over possible world-models and chooses the max expected utility decision only after averaging over all of those world-models. So it chooses A0, because in most worlds the winning pathways go through A0.
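A toy sketch of that model-averaging logic (this is not real AIXI; the uniform prior over ten hypothetical world-models is my own simplifying assumption): in world-model i only Bi actually pays off, so A1 only wins in the single world where B9 is the true winner.

```python
import numpy as np

# Hypothetical setup: ten candidate world-models, equally weighted.
# In world-model i, state Bi is the one that truly pays off (utility 1).
prior = np.full(10, 0.1)

# A0 keeps every B state reachable, so whichever world-model is true,
# the agent can still reach the winning B state and collect the payoff.
eu_a0 = np.sum(prior * 1.0)

# A1 commits to B9, so it collects the payoff only in the single
# world-model where B9 is actually the winner.
eu_a1 = prior[9] * 1.0

print(eu_a0, eu_a1)  # 1.0 vs 0.1: averaging over world-models favors A0
```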
Applying these lessons to human utility functions results in the realization that external empowerment is almost all we need.
Yep.
Empowerment is difficult in alignment contexts because humans are not rational utility maximizers. You risk empowering humans to make mistakes.
Also, taken too far, you run into problems with eudaimonia. We probably wouldn’t want AI to remove all challenge.
I mostly agree with that tradeoff: a perfect humanity-empowering agent could still result in sub-optimal futures if it empowers us and we then make mistakes, relative to what could be achieved by a theoretical fully aligned sovereign. But that really doesn’t seem so bad, and it also may not be likely, as empowering us probably entails helping us with better future modeling.
In practice the closest we may get to a fully aligned sovereign is some form of uploading, because practical strong optimization in full alignment with our brain’s utility function probably requires extracting and empowering functional equivalents to much of the brain’s valence/value circuits.
So the ideal scenario is probably AI that helps us upload and then hands over power.
It seems potentially extremely bad to me, since power could cause e.g. death, maiming or torture if wielded wrong.
“that” here refers to “a perfect humanity-empowering agent” which hands power over to humanity. In that sense it’s not that different from us advancing without AI. So if you think that’s extremely bad because you are assuming only a narrow subset of humanity is empowered, well, that isn’t what I meant by “a perfect humanity-empowering agent”. If you still think that’s extremely bad even if humanity is empowered broadly, then you seem to just think that humanity advancing without AI would be extremely bad. In that case I think you are expecting too much of your AI, and we have more fundamental disagreements.
Humans usually put up lots of restrictions that reduce empowerment in favor of safety. I think we can be excessive about such restrictions, but I don’t think they are always a bad idea, and instead think that if you totally removed them, you would probably make the world much worse. Examples of things that seem like a good idea to me:
Putting up fences to prevent falling off stairs, even though this disempowers you from jumping down the stairs.
Some restrictions on sale of dangerous drugs.
Electrical sockets designed so that they don’t expose high-voltage wires.
And the above are just things that are mainly designed to protect you from yourself. If we also count disempowering people to prevent them from harming others, then I support bans and limits on many kinds of weapon sales, and I think it would be absolutely terrible if an AI taught people a simple way to build a nuke in their garage.
Your examples are just examples of empowerment tradeoffs.
Fences that prevent you from falling off stairs can be empowering, because death and disability are (respectively maximally and extremely) disempowering.
Same with drugs and sockets. Precommitting to a restriction on your future ability to use some dangerous addictive drug can increase empowerment, because addiction is highly disempowering. I don’t think you are correctly modelling long-term empowerment.
I think that in order to generally model this as disempowering, you need a model of human irrationality; if you instead model humans as rational utility maximizers, we wouldn’t make the kind of simple, avoidable, major mistakes that we would need protection from.
But modelling human irrationality seems like a difficult and ill-posed problem, which contains most of the difficulty of the alignment problem.
The difficulty this leads to in practice is what to do when writing “empowerment” into your AI’s utility function: how do you specify that it is human-level rationality that must be empowered, rather than ideal utility maximizers?
My comment began as a discussion of why practical agents are not really utility argmaxers (due to the optimizer’s curse).
You do not need to model human irrationality and it is generally a mistake to do so.
Consider a child who doesn’t understand that the fence is there to prevent them from falling off the stairs. It would be a mistake to optimize for the child’s empowerment using their limited, irrational world model. It is correct to use the AI’s more powerful world model for computing empowerment, which results in putting up the fence (or equivalent) in situations where the AI models that as protecting the child from death or disability.
Likewise for the other scenarios.
I usually don’t consider this a problem, since I have different atomic building blocks for my value set.
However, if I were going to criticize it, I’d criticize the fact that inner-alignment issues incentivize it to deceive us.
It’s still an advance. If the core claims are correct, then it solves the entire outer alignment problem in one go, including Goodhart problems.
Now, I get the skepticism about this solution, because from the outside view, someone solving a major problem with their pet theory almost never happens, and a lot of such efforts have turned out not to work.
If you are talking about external empowerment, I wasn’t the first to write up that concept; that credit goes to Franzmeyer et al.[1] Admittedly my conception is a little different, and my writeup focuses more on the longer-term consequences, but they have the core idea there.
If you are talking about how empowerment arises naturally from just using correct decision-making under uncertainty, in situations where future value of information improves subsequent value estimates, then that idea may be more novel, and I’ll probably write it up if it isn’t so novel that it has non-epsilon AI capability value. (Some quick Google searches reveal some related “soft” decision RL approaches that seem similar.)
[1] Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. “Learning Altruistic Behaviours in Reinforcement Learning without External Rewards.” arXiv preprint arXiv:2107.09598 (2021).