In fact at some point optionality/empowerment becomes the entirety of the corrected utility function, which is just another way of arriving at instrumental convergence to empowerment.
Yep.
Applying these lessons to human utility functions results in the realization that external empowerment is almost all we need.
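To pin down terms (using the standard formalization from the RL literature, not anything specific to this exchange): empowerment at a state is usually defined as the channel capacity between an agent's next n actions and the state they lead to,

$$\mathfrak{E}_n(s_t) \;=\; \max_{p(a_t,\dots,a_{t+n-1})} I\!\left(A_t,\dots,A_{t+n-1};\, S_{t+n} \,\middle|\, s_t\right),$$

i.e. how much the agent's choices can still shape its future. "External empowerment" then just means the AI optimizing this quantity (or a cheaper proxy, such as the number of states the human can still reach) for the human rather than for itself.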
Empowerment is difficult in alignment contexts because humans are not rational utility maximizers. You risk empowering humans to make mistakes.
Also taken too far you run into problems with Eudaimonia. We probably wouldn’t want AI to remove all challenge.
I mostly agree with that tradeoff: a perfect humanity empowering agent could still result in sub-optimal futures if it empowers us and we then make mistakes, versus what could be achieved by a theoretical fully aligned sovereign. But that really doesn’t seem so bad, and it also may not be likely, as empowering us probably entails helping us with better future modeling.
In practice the closest we may get to a fully aligned sovereign is some form of uploading, because practical strong optimization in full alignment with our brain’s utility function probably requires extracting and empowering functional equivalents to much of the brain’s valence/value circuits.
So the ideal scenario is probably AI that helps us upload and then hands over power.
It seems potentially extremely bad to me, since power could cause e.g. death, maiming or torture if wielded wrong.
“that” here refers to “a perfect humanity empowering agent” which hands power over to humanity. In that sense it’s not that different from us advancing without AI. So if you think that’s extremely bad because you are assuming only a narrow subset of humanity is empowered, well, that isn’t what I meant by “a perfect humanity empowering agent”. If you still think that’s extremely bad even if humanity is empowered broadly then you seem to just think that humanity advancing without AI would be extremely bad. In that case I think you are expecting too much of your AI and we have more fundamental disagreements.
Humans usually put up lots of restrictions that reduce empowerment in favor of safety. I think we can be excessive about such restrictions, but I don’t think they are always a bad idea, and instead think that if you totally removed them, you would probably make the world much worse. Examples of things that seem like a good idea to me:
Putting up fences to prevent falling off stairs, even though this disempowers you from jumping down the stairs.
Some restrictions on sale of dangerous drugs.
Electrical sockets are designed to not lead to exposed high-voltage wires.
And the above are just things that are mainly designed to protect you from yourself. If we also count disempowering people to prevent them from harming others, then I support bans and limits on many kinds of weapon sales, and I think it would be absolutely terrible if an AI taught people a simple way to build a nuke in their garage.
Your examples are all just empowerment tradeoffs.
Fences that prevent you from falling off stairs can be empowering, because death is maximally disempowering and disability extremely so.
Same with drugs and sockets. Precommitting to a restriction on your future ability to use some dangerous addictive drug can increase empowerment, because addiction is highly disempowering. I don’t think you are correctly modelling long-term empowerment.
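As a toy sketch of that long-horizon point (made-up numbers, with a crude log-of-available-choices proxy standing in for empowerment; the expected_empowerment helper below is purely illustrative, not anyone's proposed objective):

```python
import math

def expected_empowerment(horizon, options_per_step, p_fall):
    """Expected sum of per-step log(choices), where an accidental fall
    (probability p_fall per step) leads to an absorbing 'injured' state
    with a single option, i.e. log(1) = 0 empowerment from then on."""
    total, p_ok = 0.0, 1.0
    for _ in range(horizon):
        total += p_ok * math.log(options_per_step)
        p_ok *= 1.0 - p_fall
    return total

# No fence: one extra immediate option (you can step off the edge), but a small
# per-step chance of an accidental fall. Fence: one fewer option, no fall risk.
for horizon in (1, 10_000):
    no_fence = expected_empowerment(horizon, options_per_step=5, p_fall=1e-3)
    fence = expected_empowerment(horizon, options_per_step=4, p_fall=0.0)
    print(f"horizon {horizon:>6}:  no fence {no_fence:8.1f}   fence {fence:8.1f}")
# At horizon 1 the extra option wins (1.6 vs 1.4); at long horizons the fence wins
# by a wide margin, because the absorbing state destroys all future optionality.
```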
I think in order to generally model this as disempowering, you need a model of human irrationality, since if you instead model humans as rational utility maximizers, we wouldn’t make the simple, avoidable mistakes that we would need protection from.
But modelling human irrationality seems like a difficult and ill-posed problem, which contains most of the difficulty of the alignment problem.
The difficulty this leads to in practice is what to do when writing “empowerment” into the utility function of your AI: how do you specify that it is humans with human-level rationality who must be empowered, rather than ideal utility maximizers?
My comment began as a discussion of why practical agents are not really utility argmaxers (due to the optimizer’s curse).
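For concreteness, the optimizer's curse is easy to see in a quick simulation (arbitrary numbers, purely illustrative):

```python
import random

random.seed(0)
n_options, noise_sd, n_trials = 20, 1.0, 10_000
true_value = 0.0  # every option is actually worth the same

gap = 0.0
for _ in range(n_trials):
    # Noisy value estimates; argmax picks whichever option got the luckiest noise.
    estimates = [true_value + random.gauss(0.0, noise_sd) for _ in range(n_options)]
    gap += max(estimates) - true_value

print(f"average overestimate of the chosen option: {gap / n_trials:.2f}")
# ~1.9 standard deviations: taking the argmax of noisy value estimates
# systematically selects options whose value was overestimated.
```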
You do not need to model human irrationality, and it is generally a mistake to do so.
Consider a child who doesn’t understand that the fence is there to prevent them from falling off the stairs. It would be a mistake to optimize for the child’s empowerment using their limited, irrational world model. It is correct to use the AI’s more powerful world model for computing empowerment, which results in putting up the fence (or equivalent) in situations where the AI models it as protecting the child from death or disability.
Likewise for the other scenarios.
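One way to write that distinction down, reusing the empowerment notation above (this is just a formalization of the point, with the superscript marking which world model supplies the transition probabilities):

$$\mathfrak{E}^{M}_n(s) \;=\; \max_{p(a_t,\dots,a_{t+n-1})} I_{M}\!\left(A_t,\dots,A_{t+n-1};\, S_{t+n} \,\middle|\, s\right),$$

where the actions are the child's. The claim is that the AI should optimize $\mathfrak{E}^{M_{\mathrm{AI}}}_n$, the child's action channel evaluated under the AI's model, rather than $\mathfrak{E}^{M_{\mathrm{child}}}_n$, the same channel evaluated under the child's mistaken model of what the fence and the stairs do.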
Also taken too far you run into problems with Eudaimonia. We probably wouldn’t want AI to remove all challenge.
I usually don’t consider this a problem, since I have different atomic building blocks for my value set.
However, if I were going to criticize it, I’d criticize the fact that inner-alignment issues incentivize it to deceive us.
It’s still an advance. If the core claims are correct, then it solves the entire outer alignment problem in one go, including Goodhart problems.
Now I get the skepticism about this solution, because from the outside view, someone solving a major problem with a pet theory almost never happens, and a lot of past efforts have turned out not to work.
If you are talking about external empowerment, I wasn’t the first to write up that concept; that credit goes to Franzmeyer et al.[1] Admittedly my conception is a little different and my writeup focuses more on the longer-term consequences, but they have the core idea there.
If you are talking about how empowerment arises naturally from just using correct decision making under uncertainty in situations where you have future value of information that improves subsequent future value estimates, then that idea may be more novel, and I’ll probably write it up if it isn’t so novel that it has non-epsilon AI capability value. (Some quick Google searches reveal some related ‘soft’ decision RL approaches that seem similar.)
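A minimal sketch of that decision-under-uncertainty claim, as a deliberately tiny two-stage toy with arbitrary numbers (the 0.1 option-preservation cost is made up):

```python
# The agent is uncertain which of two outcomes its principal actually values
# (50/50 prior, value 1 vs 0), and will learn which before the final choice.
p_a_is_good = 0.5
keep_options_cost = 0.1  # small price paid now to keep both outcomes reachable

# Commit now: lock in outcome A before the information arrives.
value_commit_now = p_a_is_good * 1.0 + (1 - p_a_is_good) * 0.0

# Keep options: wait for the information, then pick whichever outcome is good.
value_keep_options = 1.0 - keep_options_cost

print(f"commit to A now:       {value_commit_now:.2f}")   # 0.50
print(f"preserve both options: {value_keep_options:.2f}")  # 0.90
# Plain expected-value maximization already prefers the option-preserving action
# once future information is in the picture; no explicit empowerment term needed.
```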
[1] Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. “Learning Altruistic Behaviours in Reinforcement Learning without External Rewards.” arXiv preprint arXiv:2107.09598 (2021).