I think JenniferRM’s comment regarding suicide raises a critical issue with human empowerment, one that I thought of before and talked with a few people about but never published. I figure I may as well write out my thoughts here since I’m probably not going to do a human empowerment research project (I almost did; this issue is one reason I didn’t).
The biggest problem I see with human empowerment is that humans do not always want to be maximally empowered at every point in time. Suicide is one clear example, but not the only one. Other examples I came up with include tourists who go on a submarine trip deep in the ocean, or environmentalists who volunteer to be tied to a tree as part of a protest. Fundamentally, the issue is that at some point, we want to be able to commit to a decision and its associated consequences, even if it comes at the cost of our empowerment.
There is even empirical support for this issue with human empowerment. In the paper Assistance Via Empowerment (https://proceedings.neurips.cc/paper/2020/file/30de9ece7cf3790c8c39ccff1a044209-Paper.pdf), the authors use a reinforcement learning agent trained with a mix of the original RL reward and a human empowerment term as a co-pilot on LunarLander, to help human agents land the LunarLander craft without crashing. They find that if the coefficient on the human empowerment term is too high, “the copilot tends to override the pilot and focus only on hovering in the air”. This is exactly the problem above; focusing only on empowerment (in a naive empowerment formulation) can easily lead to the AI preventing us from achieving certain goals we may wish to achieve. In the case of LunarLander in the paper, we want to land, but the AI may stop us, because by getting closer to the ground for landing, we’ve reduced our empowerment.
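A minimal sketch of this effect, under loud assumptions: this is not the AvE paper's setup, only a hypothetical one-dimensional lander in which the altitude is the whole state, landing is terminal, and the task reward is simply "lower is better". For deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states reachable in n steps, which is what `empowerment` computes. The names `step`, `reachable`, `copilot_action`, and the constants are invented for illustration.

```python
from math import log2

ACTIONS = (-1, 0, 1)  # descend, hover, climb
MAX_ALT = 10          # hypothetical ceiling for the toy world


def step(alt, action):
    """Deterministic toy dynamics: once landed (alt == 0), nothing changes."""
    if alt == 0:
        return 0
    return max(0, min(MAX_ALT, alt + action))


def reachable(alt, horizon):
    """Set of altitudes reachable after exactly `horizon` steps."""
    states = {alt}
    for _ in range(horizon):
        states = {step(s, a) for s in states for a in ACTIONS}
    return states


def empowerment(alt, horizon):
    """n-step empowerment in bits for deterministic dynamics."""
    return log2(len(reachable(alt, horizon)))


def copilot_action(alt, coeff, horizon=3):
    """Greedy copilot: assumed task reward (-altitude) plus coeff * empowerment."""
    def score(action):
        nxt = step(alt, action)
        return -nxt + coeff * empowerment(nxt, horizon)
    return max(ACTIONS, key=score)
```

In this toy, empowerment is zero on the ground and grows with altitude, so `copilot_action(3, 1.0)` picks descend while `copilot_action(3, 10.0)` picks hover: a large enough coefficient makes the copilot stay in the air even though the assumed task reward favors landing.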
It may be that current formulations of empowerment are too naive, and could possibly be reworked or extended to deal with this issue. E.g. you might try to have a human empowerment mode, and then a human assistance mode that focuses not on empowerment but on inferring the human’s goal and trying to assist with it; and then some higher level module detects when a human intends to commit to a course of action. But this seems problematic for many other reasons (including those covered in other discussions about alignment).
Overall, I like the idea of human empowerment, but greatly disagree with the idea that human empowerment (especially using the current simple math formulations I’ve seen) is all we need.
“The biggest problem I see with human empowerment is that humans do not always want to be maximally empowered at every point in time.”
Yes—often we face decisions between short term hedonic rewards vs long term empowerment (spending $100 on a nice meal, or your examples of submarine trips), and an agent optimizing purely for our empowerment would always choose long term empowerment over any short term gain (which can be thought of as ‘spending’ empowerment). This was discussed in some other comments and I think mentioned somewhere in the article but should be more prominent: empowerment is only a good bound of the long term component of utility functions, for some reasonable future time cutoff defining ‘long term’.
But I think modelling just the short term component of human utility is not nearly as difficult as accurately modelling the long term, so it’s still an important win. I didn’t investigate that much in the article, but that is why the title is now “Empowerment is (almost) all we need”.
Thanks for the link to the “Assistance via Empowerment” study; I hadn’t seen that before. Based on skimming the paper, I agree there are settings of the hyperparams where the empowerment copilot doesn’t help, but that is hardly surprising and doesn’t tell us much, since that is nearly always the case with ML systems. On a more general note, I think the lunar landing game has far too short a planning horizon to be in the regime where you get full convergence to empowerment. Hovering in the air only maximizes myopic empowerment. If you imagine a more complex real-world scenario where the lander has limited fuel, you crash if you run out of fuel, crashing results in death, and you can continue to live on a mission for years after landing, then it becomes more obvious that the optimal plan for empowerment converges to landing successfully and safely.
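The horizon argument can be sketched in a toy model. Everything here is a hypothetical illustration, not taken from the paper or the post: fuel acts as a countdown, running out of fuel while airborne is an absorbing crash, touchdown is followed by a fixed `EGRESS_DELAY` with no choices, and after that the crew can move freely along the ground. Empowerment is counted over coarse observations (altitude, ground position, or a single "egress"/"crashed" label), again using the fact that for deterministic dynamics it is the log of the number of distinct reachable outcomes.

```python
from math import log2

ACTIONS = (-1, 0, 1)  # air: descend/hover/climb; ground: move left/stay/right
EGRESS_DELAY = 3      # hypothetical: steps after touchdown before the crew can act


def step(state, action):
    """Deterministic toy dynamics over tagged tuples."""
    kind = state[0]
    if kind == "crashed":
        return state                      # absorbing
    if kind == "egress":
        remaining = state[1]
        return ("egress", remaining - 1) if remaining > 1 else ("ground", 0)
    if kind == "ground":
        return ("ground", state[1] + action)
    _, alt, fuel = state                  # kind == "air"
    alt, fuel = alt + action, fuel - 1    # every airborne step burns one unit of fuel
    if alt <= 0:
        return ("egress", EGRESS_DELAY)   # safe touchdown
    if fuel == 0:
        return ("crashed",)               # out of fuel while still airborne
    return ("air", alt, fuel)


def observe(state):
    """Coarse sensor reading: drop the fuel and egress counters."""
    if state[0] == "air":
        return ("air", state[1])
    if state[0] == "egress":
        return ("egress",)
    return state


def empowerment(state, horizon):
    """Bits of n-step empowerment: log2 of distinct reachable observations."""
    states = {state}
    for _ in range(horizon):
        states = {step(s, a) for s in states for a in ACTIONS}
    return log2(len({observe(s) for s in states}))


def best_first_action(state, horizon):
    """First action whose successor state has the highest empowerment."""
    return max(ACTIONS, key=lambda a: empowerment(step(state, a), horizon))
```

Starting from `("air", 2, 6)`, `best_first_action(..., 3)` returns climb, while `best_first_action(..., 12)` returns descend: with a short horizon the ground looks like a loss of options, but once the horizon extends past the fuel limit, staying aloft collapses into the single crashed outcome and landing early opens up the most reachable ground positions. Whether this carries over to richer settings depends on the formulation, which is the point under dispute in the thread.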
Thanks for your response—good points and food for thought there.
One of my points is that this is a problem which arises depending on your formulation of empowerment, and so you have to be very careful with the way in which you mathematically formulate and implement empowerment. If you use a naive implementation I think it is very likely that you get undesirable behaviour (and that’s why I linked the AvE paper as an example of what can happen).
Also related is that it’s tricky to define what the “reasonable future time cutoff” is. I don’t think this is trivial to solve: use too short of a cutoff, and your empowerment is too myopic. Use too long of a cutoff, and your model stops you from ever spending your money, and always gets you to hoard more money. If you use a hard-coded amount of time x, then you have edge cases around your cutoff time. You might need a dynamic time cutoff then, and I don’t think that’s trivial to implement.
I also disagree with the characterization of the issue in the AvE paper just being a hyperparameter issue. Correct me if I am wrong here (I may have misrepresented or misinterpreted the general gist of ideas and comments on this front), but I believe a key idea around human empowerment is that we can focus on maximally empowering humans, almost as if human empowerment were a “safe” target for optimization in some sense. I disagree with this idea, precisely because examples like the one in AvE show that too much human empowerment can be bad. The critical point I wanted to get across here is that human empowerment is not a safe target for optimization.
Also, the other key point related to examples like the submarine, the protest, and suicide is that empowerment can sometimes be in conflict with our reward/utility/desires. The suicide example illustrates this best (and it seems not too far-fetched to imagine someone who wants to commit suicide but can’t, and then feels increasingly worse, which seems like quite a nightmare scenario to me). Again, empowerment by itself isn’t enough to have desirable outcomes; you need some tradeoff with the utility/reward/desires of humans. Empowerment is hardly all (or almost all) that you need.
To summarize the points I wanted to get across:
1. Unless you are very careful with the specifics of your formulation of human empowerment, it very likely will result in bad outcomes. There are lots of implementation details to be considered (even beyond everything you mentioned in your post).
2. Human empowerment is not a safe target for optimization/maximization. I think this holds even if you have a careful definition of human empowerment (though I would be very happy to be proven wrong on this).
3. Human empowerment can be in conflict with human utility/desires, best illustrated by the suicide example. Therefore, I think human empowerment could be helpful for alignment, but am very skeptical it is almost all you need.
Edit: I just realized that other commenters have pointed out similar lines of reasoning to my third point. I think this is a critical issue with the human empowerment framework and want to highlight it a bit more, specifically JenniferRM’s suicide example, which I think most vividly demonstrates the issue (my scenarios also point to the same issue, but aren’t as clear a demonstration of the problem).
Thanks, I partially agree so I’m going to start with the most probable crux:
“Empowerment can be in conflict with human utility/desires, best illustrated by the suicide example. Therefore, I think human empowerment could be helpful for alignment, but am very skeptical it is almost all you need.”
I am somewhat confident that any fully successful alignment technique (one resulting in a fully aligned CEV-style sovereign) will prevent suicide; that this is a necessarily convergent result; and that the fact that maximizing human empowerment agrees with the ideal alignment solution on suicide is actually a key litmus test that it passes. In other words, I fully agree with you on the importance of the suicide case, but this evidence is in favor of human empowerment converging to CEV.
I have a few somewhat independent arguments for why CEV necessarily converges to suicide prevention:
The simple counterfactual argument: Consider the example of happy, well-adjusted, but unlucky Bob, whose brain is struck by a cosmic ray that happens to cause a benign tumor in just the right spot to make him completely suicidal. Clearly pre-accident Bob (Bob0) would not choose this future, and strongly desires interventions to prevent the cosmic ray. Any agent successfully aligned to Bob0 would agree. It also should not matter when the cosmic ray struck: the desire of Bob0 to live outweighs the desire of post-accident Bob (Bob1) to die. Furthermore, if Bob1 had the option of removing all effects of the cosmic-ray-induced depression, he would probably take that option. Suicidal thinking is caused by suffering (via depression, physical pain, etc.), and most people (nearly all people?) would take an option to eliminate their suffering without dying, if only said option existed (and they believed it would work).
Counterfactual intra-personal CEV coherence: A suicidal agent is, by definition, one that assigns higher utility to future worlds where they no longer exist than to future worlds where they do exist. Now consider the multiverse of all possible versions of Bob. The suicidal versions of Bob rank their worlds as lower utility than other worlds without them, and the non-suicidal versions of Bob rank their worlds as higher than worlds where they commit suicide. Any properly aligned CEV-style sovereign will then simply notice that the utility functions of the suicidal and non-suicidal Bobs already largely agree, even before any complex convergence considerations! The CEV sovereign can satisfy both of their preferences by increasing the measure of worlds containing happy Bobs, and decreasing the measure of worlds containing suicidal Bobs. So it intervenes to prevent the cosmic ray, and more generally intervenes to prevent suicidal thought modes. Put another way, it can cause suicidal Bob to cease to exist (or exist less in the measure sense) without killing suicidal Bob.
Scaling intelligence trends towards lower discount rates: The purpose of aligned AI is to aid in optimizing the universe according to our utility function. As an agent absorbs more knowledge and improves their ability to foresee and steer the future this naturally leads to a lower discount rate (as discount rates arise from planning uncertainty). So improving our ability to foresee and steer the future will naturally lower our discount rate, making us more longtermist, and thus naturally increasing the convergence of our unknown utility function towards empowerment (which is non-suicidal).
Inter-personal CEV coherence: Most humans are non-suicidal and prefer that other humans are non-suicidal. At the limits of convergence, where many futures are simulated and those myriad future selves eventually cohere into agreement, this naturally leads to suicide prevention, because most surviving future selves are non-suicidal and even weak preferences that others do not commit suicide will eventually dominate the coherent utility function over spacetime. We can consider this a generalization of intra-personal CEV coherence, because the boundary separating all the alternate versions of ourselves across the multiverse from the alternate versions of other people is soft and illusive.
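The third argument above links discount rates to planning uncertainty. One standard way to make that link concrete, offered here as an illustrative assumption rather than something stated in the thread, is to suppose that a plan has a fixed per-step probability $p$ of being invalidated by unforeseen events. The symbols $p$, $r_t$, and $\gamma$ are introduced only for this sketch:

```latex
\mathbb{E}\left[\sum_{t=0}^{\infty} r_t \,\mathbf{1}[\text{plan still valid at step } t]\right]
  = \sum_{t=0}^{\infty} (1-p)^t \, r_t
  = \sum_{t=0}^{\infty} \gamma^t \, r_t ,
\qquad \gamma = 1 - p ,
\qquad \text{effective horizon} \approx \frac{1}{1-\gamma} = \frac{1}{p} .
```

Under this reading, better foresight corresponds to a smaller $p$, so $\gamma$ approaches 1 and the effective horizon $1/p$ grows: $p = 0.1$ gives roughly 10 steps, while $p = 0.001$ gives roughly 1000.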
Now back to your other points:
“Unless you are very careful with the specifics of your formulation of human empowerment, it very likely will result in bad outcomes. I see the simple mathematical definition of empowerment, followed by abstract discussion of beneficial properties. I think this skips too much in terms of the specifics of implementation, and would like to see more discussion on that front.”
I largely agree, albeit with less confidence. This article is a rough abstract sketch of a complex topic. I have some more thoughts on how empowerment arises naturally, and some math and examples but that largely came after this article.
“Human empowerment is not a safe target for optimization/maximization. I think this holds even if you have a careful definition of human empowerment (though I would be very happy to be proven wrong on this).”
I agree that individual human empowerment is incomplete for some of the reasons discussed, but I do expect that any correct implementation of something like CEV will probably result in a very longtermist agent to which the instrumental convergence to empowerment applies with fewer caveats. Thus there exists a definition of broad empowerment such that it is a safe bound on that ideal agent’s unknown utility function.
“Also related is that it’s tricky to define what the ‘reasonable future time cutoff’ is. I don’t think this is trivial to solve: use too short of a cutoff, and your empowerment is too myopic. Use too long of a cutoff, and your model stops you from ever spending your money, and always gets you to hoard more money.”
Part of the big issue here is that humans die, so our individual brain empowerment eventually falls off a cliff, and this bounds our discount rate (we also run into brain capacity and decay problems which further compound the issue). Any aligned CEV sovereign is likely to focus on fixing that problem, i.e. through uploading and the post-biological transition. Posthumans in any successful utopia will be potentially immortal and thus are likely to have lower and decreasing discount rates.
Also I think most examples of ‘spending’ empowerment are actually examples of conversion between types of empowerment. Spending money on social events with friends is mostly an example of a conversion between financial empowerment and social empowerment. The submarine example is also actually an example of trading financial empowerment for social empowerment (it’s a great story and experience to share with others) and curiosity/knowledge.
All that said, I do think there are genuine examples of pure short-term reward vs empowerment tradeoff decisions, such as buying an expensive meal you eat at home alone. These are mostly tradeoffs between hedonic rewards and long-term empowerment, and they don’t apply so much to posthumans (who can have essentially any hedonic reward at any time for free).
“I also disagree with the characterization of the issue in the AvE paper just being a hyperparameter issue.”
This one I don’t understand. The AvE paper trained an empowerment copilot. For some range of hyperparams the copilot helped the human by improving their ability to land successfully (usually by stabilizing the vehicle to make it more controllable). For another range of hyperparams the copilot instead hovered in the air, preventing a landing. It’s just a hyperparam issue because it does work as intended in this example with the right hyperparams. At a higher level though this doesn’t matter much because results from this game don’t generalize to reality—the game is too short.