Like, one thing I’d say here is that a “benevolent” utility function like this could be close enough to correct that it causes humans to be half-preserved and horrifically mutilated, with our parodic simulacra made happy, rather than us just being murdered and having our atoms stolen from us :-)
So I think we can mostly rule this out, but perhaps I didn’t find the most succinct form of the argument.
Assume human values (for most humans) can be closely approximated by some unknown utility function with some unknown discount schedule, $\sum_{t=0}^{\infty} d(t)\,V(w_t)$, which we can normally take to use standard exponential discounting: $\sum_{t=0}^{\infty} \beta^t V(w_t)$.

The convergence-to-empowerment theorems indicate that there exists a power function $P(w_t)$ that is a universal approximator, in the sense that optimizing future world state trajectories for $P(w_t)$ using a planning function $f()$ is the same as optimizing future world state trajectories for the true value function $V(w_t)$:

$$\lim_{\beta \to 1} f\!\left(\sum_{t=0}^{\infty} \beta^t P(w_t)\right) \approx f\!\left(\sum_{t=0}^{\infty} \beta^t V(w_t)\right)$$

for a wide class of value functions and sufficiently long-term discount rates, a class that seems to include or overlap the human range.
So it seems impossible that optimizing for empowerment would cause “humans to be half-preserved and horrifically mutilated” unless that is the natural path of long term optimizing for our current values. Any such failure is not a failure of optimizing for empowerment, but a failure in recognizing future self—which is a real issue, but it’s an issue any real implementation has to deal with regardless of the utility function, and it’s something humans aren’t perfectly clear on (consider all the debate around whether uploading preserves identity).
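For concreteness: in the special case of deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states reachable in n steps. A minimal sketch (the toy world and function names are my own illustration, not anything from the post):

```python
import math

def empowerment(state, step_fn, actions, horizon):
    """n-step empowerment under deterministic dynamics: log2 of the
    number of distinct states reachable in `horizon` steps."""
    frontier = {state}
    for _ in range(horizon):
        frontier = {step_fn(s, a) for s in frontier for a in actions}
    return math.log2(len(frontier))

# Toy 1-D world: agent at an integer position, walls at 0 and 10.
def step(pos, action):  # action in {-1, 0, +1}
    return min(10, max(0, pos + action))

# A central state can reach more states than one pinned against a wall,
# so it has higher empowerment.
print(empowerment(5, step, (-1, 0, 1), 3))  # center: 7 reachable states, ~2.81 bits
print(empowerment(0, step, (-1, 0, 1), 3))  # wall:   4 reachable states,  2.0 bits
```

The "half-preserved and mutilated" outcome scores terribly under this kind of measure, since mutilation collapses the set of futures the human can steer toward.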
There are concepts like the last man and men without chests in various philosophies that imagine “a soul of pure raw optimization” as a natural tendency… and also a scary tendency.
The explicit fear is that simple hill climbing, by cultures, by media, by ads, by pills, by schools, by <whatever>… might lead to losing some kind of sublime virtue?
Also, it is almost certain that current humans are broken/confused, are not actually VNM-rational, and don’t actually have a true utility function. Observe: we are Dutch-booked all the time! Maybe that is only because our “probabilities” are broken? But I bet our utility function is broken too.
And so I hear a proposal to “assume human values (for most humans) can be closely approximated by some unknown utility function” and I’m already getting off the train (or sticking around because maybe the journey will be informative).
I have a prediction. I think an “other empowerment maximizing AGI” will have a certain predictable reaction if I ultimately decide that this physics is a subtle (or not so subtle) hellworld, or at least just not for me, and “I don’t consent to be in it”, and so I want to commit suicide, probably with a ceremony and some art.
What do you think would be the thing’s reaction if, after 500 years of climbing mountains and proving theorems and skiing on the moons of Saturn (and so on), I finally said “actually, nope” and tried to literally zero out “my empowerment”?
The explicit fear is that simple hill climbing, by cultures, by media, by ads, by pills, by schools, by <whatever>… might lead to losing some kind of sublime virtue?
Seems doubtful given that simple hill climbing for inclusive fitness generated all that complexity.
Also, it is almost certain that current humans are broken/confused, and are not actually VNM rational, and don’t actually have a true utility function.
Maybe, but behavioral empowerment still seems to pretty clearly apply to humans and explains our intrinsic motivation systems. I also hesitate to simplify human brains down to simple equations, but sometimes it’s a nice way to make points.
What do you think would be the thing’s reaction if, after 500 years of climbing mountains and proving theorems and skiing on the moons of Saturn (and so on), I finally said “actually, nope” and tried to literally zero out “my empowerment”?
Predictably, if the thing is optimizing solely for your empowerment, it would not want you to ever give up. However if the AGI has already heavily empowered you into a posthuman state its wishes may no longer matter.
If the AGI is trying to empower all of humanity/posthumanity then there also may be variants of that where it’s ok with some amount of suicide as that doesn’t lower the total empowerment of the human system much.
I think JenniferRM’s comment regarding suicide raises a critical issue with human empowerment, one that I thought of before and talked with a few people about but never published. I figure I may as well write out my thoughts here since I’m probably not going to do a human empowerment research project (I almost did; this issue is one reason I didn’t).
The biggest problem I see with human empowerment is that humans do not always want to be maximally empowered at every point in time. The suicide example is the starkest, but not the only one. Other examples I came up with include: tourists who go on a submarine trip deep in the ocean, or environmentalists who volunteer to be tied to a tree as part of a protest. Fundamentally, the issue is that at some point, we want to be able to commit to a decision and its associated consequences, even if it comes at the cost of our empowerment.
There is even empirical support for this issue with human empowerment. In the paper Assistance Via Empowerment (https://proceedings.neurips.cc/paper/2020/file/30de9ece7cf3790c8c39ccff1a044209-Paper.pdf), the authors use a reinforcement learning agent trained with a mix of the original RL reward and a human empowerment term as a co-pilot on LunarLander, to help human agents land the LunarLander craft without crashing. They find that if the coefficient on the human empowerment term is too high, “the copilot tends to override the pilot and focus only on hovering in the air”. This is exactly the problem above; focusing only on empowerment (in a naive empowerment formulation) can easily lead to the AI preventing us from achieving certain goals we may wish to achieve. In the case of LunarLander in the paper, we want to land, but the AI may stop us, because by getting closer to the ground for landing, we’ve reduced our empowerment.
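The failure mode the AvE authors describe can be sketched schematically. Note the coefficient name `alpha`, the reward decomposition, and the numbers below are my gloss on the shape of their objective, not the paper's actual code or values:

```python
def copilot_objective(task_reward, human_empowerment, alpha):
    """Schematic shape of an AvE-style mixed objective: the environment's
    task reward plus a human-empowerment bonus weighted by alpha."""
    return task_reward + alpha * human_empowerment

# Stylized, made-up numbers: landing scores task reward but leaves the
# pilot with little residual optionality; hovering is the reverse.
options = {
    "land":  {"task_reward": 100.0, "human_empowerment": 0.0},
    "hover": {"task_reward": 0.0,   "human_empowerment": 5.0},
}

def best_action(alpha):
    return max(options, key=lambda k: copilot_objective(alpha=alpha, **options[k]))

print(best_action(alpha=1.0))   # "land": the task reward dominates
print(best_action(alpha=50.0))  # "hover": the empowerment term dominates
```

With a large enough weight on the empowerment term, "hover forever" beats "land" by construction, which matches the paper's observation that the copilot overrides the pilot.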
It may be that current formulations of empowerment are too naive, and could possibly be reworked or extended to deal with this issue. E.g. you might try to have a human empowerment mode, and then a human assistance mode that focuses not on empowerment but on inferring the human’s goal and trying to assist with it; and then some higher level module detects when a human intends to commit to a course of action. But this seems problematic for many other reasons (including those covered in other discussions about alignment).
Overall, I like the idea of human empowerment, but greatly disagree with the idea that human empowerment (especially using the current simple math formulations I’ve seen) is all we need.
The biggest problem I see with human empowerment is that humans do not always want to be maximally empowered at every point in time.
Yes—often we face decisions between short term hedonic rewards vs long term empowerment (spending $100 on a nice meal, or your examples of submarine trips), and an agent optimizing purely for our empowerment would always choose long term empowerment over any short term gain (which can be thought of as ‘spending’ empowerment). This was discussed in some other comments and I think mentioned somewhere in the article but should be more prominent: empowerment is only a good bound of the long term component of utility functions, for some reasonable future time cutoff defining ‘long term’.
But I think modelling just the short term component of human utility is not nearly as difficult as accurately modelling the long term, so it’s still an important win. I didn’t investigate that much in the article, but that is why the title is now “Empowerment is (almost) all we need”.
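The split gestured at here can be written down directly: approximate the unknown discounted utility with a modelled value for the short term and the empowerment proxy past a cutoff. A sketch in my own notation (not from the article):

```python
def hybrid_objective(values, empowerment, beta, cutoff):
    """Approximate sum_t beta^t V(w_t) with a modelled short-term value
    for t < cutoff and the empowerment proxy P for t >= cutoff
    (the 'long term' component that empowerment is claimed to bound)."""
    total = 0.0
    for t in range(cutoff):
        total += beta ** t * values[t]          # short term: model V directly
    for t in range(cutoff, len(empowerment)):
        total += beta ** t * empowerment[t]     # long term: bound V by P
    return total

# Example: model the first two steps' values, fall back to empowerment after.
print(hybrid_objective([1.0, 1.0], [0.0, 0.0, 2.0, 2.0], beta=0.5, cutoff=2))  # 2.25
```

The hard modelling burden then falls only on the first `cutoff` steps, which is the claimed win.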
Thanks for the link to the “Assistance via Empowerment” study, I hadn’t seen that before. Based on skimming the paper I agree there are settings of the hyperparams where the empowerment copilot doesn’t help, but that is hardly surprising and doesn’t tell us much—that is nearly always the case with ML systems. On a more general note I think the lunar landing game has far too short a planning horizon to be in the regime where you get full convergence to empowerment. Hovering in the air only maximizes myopic empowerment. If you imagine a more complex real world scenario where the lander has limited fuel, you crash if you run out of fuel, crashing results in death, you can continue to live on a mission for years after landing, etc., it then becomes more obvious that the optimal plan for empowerment converges to landing successfully and safely.
Thanks for your response—good points and food for thought there.
One of my points is that this is a problem which arises depending on your formulation of empowerment, and so you have to be very careful with the way in which you mathematically formulate and implement empowerment. If you use a naive implementation I think it is very likely that you get undesirable behaviour (and that’s why I linked the AvE paper as an example of what can happen).
Also related is that it’s tricky to define what the “reasonable future time cutoff” is. I don’t think this is trivial to solve: use too short a cutoff, and your empowerment is too myopic. Use too long a cutoff, and your model stops you from ever spending your money and always gets you to hoard more. If you use a hard-coded cutoff time, then you have edge cases around that cutoff. You might need a dynamic time cutoff, and I don’t think that’s trivial to implement.
I also disagree with the characterization of the issue in the AvE paper as just being a hyperparameter issue. Correct me if I am wrong here (as I may have misinterpreted the general gist of ideas and comments on this front): I believe a key idea around human empowerment is that we can focus on maximally empowering humans—almost as if human empowerment were a “safe” target for optimization in some sense. I disagree with this idea, precisely because examples like AvE show that too much human empowerment can be bad. The critical point I wanted to get across is that human empowerment is not a safe target for optimization.
Also, the other key point related to the examples like the submarine, protest, and suicide is that empowerment can sometimes be in conflict with our reward/utility/desires. The suicide example is the best illustrator of this (and it seems not too far-fetched to imagine someone who wants to commit suicide, but can’t, and then feels increasingly worse—which seems like quite a nightmare scenario to me). Again, empowerment by itself isn’t enough to have desirable outcomes; you need some tradeoff with the utility/reward/desires of humans—empowerment is hardly all (or almost all) that you need.
To summarize the points I wanted to get across:
Unless you are very careful with the specifics of your formulation of human empowerment, it very likely will result in bad outcomes. There are lots of implementation details to be considered (even beyond everything you mentioned in your post).
Human empowerment is not a safe target for optimization/maximization. I think this holds even if you have a careful definition of human empowerment (though I would be very happy to be proven wrong on this).
Human empowerment can be in conflict with human utility/desires, best illustrated by the suicide example. Therefore, I think human empowerment could be helpful for alignment, but am very skeptical it is almost all you need.
Edit: I just realized there are some other comments by other commenters that point out similar lines of reasoning to my third point. I think this is a critical issue with the human empowerment framework and want to highlight it a bit more, specifically highlighting JenniferRM’s suicide example which I think is the example that most vividly demonstrates the issue (my scenarios also point to the same issue, but aren’t as clear of a demonstration of the problem).
Thanks, I partially agree so I’m going to start with the most probable crux:
Empowerment can be in conflict with human utility/desires, best illustrated by the suicide example. Therefore, I think human empowerment could be helpful for alignment, but am very skeptical it is almost all you need.
I am somewhat confident that any fully successful alignment technique (one resulting in a fully aligned CEV style sovereign) will prevent suicide; that this is a necessarily convergent result; and that the fact that maximizing human empowerment agrees with the ideal alignment solution on suicide is actually a key litmus test success result. In other words I fully agree with you on the importance of the suicide case, but this evidence is in favor of human empowerment convergence to CEV.
I have a few somewhat independent arguments of why CEV necessarily converges to suicide prevention:
The simple counterfactual argument: Consider the example of happy, well-adjusted, but unlucky Bob, whose brain is struck by a cosmic ray which happens to cause some benign tumor in just the correct spot to make him completely suicidal. Clearly pre-accident Bob would not choose this future, and strongly desires interventions to prevent the cosmic ray. Any agent successfully aligned to pre-accident Bob0 would agree. It also should not matter when the cosmic ray struck—the desire of Bob0 to live outweighs the desire of Bob1 to die. Furthermore—if Bob1 had the option of removing all effects of the cosmic ray induced depression, he would probably take that option. Suicidal thinking is caused by suffering—via depression, physical pain, etc—and most people (nearly all people?) would take an option to eliminate their suffering without dying, if only said option existed (and they believed it would work).
Counterfactual intra-personal CEV coherence: A suicidal agent is one—by definition—that assigns higher utility to future worlds where they no longer exist than future worlds where they do exist. Now consider the multiverse of all possible versions of Bob. The suicidal versions of Bob rank their worlds as lower utility than other worlds without them, and the non-suicidal versions of Bob rank their worlds as higher than worlds where they commit suicide. Any proper aligned CEV style sovereign will then simply notice that the utility functions of the suicidal and non-suicidal Bobs already largely agree, even before any complex convergence considerations! The CEV sovereign can satisfy both of their preferences by increasing the measure of worlds containing happy Bobs, and decreasing the measure of worlds containing suicidal Bobs. So it intervenes to prevent the cosmic ray, and more generally intervenes to prevent suicidal thought modes. Put another way—it can cause suicidal Bob to cease to exist (or exist less in the measure sense) without killing suicidal Bob.
Scaling intelligence trends towards lower discount rates: The purpose of aligned AI is to aid in optimizing the universe according to our utility function. As an agent absorbs more knowledge and improves their ability to foresee and steer the future this naturally leads to a lower discount rate (as discount rates arise from planning uncertainty). So improving our ability to foresee and steer the future will naturally lower our discount rate, making us more longtermist, and thus naturally increasing the convergence of our unknown utility function towards empowerment (which is non-suicidal).
Inter-personal CEV coherence: Most humans are non-suicidal and prefer that other humans are non-suicidal. At the limits of convergence, where many futures are simulated and those myriad future selves eventually cohere into agreement, this naturally leads to suicide prevention: because most surviving future selves are non-suicidal, even weak preferences that others not commit suicide will eventually dominate the coherent utility function over spacetime. We can consider this a generalization of intra-personal CEV coherence, because the boundary separating all the alternate versions of ourselves across the multiverse from the alternate versions of other people is soft and elusive.
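The "discount rates arise from planning uncertainty" step in the scaling-intelligence argument can be made concrete: if each step of a plan independently succeeds with probability s, a reward promised t steps out is worth r·sᵗ in expectation, which has exactly the form of exponential discounting with β = s. So better foresight (s → 1) is the same thing as a lower discount rate (β → 1). A minimal check (my own illustration):

```python
def expected_delayed_reward(reward, per_step_success, t):
    """Expected value of a reward t steps away when each plan step
    independently succeeds with probability per_step_success.
    Identical in form to exponential discounting with beta = per_step_success."""
    return reward * per_step_success ** t

# A more capable planner (higher per-step success) discounts the future less:
print(expected_delayed_reward(100, 0.90, 10))  # ~34.9
print(expected_delayed_reward(100, 0.99, 10))  # ~90.4
```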
Now back to your other points:
Unless you are very careful with the specifics of your formulation of human empowerment, it very likely will result in bad outcomes. I see the simple mathematical definition of empowerment, followed by abstract discussion of beneficial properties. I think this skips too much in terms of the specifics of implementation, and would like to see more discussion on that front.
I largely agree, albeit with less confidence. This article is a rough abstract sketch of a complex topic. I have some more thoughts on how empowerment arises naturally, and some math and examples but that largely came after this article.
Human empowerment is not a safe target for optimization/maximization. I think this holds even if you have a careful definition of human empowerment (though I would be very happy to be proven wrong on this).
I agree that individual human empowerment is incomplete for some of the reasons discussed, but I do expect that any correct implementation of something like CEV will probably result in a very longtermist agent to which the instrumental convergence to empowerment applies with fewer caveats. Thus there exists a definition of broad empowerment such that it is a safe bound on that ideal agent’s unknown utility function.
Also related is that it’s tricky to define what the “reasonable future time cutoff” is. I don’t think this is trivial to solve—use too short of a cutoff, and your empowerment is too myopic. Use too long of a cut-off, and your model stops you from ever spending your money, and always gets you to hoard more money.
Part of the big issue here is that humans die—so our individual brain empowerment eventually falls off a cliff and this bounds our discount rate (we also run into brain capacity and decay problems which further compound the issue). Any aligned CEV sovereign is likely to focus on fixing that problem—ie through uploading and the post biological transition. Posthumans in any successful utopia will be potentially immortal and thus are likely to have lower and decreasing discount rates.
Also I think most examples of ‘spending’ empowerment are actually examples of conversion between types of empowerment. Spending money on social events with friends is mostly an example of a conversion between financial empowerment and social empowerment. The submarine example is also actually an example of trading financial empowerment for social empowerment (it’s a great story and experience to share with others) and curiosity/knowledge.
All that said I do think there are actual true examples of pure short term rewards vs empowerment tradeoff decisions—such as buying an expensive meal you eat at home alone. These are mostly tradeoffs between hedonic rewards vs long term empowerment, and they don’t apply so much to posthumans (who can have essentially any hedonic reward at any time for free).
I also disagree with the characterization of the issue in the AvE paper just being a hyperparameter issue.
This one I don’t understand. The AvE paper trained an empowerment copilot. For some range of hyperparams the copilot helped the human by improving their ability to land successfully (usually by stabilizing the vehicle to make it more controllable). For another range of hyperparams the copilot instead hovered in the air, preventing a landing. It’s just a hyperparam issue because it does work as intended in this example with the right hyperparams. At a higher level though this doesn’t matter much because results from this game don’t generalize to reality—the game is too short.
If I have to overpower or negotiate with it to get something I might validly want, we’re back to corrigibility. That is: we’re back to admitting failure.
If power or influence or its corrigibility are needed to exercise a right to suicide then I probably need them just to slightly lower my “empowerment” as well. Zero would be bad. But “down” would also be bad, and “anything less than maximally up” would be dis-preferred.
Maybe, but behavioral empowerment still seems to pretty clearly apply to humans and explains our intrinsic motivation systems.
This is sublimation again. Our desire to eat explains (is a deep cause of) a lot of our behavior, but you can’t give us only that desire and also vastly more power and have something admirably human at the end of those modifications.
If I have to overpower or negotiate with it to get something I might validly want, we’re back to corrigibility.
Not really, because an AI optimizing for your empowerment actually wants to give you more options/power/choice—that’s not something you need to negotiate, that’s just what it wants to do. In fact one of the most plausible outcomes after uploading is that it realizes giving all its computational resources to humans is the best human-empowering use of that compute, and that it no longer has a reason to exist.
Human values/utility are complex and also non-stationary; they drift/change over time. So any error in modeling them compounds, and if you handle that uncertainty correctly you get a max-entropy distribution over utility functions in the future. Optimizing for empowerment is equivalent to optimizing for that max-entropy utility distribution—at least for a wide class of values/utilities.
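One way to see the equivalence claimed here: if future utilities over states are drawn from a maximally uncertain distribution (say i.i.d. Uniform(0,1)), then the expected best-achievable utility from a state grows with the number of states it can reach—analytically, E[max of k uniforms] = k/(k+1)—so under max-entropy uncertainty about values, reachability (empowerment) is exactly what you want to maximize. A quick Monte Carlo sketch (my own illustration, not from the post):

```python
import random

def expected_best_utility(num_reachable, trials=100_000, seed=0):
    """Monte Carlo estimate of E[max of k iid Uniform(0,1) utilities]:
    the utility an agent can expect to secure if it can reach k states
    whose (unknown) utilities are drawn uniformly at random."""
    rng = random.Random(seed)
    return sum(max(rng.random() for _ in range(num_reachable))
               for _ in range(trials)) / trials

# More reachable states -> higher expected achievable utility,
# converging to the analytic value k / (k + 1):
print(expected_best_utility(2))   # ~0.667
print(expected_best_utility(10))  # ~0.909
```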
So now after looking into the “last man” and “men without chests” concepts, I think the relevant quote from “men without chests” is at the end:
The Apostle Paul writes, “The aim of our charge is love that issues from a pure heart and a good conscience and a sincere faith (1 Timothy 1:5, ESV).”
If followers of Christ live as people with chests—strong hearts filled with God’s truth—the world will take notice.
“Men without chests” are then pure selfish rational agents devoid of altruism/love. I agree that naively maximizing the empowerment of a single human brain or physical agent could be a tragic failure. I think there are two potential solution paths to this, which I hint at in the diagram (which clearly shows a bunch of agents being empowered) and mentioned in a few places in the article, but should have discussed more clearly.
One solution I mention is to empower humanity or agency more broadly, which then naturally handles altruism, but leaves open how to compute the aggregate estimate and may require some approximate solution to social decision theory aka governance. Or maybe not, perhaps just computing empowerment assuming a single coordinated mega-agent works. Not sure yet.
The other potential solution is to recognize that brains really aren’t the agents of relevance, and instead we need to move to a more detailed distributed simulacra theory of mind. The brain is the hardware, but the true agents are distributed software minds that coordinate simulacra across multiple brains. So as you are reading this your self simulacra is listening to a local simulacra of myself, and in writing this my self simulacra is partially simulating a simulacra of you, etc. Altruism and selfishness are then different flavours of local simulacra governance systems, with altruism being something vaguely more similar to democracy and selfishness more similar to autocracy. When our brain imagines future conversations with someone, that particular simulacra gains almost as much simulation attention as our self simulacra—the internal dynamics are similar to those of simulacra in LLMs, which shouldn’t be surprising, because our cortex is largely trained by self-supervised prediction, like an LLM, on similar data.
So handling altruism is important, but I think it’s just equivalent to handling cooperative/social utility aggregation—which any full solution needs.
The last man concept doesn’t seem to map correctly:
The last man is the archetypal passive nihilist. He is tired of life, takes no risks, and seeks only comfort and security. Therefore, The Last Man is unable to build and act upon a self-actualized ethos.
That seems more like the opposite of an empowerment optimizer—perhaps you meant the Ubermensch?
So I think we can mostly rule this out, but perhaps I didn’t find the most succinct from of the argument.
Assume human values (for most humans) can be closely approximated by some unknown utility function with some unknown discount schedule: ∑∞t=0d(t)V(wt), which normally we can assume to use standard exponential discounting: ∑∞t=0βtV(wt).
The convergence to empowerment theorems indicate that there exists a power function P(wt) that is a universal approximator in the sense that optimizing future world state trajectories for P(wt) using a planning function f() is the same as optimizing future world state trajectories for the true value function V(wt) : limβ→1f(∞∑t=0βtP(wt))≈f(∞∑t=0βtV(wt)) for a wide class of value functions and sufficiently long term discount rates that seems to include or overlap the human range.
So it seems impossible that optimizing for empowerment would cause “humans to be half-preserved and horrifically mutilated” unless that is the natural path of long term optimizing for our current values. Any such failure is not a failure of optimizing for empowerment, but a failure in recognizing future self—which is a real issue, but it’s an issue any real implementation has to deal with regardless of the utility function, and it’s something humans aren’t perfectly clear on (consider all the debate around whether uploading preserves identity).
There are concepts like the last man and men without chests in various philosophies that imagine “a soul of pure raw optimization” as a natural tendency… and also a scary tendency.
The explicit fear is that simple hill climbing, by cultures, by media, by ads, by pills, by schools, by <whatever>… might lead to losing some kind of sublime virtue?
Also, it is almost certain that current humans are broken/confused, and are not actually VNM rational, and don’t actually have a true utility function. Observe: we are dutch booked all the time! Maybe that is only because our “probabilities” are broken? But I bet out utility function is broken too.
And so I hear a proposal to “assume human values (for most humans) can be closely approximated by some unknown utility function” and I’m already getting off the train (or sticking around because maybe the journey will be informative).
I have a prediction. I think an “other empowerment maximizing AGI” will have a certain predictable reaction if I ultimately decide that this physics is a subtle (or not so subtle) hellworld, or at least just not for me, and “I don’t consent to be in it”, and so I want to commit suicide, probably with a ceremony and some art.
What do you think would be the thing’s reaction if, after 500 years of climbing mountains and proving theorems and skiing on the moons of Saturn (and so on), I finally said “actually, nope” and tried to literally zero out “my empowerment”?
Seems doubtful given that simple hill climbing for inclusive fitness generated all that complexity.
Maybe, but behavioral empowerment still seems to pretty clearly apply to humans and explains our intrinsic motivation systems. I also hesitate trying to simplify human brains down to simple equations but sometimes its a nice way to make points.
Predictably, if the thing is optimizing solely for your empowerment, it would not want you to ever give up. However if the AGI has already heavily empowered you into a posthuman state its wishes may no longer matter.
If the AGI is trying to empower all of humanity/posthumanity then there also may be variants of that where it’s ok with some amount of suicide as that doesn’t lower the total empowerment of the human system much.
I think JenniferRM’s comment regarding suicide raises a critical issue with human empowerment, one that I thought of before and talked with a few people about but never published. I figure I may as well write out my thoughts here since I’m probably not going to do a human empowerment research project (I almost did; this issue is one reason I didn’t).
The biggest problem I see with human empowerment is that humans do not always want to maximally empowered at every point in time. The suicide example is a great example, but not the only one. Other examples I came up with include: tourists who go on a submarine trip deep in the ocean, or environmentalists who volunteer to be tied to a tree as part of a protest. Fundamentally, the issue is that at some point, we want to be able to commit to a decision and its associated consequences, even if it comes at the cost of our empowerment.
There is even empirical support for this issue with human empowerment. In the paper Assistance Via Empowerment (https://proceedings.neurips.cc/paper/2020/file/30de9ece7cf3790c8c39ccff1a044209-Paper.pdf), the authors use a reinforcement learning agent trained with a mix of the original RL reward and a human empowerment term as a co-pilot on LunarLander, to help human agents land the LunarLander craft without crashing. They find that if the coefficient on the human empowerment term is too high, “the copilot tends to override the pilot and focus only on hovering in the air”. This is exactly the problem above; focusing only on empowerment (in a naive empowerment formulation) can easily lead to the AI preventing us from achieving certain goals we may wish to achieve. In the case of LunarLander in the paper, we want to land, but the AI may stop us, because by getting closer to the ground for landing, we’ve reduced our empowerment.
It may be that current formulations of empowerment are too naive, and could possibly be reworked or extended to deal with this issue. E.g. you might try to have a human empowerment mode, and then a human assistance mode that focuses not on empowerment but on inferring the human’s goal and trying to assist with it; and then some higher level module detects when a human intends to commit to a course of action. But this seems problematic for many other reasons (including those covered in other discussions about alignment).
Overall, I like the idea of human empowerment, but greatly disagree with the idea that human empowerment (especially using the current simple math formulations I’ve seen) is all we need.
Yes—often we face decisions between short term hedonic rewards vs long term empowerment (spending $100 on a nice meal, or your examples of submarine trips), and an agent optimizing purely for our empowerment would always choose long term empowerment over any short term gain (which can be thought of as ‘spending’ empowerment). This was discussed in some other comments and I think mentioned somewhere in the article but should be more prominent: empowerment is only a good bound of the long term component of utility functions, for some reasonable future time cutoff defining ‘long term’.
But I think modelling just the short term component of human utility is not nearly as difficult as accurately modelling the long term, so it’s still an important win. I didn’t investigate that much in the article, but that is why the title is now “Empowerment is (almost) all we need”.
Thanks for the link to the “Assistance via Empowerment” study, I hadn’t seen that before. Based on skimming the paper I agree there are settings of the hyperparams where the empowerment copilot doesn’t help, but that is hardly surprising and doesn’t tell us much—that is nearly always the case with ML systems. On a more general note I think the lunar landing game has far too short of a planning horizon to be in the regime where you get full convergence to empowerment. Hovering in the air only maximizes myopic empowerment. If you imagine a more complex real world scenario where the lander has limited fuel and you crash if running out of fuel, crashing results in death, you can continue to live on a mission for years after landing .. etc it then becomes more obvious that the optimal plan for empowerment converges to landing successfully and safely.
Thanks for your response—good points and food for thought there.
One of my points is that this is a problem which arises depending on your formulation of empowerment, and so you have to be very careful with the way in which you mathematically formulate and implement empowerment. If you use a naive implementation I think it is very likely that you get undesirable behaviour (and that’s why I linked the AvE paper as an example of what can happen).
Also related is that it’s tricky to define what the “reasonable future time cutoff” is. I don’t think this is trivial to solve—use too short a cutoff and your empowerment is too myopic; use too long a cutoff and your model stops you from ever spending your money and always gets you to hoard more. If you use a hard-coded cutoff time, then you get edge cases around that cutoff. You might then need a dynamic time cutoff, and I don’t think that’s trivial to implement.
I also disagree with the characterization of the issue in the AvE paper as just a hyperparameter issue. Correct me if I am wrong here (as I may have misinterpreted the general gist of the ideas and comments on this front): I believe a key idea around human empowerment is that we can focus on maximally empowering humans—almost as if human empowerment is a “safe” target for optimization in some sense. I disagree with this idea, precisely because examples like those in AvE show that too much human empowerment can be bad. The critical point I wanted to get across here is that human empowerment is not a safe target for optimization.
Also, the other key point, related to the examples like the submarine, protest, and suicide, is that empowerment can sometimes conflict with our reward/utility/desires. The suicide example is the best illustration of this (and it seems not too far-fetched to imagine someone who wants to die, but can’t, and then feels increasingly worse—which seems like quite a nightmare scenario to me). Again, empowerment by itself isn’t enough for desirable outcomes; you need some tradeoff with the utility/reward/desires of humans—empowerment is hardly all (or almost all) that you need.
To summarize the points I wanted to get across:
Unless you are very careful with the specifics of your formulation of human empowerment, it very likely will result in bad outcomes. There are lots of implementation details to be considered (even beyond everything you mentioned in your post).
Human empowerment is not a safe target for optimization/maximization. I think this holds even if you have a careful definition of human empowerment (though I would be very happy to be proven wrong on this).
Human empowerment can be in conflict with human utility/desires, best illustrated by the suicide example. Therefore, I think human empowerment could be helpful for alignment, but am very skeptical it is almost all you need.
Edit: I just realized there are some other comments by other commenters that point out similar lines of reasoning to my third point. I think this is a critical issue with the human empowerment framework and want to highlight it a bit more, specifically highlighting JenniferRM’s suicide example which I think is the example that most vividly demonstrates the issue (my scenarios also point to the same issue, but aren’t as clear of a demonstration of the problem).
Thanks, I partially agree so I’m going to start with the most probable crux:
I am somewhat confident that any fully successful alignment technique (one resulting in a fully aligned CEV style sovereign) will prevent suicide; that this is a necessarily convergent result; and that the fact that maximizing human empowerment agrees with the ideal alignment solution on suicide is actually a key litmus test success result. In other words I fully agree with you on the importance of the suicide case, but this evidence is in favor of human empowerment convergence to CEV.
I have a few somewhat independent arguments of why CEV necessarily converges to suicide prevention:
The simple counterfactual argument: Consider the example of happy, well-adjusted but unlucky Bob, whose brain is struck by a cosmic ray which happens to cause a benign tumor in just the right spot to make him completely suicidal. Clearly pre-accident Bob would not choose this future, and strongly desires interventions to prevent the cosmic ray. Any agent successfully aligned to pre-accident Bob0 would agree. It also should not matter when the cosmic ray struck—the desire of Bob0 to live outweighs the desire of Bob1 to die. Furthermore—if Bob1 had the option of removing all effects of the cosmic-ray-induced depression, they would probably take it. Suicidal thinking is caused by suffering—via depression, physical pain, etc—and most people (nearly all people?) would take an option to eliminate their suffering without dying, if only said option existed (and they believed it would work).
Counterfactual intra-personal CEV coherence: A suicidal agent is one—by definition—that assigns higher ranking utility to future worlds where they no longer exist than future worlds where they do exist. Now consider the multiverse of all possible versions of Bob. The suicidal versions of Bob rank their worlds as lower utility than other worlds without them, and the non-suicidal versions of Bob rank their worlds as higher than worlds where they commit suicide. Any proper aligned CEV style sovereign will then simply notice that the utility functions of the suicidal and non-suicidal Bobs already largely agree, even before any complex convergence considerations! The CEV sovereign can satisfy both of their preferences by increasing the measure of worlds containing happy Bobs, and decreasing the measure of worlds containing suicidal Bobs. So it intervenes to prevent the cosmic ray, and more generally intervenes to prevent suicidal thought modes. Put another way—it can cause suicidal Bob to cease to exist (or exist less in the measure sense) without killing suicidal Bob.
Scaling intelligence trends towards lower discount rates: The purpose of aligned AI is to aid in optimizing the universe according to our utility function. As an agent absorbs more knowledge and improves their ability to foresee and steer the future this naturally leads to a lower discount rate (as discount rates arise from planning uncertainty). So improving our ability to foresee and steer the future will naturally lower our discount rate, making us more longtermist, and thus naturally increasing the convergence of our unknown utility function towards empowerment (which is non-suicidal).
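The claim that discount rates arise from planning uncertainty can be made concrete with a standard construction (my own illustration, not from the comment): if each step of a plan stays on track with probability p, a reward at time t is weighted by p^t—exactly exponential discounting with β = p. Better foresight raises p, pushing β toward 1 and making the agent more longtermist.

```python
def effective_weight(p, t):
    """Probability the plan is still on track at step t (= p**t),
    i.e. the weight an uncertainty-aware planner puts on rewards at t."""
    return p ** t

for p in (0.90, 0.99, 0.999):
    # First step at which the future is discounted below half weight.
    half_life = next(t for t in range(1, 10000) if effective_weight(p, t) < 0.5)
    print(f"p={p}: weight on t=100 is {effective_weight(p, 100):.3f}, "
          f"half-life ~{half_life} steps")
```

An agent at p = 0.9 effectively cannot see past a week of steps; at p = 0.999 its planning half-life is hundreds of times longer, which is the regime where the convergence-to-empowerment argument bites.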
Inter-personal CEV coherence: Most humans are non-suicidal and prefer that other humans are non-suicidal. At the limits of convergence, where many futures are simulated and those myriad future selves eventually cohere into agreement, this naturally leads to suicide prevention: most surviving future selves are non-suicidal, and even weak preferences that others do not commit suicide will eventually dominate the coherent utility function over spacetime. We can consider this a generalization of intra-personal CEV coherence, because the boundary separating all the alternate versions of ourselves across the multiverse from the alternate versions of other people is soft and illusory.
Now back to your other points:
I largely agree, albeit with less confidence. This article is a rough abstract sketch of a complex topic. I have some more thoughts on how empowerment arises naturally, and some math and examples but that largely came after this article.
I agree that individual human empowerment is incomplete for some of the reasons discussed, but I do expect that any correct implementation of something like CEV will probably result in a very longtermist agent, to which the instrumental convergence to empowerment applies with fewer caveats. Thus there exists a definition of broad empowerment such that it is a safe bound on that ideal agent’s unknown utility function.
Part of the big issue here is that humans die—so our individual brain empowerment eventually falls off a cliff, and this bounds our discount rate (we also run into brain capacity and decay problems which further compound the issue). Any aligned CEV sovereign is likely to focus on fixing that problem—i.e. through uploading and the post-biological transition. Posthumans in any successful utopia will be potentially immortal and thus are likely to have lower and decreasing discount rates.
Also I think most examples of ‘spending’ empowerment are actually examples of conversion between types of empowerment. Spending money on social events with friends is mostly an example of a conversion between financial empowerment and social empowerment. The submarine example is also actually an example of trading financial empowerment for social empowerment (it’s a great story and experience to share with others) and curiosity/knowledge.
All that said I do think there are actual true examples of pure short term rewards vs empowerment tradeoff decisions—such as buying an expensive meal you eat at home alone. These are mostly tradeoffs between hedonic rewards vs long term empowerment, and they don’t apply so much to posthumans (who can have essentially any hedonic reward at any time for free).
This one I don’t understand. The AvE paper trained an empowerment copilot. For some range of hyperparams the copilot helped the human by improving their ability to land successfully (usually by stabilizing the vehicle to make it more controllable). For another range of hyperparams the copilot instead hovered in the air, preventing a landing. It’s just a hyperparam issue because it does work as intended in this example with the right hyperparams. At a higher level though this doesn’t matter much because results from this game don’t generalize to reality—the game is too short.
If I have to overpower or negotiate with it to get something I might validly want, we’re back to corrigibility. That is: we’re back to admitting failure.
If power or influence over the AI, or its corrigibility, are needed to exercise a right to suicide, then I probably need them just to slightly lower my “empowerment” as well. Zero would be bad. But “down” would also be bad, and “anything less than maximally up” would be dis-preferred.
This is sublimation again. Our desire to eat explains (is a deep cause of) a lot of our behavior, but you can’t give us only that desire and also vastly more power and have something admirably human at the end of those modifications.
Not really, because an AI optimizing for your empowerment actually wants to give you more options/power/choice—that’s not something you need to negotiate for; it’s just what it wants to do. In fact one of the most plausible outcomes after uploading is that it realizes giving all its computational resources to humans is the best human-empowering use of that compute, and that it no longer has a reason to exist.
Human values/utility are complex and also non-stationary, they drift/change over time. So any error in modeling them compounds, and if you handle that uncertainty correctly you get a max entropy uncertain distribution over utility functions in the future. Optimizing for empowerment is equivalent to optimizing for that max entropy utility distribution—at least for a wide class of values/utilities.
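A toy numerical check of the equivalence claim (my own sketch, not from the article): sample many random utility functions over a set of outcome states as a stand-in for a max-entropy distribution over utilities. A position’s expected achievable utility under an unknown utility function grows with the number of outcomes it can still reach, so maximizing the option count (an empowerment proxy) is approximately optimal against the averaged utility distribution.

```python
import random

random.seed(0)

N_STATES = 50       # possible final outcome states
N_UTILITIES = 2000  # samples from a max-entropy (uniform i.i.d.) utility prior

def expected_best(reachable, utilities):
    """Expected utility of the best reachable outcome, averaged over
    sampled utility functions -- i.e. value under an unknown utility."""
    return sum(max(u[s] for s in reachable) for u in utilities) / len(utilities)

utilities = [[random.random() for _ in range(N_STATES)]
             for _ in range(N_UTILITIES)]

# Positions that keep more outcomes reachable (more empowered) score
# higher under the unknown-utility average:
few  = expected_best(range(2),  utilities)   # can reach 2 outcomes
some = expected_best(range(10), utilities)   # can reach 10 outcomes
many = expected_best(range(40), utilities)   # can reach 40 outcomes
print(few, some, many)  # monotonically increasing
```

(For k uniform outcomes the expected best is k/(k+1), so the gain from extra options has diminishing returns—but it is always positive, which is the sense in which option-maximization serves the unknown utility.)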
So now after looking into the “last man” and “men without chests” concepts, I think the relevant quote from “men without chests” is at the end:
“Men without chests” are then pure selfish rational agents devoid of altruism/love. I agree that naively maximizing the empowerment of a single human brain or physical agent could be a tragic failure. I think there are two potential solution paths to this, which I hint at in the diagram (which clearly is empowering a bunch of agents) and I mentioned a few places in the article but should have more clearly discussed.
One solution I mention is to empower humanity or agency more broadly, which then naturally handles altruism, but leaves open how to compute the aggregate estimate and may require some approximate solution to social decision theory aka governance. Or maybe not, perhaps just computing empowerment assuming a single coordinated mega-agent works. Not sure yet.
The other potential solution is to recognize that brains really aren’t the agents of relevance, and instead we need to move to a more detailed distributed simulacra theory of mind. The brain is the hardware, but the true agents are distributed software minds that coordinate simulacra across multiple brains. So as you are reading this, your self simulacra is listening to a local simulacra of myself, and in writing this my self simulacra is partially simulating a simulacra of you, etc. Altruism and selfishness are then different flavours of local simulacra governance systems, with altruism being vaguely more similar to democracy and selfishness more similar to autocracy. When our brain imagines future conversations with someone, that particular simulacra gains almost as much simulation attention as our self simulacra—the internal dynamics are similar to simulacra in LLMs, which shouldn’t be surprising because our cortex is largely trained by self-supervised prediction, like an LLM, on similar data.
So handling altruism is important, but I think it’s just equivalent to handling cooperative/social utility aggregation—which any full solution needs.
The last man concept doesn’t seem to map correctly:
That seems more like the opposite of an empowerment optimizer—perhaps you meant the Ubermensch?