I’d like to separate out 3 claims here:
1. An “efficient and robust approximation of empowerment” is a “natural abstraction” / salient concept that AGIs (“even those of human-level intelligence”) are likely to have learned
2. If an AGI is trained on “correlation-guided proxy matching” with [todo: fill-in-the-blank] proxy, then it will wind up wanting to maximize the “efficient and robust approximation of empowerment” of humanity
3. If an AGI was trying to maximize the empowerment of humanity, that would be a good thing.
Of these:
I’m somewhat skeptical of Claim 1. I am a human-level intelligence, and I do have an empowerment concept, but I don’t have a robust empowerment concept. For example, when I ask myself how someone would spend $1 trillion to maximally “empower” their pet turtle, my brain immediately responds: ¯\_(ツ)_/¯. I agree that AGIs are very likely to have an “empowerment” concept (if only because AGIs are very likely to have heard the English word “empowerment”), but I expect that by default that concept (like all concepts) would be defined by a bunch of statistical associations to other concepts and pattern-matches to lots of examples. But the examples would all be everyday examples of humans empowering humans (e.g., on YouTube), without a clear-cut method to extrapolate that concept into weird sci-fi futures, nor a clear-cut motivation to develop such a method, I think. I would instead propose to figure out the “efficient and robust approximation of empowerment” ourselves, right now, then write down the formula, make up a jargon word for it, directly tell the AGI what the formula is and give it tons of worked examples during training, and then we can have somewhat more hope that the AGI will have this particular concept that we (allegedly) want it to have.
I will remain very concerned about Claim 2 until someone shows me the exact proxy to be used. (I’m working on this problem myself, and don’t expect any solution I can think of to converge to pure empowerment motivation.) For example, when Human A learns that Human B is helpless (disempowered), that thought can be negative-valence for Human A, and Human A can address this problem by helping Human B (which is good), or by deliberately avoiding thinking about Human B (which is bad), or by dehumanizing / kicking Human B out of the circle of agents that Human A cares about (which is very bad). It’s not clear how to make sure that the first one happens reliably.
I’m skeptical of Claim 3 mainly for similar reasons as Charlie Steiner’s comment on this page: that an AGI trying to empower me would want me to accumulate resources but not spend them, and would want me to want to accumulate resources but not spend them, and more generally would not feel any particular affinity for me maintaining my idiosyncratic set of values and desires as opposed to getting brainwashed or whatever.
Related to this last bullet point, OP writes “The specific human values that most deviate from empowerment are exactly the values that are least robust and the most likely to drift or change as we become posthuman and continue our increasingly accelerated mental and cultural evolution”. But I claim these are also exactly the values that determine whether our future lightcone is tiled with hedonium versus paperclips versus cosmopolitan posthuman society etc. Yes they might drift, but we care very much that they drift in a way that “carries the torch of human values into the future” (or somesuch), as opposed to deleting them entirely or rolling an RNG.
I’m guessing the reply would be something related to the topic of the “identity preservation” section. But I don’t understand how the time bounds work here. If the AGI spends Monday executing a plan to accumulate resources, and then gives those resources to the person on Tuesday, to use for the rest of their lives, that’s good. If the AGI spends Monday brainwashing the human to be more power-hungry, and then the person is more effective at resource-acquisition starting on Tuesday and continuing for the rest of their lives, that’s bad. I’m confused how the empowerment formula would treat these two cases in the way we want.
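For reference, the standard information-theoretic definition of n-step empowerment from the empowerment literature (Klyubin, Polani & Nehaniv’s formulation) is the channel capacity from an agent’s next n actions to its state n steps later; presumably any “efficient and robust approximation” would be approximating something like this:

$$\mathfrak{E}_n(s_t) \;=\; \max_{p(a_t,\ldots,a_{t+n-1})} I\!\left(A_t,\ldots,A_{t+n-1}\,;\,S_{t+n}\,\middle|\,s_t\right)$$

A long-horizon “empower the human” objective would then be some time-aggregated version of this quantity computed with the human as the acting agent, e.g. (my notation, not necessarily the OP’s) $\sum_{k\ge 0}\gamma^{k}\,\mathfrak{E}_n^{\text{human}}(s_{t+k})$. On my reading, both Monday plans increase the human’s Tuesday-onward terms in that sum, which is exactly where my confusion about the time bounds comes from.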
Our drives for power, knowledge, self-actualization, social status/influence, curiosity and even fun can all be derived as instrumental subgoals or manifestations of empowerment.
I can read this sentence in two ways, and I’m not sure which one is intended:
There is innate stuff in the genome that makes humans want empowerment, and humans discover through within-lifetime learning that social status is instrumentally useful for achieving empowerment. Ditto curiosity, fun, etc.
There is innate stuff in the genome that makes humans want social status. Oh by the way, the reason that this stuff wound up in the genome is because social status tends to lead to empowerment, which in turn tends to lead to higher inclusive genetic fitness. Ditto curiosity, fun, etc.
The second one is fine, but I feel pretty strongly opposed to the first one. I think people and animals start doing things out of curiosity, or for fun, etc., long before having any basis for knowing that the resulting behaviors will tend to increase their empowerment.
This is important because the first one would suggest that humans don’t really want fun, or social status, or whatever. They really want empowerment, and everything else is a means to an end. But that’s not true! Humans really do want to have fun, and not suffer, etc., as an end in itself, I claim. That means that an AGI with an empowerment objective would want different things for us than we want for ourselves, which is bad.
An “efficient and robust approximation of empowerment” is a “natural abstraction” / salient concept that AGIs (“even those of human-level intelligence”) are likely to have learned
That isn’t actually a claim I’m making—empowerment intrinsic motivation is the core utility function for selfish AGI (see all the examples in that section) rather than something learned. Conceptually, altruistic AGI uses external empowerment as the core utility function (although in practice it will also likely need self-empowerment-derived intrinsic motivation to bootstrap).
Also compare to the behavioral empowerment hypothesis from the intro: bacteria moving along sugar gradients, chimpanzees seeking social status, and humans seeking wealth are not doing those things to satisfy a learned concept of empowerment—they are acting as if they are maximizing empowerment. Evolution learned various approximations of empowerment intrinsic motivation.
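As a toy illustration of agents “acting as if they are maximizing empowerment” (a minimal sketch of my own, not code from the post; the 5×5 room and action set are made-up assumptions): in a deterministic environment, n-step empowerment reduces to the log of the number of distinct states reachable in n steps, so an agent greedily climbing this quantity drifts toward open, central positions where it has the most options.

```python
import math
from itertools import product

# Hypothetical toy gridworld: a 5x5 room. In a deterministic environment,
# n-step empowerment reduces to log2(# of distinct states reachable in n steps).
SIZE = 5
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # right, left, down, up, stay

def step(state, action):
    """Deterministic dynamics: move unless the move would leave the grid."""
    x, y = state
    dx, dy = action
    nx, ny = x + dx, y + dy
    if 0 <= nx < SIZE and 0 <= ny < SIZE:
        return (nx, ny)
    return (x, y)  # bump into the wall: stay put

def empowerment(state, n):
    """n-step empowerment (in bits) of a state under deterministic dynamics."""
    reachable = set()
    for seq in product(ACTIONS, repeat=n):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# A corner is less "empowered" than the center of the room:
print(empowerment((0, 0), 2))  # corner -> log2(6)  ~= 2.58 bits
print(empowerment((2, 2), 2))  # center -> log2(13) ~= 3.70 bits
```

Corners and dead ends score low and the middle of the room scores high, which is the optionality-seeking behavior the hypothesis describes.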
without a clear-cut method to extrapolate that concept into weird sci-fi futures,
Ah ok, so I think the core novelty here is that no matter what your values are, optimizing for your empowerment today is identical to optimizing for your long-term values today. Those specific weird dystopian sci-fi futures are mostly all automatically avoided.
The price that you’d pay for that is giving up short-term utility for long-term utility, and possibly a change in core values when becoming posthuman. But I think we can mostly handle that by using learned human values to cover more of the short-term utility and empowerment for the long term.
But there is always this unavoidable tradeoff between utility at different timescales, and there is an optimization pressure gradient favoring low discount rates.
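One way to make that split concrete (my own formalization of the suggestion, with $\hat{V}_{\text{human}}$ and the weights $w_k$ as assumed placeholders rather than anything from the post) is a horizon-weighted blend, in which a learned human-value model covers the near term and external empowerment takes over as the horizon grows:

$$U(s_t)\;=\;\sum_{k\ge 0}\gamma^{k}\Big[(1-w_k)\,\hat{V}_{\text{human}}(s_{t+k})\;+\;w_k\,\mathfrak{E}(s_{t+k})\Big],\qquad w_k \to 1 \text{ as } k \to \infty.$$

The discount factor $\gamma$ is then where the timescale tradeoff lives: the closer $\gamma$ is to 1 (i.e. the lower the discount rate), the more of the total weight lands on the long-horizon empowerment terms.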
I would instead propose to figure out the “efficient and robust approximation of empowerment” ourselves, right now, then write down the formula, make up a jargon word for it,
Right—that’s all covered by the research track on empowerment and intrinsic motivation that I briefly summarized.
If an AGI is trained on “correlation-guided proxy matching” with [todo: fill-in-the-blank] proxy, then it will wind up wanting to maximize the “efficient and robust approximation of empowerment” of humanity
That isn’t a claim I make here in this article. I do think the circuit grounding/pointing problem is fundamental, and correlation-guided proxy matching is my current best vague guess about how the brain solves that problem. But that’s a core problem with selfish AGI as well—for the reasons outlined in the Cartesian objection section, robust utility functions must be computed from learned world-model state.
I’m skeptical of Claim 3 mainly for similar reasons as Charlie Steiner’s comment on this page that an AGI trying to empower me would want me to accumulate resources but not spend them,
I already responded to his comment, but yes, long-termism favors saving/investing over spending.
To the extent that’s actually a problem, one could attempt to tune an empowerment discount rate that matches the human discount rate, so the AGI wants you to sacrifice some long-term optionality/wealth for some short-term optionality/wealth. Doing that too much causes divergence, however, so I focused on the pure long-term cases where there is full convergence, and again I think using learned human values more directly for the short term seems promising.
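A minimal numeric sketch of that saving-vs-spending tradeoff (toy numbers of my own, not anything from the article): with per-year discount factor gamma and investment return r, a discounted agent prefers deferring consumption whenever gamma × (1 + r) > 1, so a sufficiently patient (low-discount-rate) agent keeps saying “not yet”, while a lower gamma (heavier discounting) flips the preference.

```python
# Toy illustration (assumed numbers): does a discounted agent prefer
# "save and spend in `years` years" over "spend one unit now"?
# Saving grows the unit to (1+r)**years, which is then discounted by gamma**years.
def prefers_saving(gamma: float, r: float, years: int) -> bool:
    spend_now = 1.0
    spend_later = (gamma ** years) * ((1 + r) ** years)
    return spend_later > spend_now

print(prefers_saving(gamma=0.99, r=0.05, years=10))  # True:  0.99 * 1.05 > 1, waiting wins
print(prefers_saving(gamma=0.90, r=0.05, years=10))  # False: 0.90 * 1.05 < 1, spending now wins
```

Tuning gamma toward the human rate, as suggested above, is what trades away some of the long-horizon convergence.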
If the AGI spends Monday executing a plan to accumulate resources, and then gives those resources to the person on Tuesday, to use for the rest of their lives, that’s good. If the AGI spends Monday brainwashing the human to be more power-hungry, and then the person is more effective at resource-acquisition starting on Tuesday and continuing for the rest of their lives, that’s bad.
If we are talking about a human-surpassing AGI, then almost by definition it will be more effective for the AGI to generate wealth for you directly rather than ‘brainwashing’ you into something that can generate wealth more effectively than it can.
There is innate stuff in the genome that makes humans want social status. Oh by the way, the reason that this stuff wound up in the genome is because social status tends to lead to empowerment, which in turn tends to lead to higher inclusive genetic fitness. Ditto curiosity, fun, etc.
Yeah, mostly this, because empowerment is very complex and can only be approximated, and it must be approximated efficiently even early on. So somewhere in there I described it as an instrumental hierarchy, where inclusive fitness leads to empowerment, which leads to curiosity, fun, etc.
Except of course there are some things, like money, whose utility we seem to learn pretty quickly and intuitively, which suggests we are also eventually using some more direct learned approximations of empowerment.
Getting back to this:
But I claim these are also exactly the values that determine whether our future lightcone is tiled with hedonium versus paperclips versus cosmopolitan posthuman society etc.
Humans and all our complex values are the result of evolutionary optimization for a conceptually simple objective: inclusive fitness. A posthuman society transcends biology, and inclusive fitness no longer applies. What is the new objective function for post-biological evolution? Posthumans are still intelligent agents with varying egocentric objectives, and thus still systems to which the behavioral empowerment law applies. So the outcome is a natural continuation of our memetic/cultural/technological evolution, which fills the lightcone with a vast and varied complex cosmopolitan posthuman society.
The values that deviate from empowerment are nearly exclusively related to sex, which no longer serves any direct purpose but could still serve fun and thus empowerment. Reproduction still exists, but in a new form. Everything that survives or flourishes tends to do so because it ultimately serves the purpose of some higher-level optimization objective.