For a single human/agent, assume they have some utility function $u$ over future world trajectories, $u(W_{t \in T})$, which really just says they have a preference ranking over futures. A reasonable finite utility function will decompose into a sum of discounted utility over time, $\sum_{t \in T} r^t u(W_t)$, and there are some nice theorems indicating that any such utility function converges to, and thus is well approximated by, empowerment (future optionality: a formal measure of power over future world states). However, the approximation accuracy improves only with increasing time horizon, so it becomes exact only in the limit of the discount factor $r$ approaching 1.
Another way of saying that is: all agents with a discount factor of 1 are in some sense indistinguishable, because their optimal instrumental plans are all the same: take control of the universe.
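To make the two objects concrete, here is one standard way to write them down (a sketch using the usual channel-capacity definition of empowerment; the theorems alluded to above may use a somewhat different formalization):

$$U(W) \;=\; \sum_{t \in T} r^t\, u(W_t), \qquad \mathcal{E}_n(s) \;=\; \max_{p(a^n)}\; I\big(A^n ;\, S_{t+n} \,\big|\, S_t = s\big)$$

Here $\mathcal{E}_n(s)$ is the maximal mutual information between the agent's next $n$ actions and the state it ends up in, i.e. a measure of how many distinguishable futures remain reachable. The convergence claim is that as $r \to 1$, optimizing $U$ for a reasonable $u$ looks more and more like optimizing $\mathcal{E}_n$ for large $n$.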
So there are three objections/issues:
1. Humans are at least partially altruistic, so even when focusing on a single human it would not be correct to optimize for something like selfish empowerment of their brain's action channel.
2. Humans do not have a discount factor of 1, so the approximation error in the short-term component of our utility could cause issues.
3. Even if we assume good solutions to 1 and 2 (which I'm optimistic about), it's not immediately clear how to correctly use this for more realistic alignment to many external agents (i.e. humanity, sapients in general, etc.); there is still perhaps a utility-combination issue.
Of these issues, #2 seems like the least concern, as I fully expect that the short-term component of utility is the easiest to learn via obvious methods. So the fact that empowerment is a useful approximation only for the very hard long-term component of utility is a strength, not a weakness: it directly addresses the hard challenge of long-term value alignment.
The solutions to 1 and 3 are intertwined. You could model the utility function of a fully altruistic agent as a weighted combination of other agents' utility functions. Applying that to partially altruistic agents, you get something like a PageRank-style graph recurrence, which could be modeled more directly, but it may also just naturally fall out of broad multi-agent alignment (the solution to 3).
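A minimal sketch of that recurrence, assuming each agent's effective utility is a convex mix of its own selfish utility and the care-weighted effective utilities of the others (the weights and toy numbers are purely illustrative):

```python
import numpy as np

def effective_utilities(base, care, altruism, iters=200):
    """Fixed point of: u_i = (1 - altruism_i) * base_i + altruism_i * sum_j care[i, j] * u_j.

    base     : (n,) selfish utilities each agent assigns to some outcome
    care     : (n, n) row-stochastic matrix; how much agent i weights agent j
    altruism : (n,) values in [0, 1); altruistic fraction of each agent's utility
    """
    u = base.copy()
    for _ in range(iters):
        u = (1 - altruism) * base + altruism * (care @ u)
    return u

# Toy example: agent 0 is mostly selfish, agent 2 is mostly altruistic.
base = np.array([1.0, 0.0, -0.5])            # selfish valuations of one outcome
care = np.array([[0.0, 0.5, 0.5],            # rows sum to 1
                 [0.5, 0.0, 0.5],
                 [0.5, 0.5, 0.0]])
altruism = np.array([0.1, 0.5, 0.9])
print(effective_utilities(base, care, altruism))
```

Because every altruism weight is strictly below 1 and the care matrix is row-stochastic, the update is a contraction and converges to a unique fixed point, which is the PageRank-like object described above.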
One approach which seems interesting/promising is to just broadly seek to empower any/all external agency in the world, weighted roughly by observational evidence for that agency. I believe that human altruism amounts to something like that: children sometimes feel genuine empathy even for inanimate objects, but only because they anthropomorphize them, that is, they model them as agents.
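And a toy version of that "empower any/all external agency, weighted by evidence of agency" objective, under strong simplifying assumptions: deterministic dynamics (so n-step empowerment reduces to the log of the number of distinct reachable states) and placeholder agency credences:

```python
from itertools import product
from math import log2

def empowerment(state, step, actions, n):
    """n-step empowerment in a deterministic world:
    log2 of the number of distinct states reachable within n actions."""
    reachable = set()
    for plan in product(actions, repeat=n):
        s = state
        for a in plan:
            s = step(s, a)
        reachable.add(s)
    return log2(len(reachable))

def world_empowerment(candidates, n=3):
    """Sum of each candidate agent's empowerment, weighted by the
    observer's credence that the candidate is an agent at all."""
    return sum(c["p_agent"] * empowerment(c["state"], c["step"], c["actions"], n)
               for c in candidates)

# Toy 1-D world: a position on a line, clipped to [0, 10].
step = lambda x, a: max(0, min(10, x + a))
candidates = [
    {"p_agent": 0.95, "state": 5, "step": step, "actions": (-1, 0, 1)},  # clearly an agent
    {"p_agent": 0.30, "state": 0, "step": step, "actions": (0, 1)},      # maybe an agent
]
print(world_empowerment(candidates))
```

The weighting mirrors the anthropomorphism point above: anything modelled as an agent gets some weight, and the more confidently something is modelled as an agent, the more its optionality counts toward the objective.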
All right! Thank you for the precision. Indeed, the altruistic part seems interestingly close to a broad 'world empowerment', but I have some doubts about a few elements surrounding this claim: "the short term component of utility is the easiest to learn via obvious methods".
It could be true, but there are worries that it might be hard, so I am trying to find a way to resolve this.
If the rule/policy for choosing the utility function is a preference based on a model of humans/agents, then there might be ways to circumvent or miss what we would truly prefer (the pull of maximization would outrun the limited sharpness/completeness of the models), because the model underfits reality, and that divergence would keep growing as the model updates along the transformations performed by the AI.
In practice this would allow a sort of intrusion of the AI into agents to force mutations.
So, intrusion could be instrumental.
That is why I want to escape the 'trap of modelling' even further, by indirectly targeting our preferences through a primal goal of non-myopic optionality (even more externally focused) before guessing at utility.
If your #2 is the least concern, then indeed those worries aren't as meaningful.
I'm also trying to avoid us becoming grabby aliens, but if altruism is naturally derived from a broad world empowerment, then it could be functional, because the features of the combination of worldwide utilities (empowering all agencies) *are* altruism, sufficiently so to generalize in the 'latent space of altruism', which implies being careful about what you do to other planets.
The maximizer worry would also be tamed by design.
And in fact my focus on optionality would essentially be the same as a worldwide-agency concern (though I'm thinking of a universal agency, to completely erase the maximizer issue).