This seems potentially consistent with CEV and with June Ku’s refinement of it. But the devil’s in the details. Do you or anyone have a formalization of “altruistic empowerment”?
What is June Ku’s refinement of CEV? A few quick searches on LW and Google aren’t bringing up the expected thing (although some of the old discussions on CEV are still interesting).
Franzmeyer et al. clearly have a formalization of altruistic empowerment that works in gridworlds, but I don’t think it would survive the cartesian embedding obstacles of a more realistic world without substantial modification.
If I had to formalize it quickly, I’d start with a diffusion-planning-style agent which outputs actions to optimize over future world trajectories by sampling from the learned trajectory distribution weighted by utility (i.e. favoring trajectories that are both probable/realistic and high utility):
$$f(W_T) = w_T \in W_T \sim \big( p(w_T)\, V_T(w_T) \big)$$
Here $W_T$ is a distribution over predicted world trajectories, $w_T$ is an individual predicted trajectory with unweighted probability $p(w_T)$, and the agent’s own predicted future actions (and everything else) are included in these trajectories. The generic value function $V_T$ is then defined over the entire trajectory and can use any component of it. This is a generic simplification of the diffusion-planning approach which avoids adversarial optimization issues by unifying world/action prediction and the utility function into a common objective.
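For concreteness, here’s a minimal numerical sketch of that utility-weighted trajectory sampling. It is not an actual diffusion planner; `sample_trajectories` and `trajectory_value` are stand-in stubs of my own for a learned trajectory model and $V_T$:

```python
import numpy as np

# Minimal sketch of utility-weighted trajectory sampling (not an actual
# diffusion planner).  `sample_trajectories` and `trajectory_value` are
# stand-in stubs for a learned trajectory model and V_T.

rng = np.random.default_rng(0)

def sample_trajectories(n, horizon):
    """Stub for the learned model: returns (trajectories, log p(w_T))."""
    trajs = rng.normal(size=(n, horizon))        # each row is one candidate w_T
    logp = -0.5 * (trajs ** 2).sum(axis=1)       # stand-in model log-probability
    return trajs, logp

def trajectory_value(trajs):
    """Stub for V_T(w_T): any function of the whole trajectory works."""
    return trajs.sum(axis=1)

def plan(n=1024, horizon=10):
    trajs, logp = sample_trajectories(n, horizon)
    v = trajectory_value(trajs)
    # weight each candidate by p(w_T) * V_T(w_T); the shift is a crude way
    # to keep the sampling weights nonnegative
    w = np.exp(logp) * (v - v.min() + 1e-9)
    w /= w.sum()
    chosen = trajs[rng.choice(n, p=w)]
    return chosen[0]                             # act on the first step of the chosen w_T

print(plan())
```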
So then if we assume some standard sensible utility function with exponential discounting, $V_T(\beta, w_T) = \sum_{t=0}^{\infty} \beta^t V(w_t)$, then instrumental convergence implies something like:

$$\lim_{\beta \to 1} \Big( w_T \in W_T \sim \big( p(w_T)\, P_T(\beta, w_T) \big) \Big) \approx w_T \in W_T \sim \big( p(w_T)\, V_T(\beta, w_T) \big)$$

That is, we can substitute the empowerment proxy function $P_T(w_T)$ for the true utility function $V_T(w_T)$, and the resulting planning trajectories converge to equivalence as the discount rate $\beta$ goes to 1.
The convergence clearly only holds for some utility functions (as a few people pointed out in this thread, it obviously doesn’t converge for suicidal agents).
The agent-identification and/or continuity-of-identity issue is ensuring that the empowerment function $P_T(w_T)$ identifies the same agents as the ‘true’ desired utility function $V_T(w_T)$, which seems like much of the challenge.
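Here’s a toy scaffold for poking at that substitution claim numerically. Everything in it is an illustrative assumption of mine: a small corridor world with an absorbing trap, a small ‘hedonic’ reward next to the trap, a larger reward at the far end, and a crude k-step reachability proxy standing in for $P_T$:

```python
import numpy as np

# Toy scaffold for checking the substitution claim numerically.  The corridor
# world, the rewards, and the k-step reachability proxy standing in for P_T
# are all illustrative assumptions.

N = 7                      # states 0..6 in a corridor
TRAP, GOAL, START = 0, 6, 2

def step(s, a):
    """Deterministic transition; state 0 is an absorbing trap."""
    return s if s == TRAP else max(0, min(N - 1, s + a))

def reward(s):
    if s == GOAL:
        return 1.0
    if s == 1:
        return 0.2         # small "hedonic" reward right next to the trap
    return 0.0

def value_iteration(beta, iters=500):
    V = np.zeros(N)
    for _ in range(iters):
        V = np.array([reward(s) + beta * max(V[step(s, a)] for a in (-1, 1))
                      for s in range(N)])
    return V

def empowerment_proxy(s, k=3):
    """Crude P_T stand-in: log of the number of states reachable in k steps."""
    reachable = {s}
    for _ in range(k):
        reachable |= {step(x, a) for x in reachable for a in (-1, 1)}
    return np.log(len(reachable))

for beta in (0.5, 0.9, 0.99):
    V = value_iteration(beta)
    greedy_v = max((-1, +1), key=lambda a: V[step(START, a)])
    print(f"beta={beta}: greedy action under true V is {greedy_v:+d}")

greedy_p = max((-1, +1), key=lambda a: empowerment_proxy(step(START, a)))
print(f"greedy action under the empowerment proxy is {greedy_p:+d}")
```

In this particular toy the greedy action under the true discounted utility comes to agree with the empowerment-proxy action as $\beta \to 1$, but as noted above that agreement depends on the utility function and environment; it’s an empirical scaffold, not a proof of the convergence claim.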
I also think it’s interesting to compare the empowerment bound to using some max-entropy uncertain distribution over egocentric utility functions; it seems like it’s equivalent, or at least similar.
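As a rough sanity check on that comparison, here’s a small Monte Carlo sketch. The corridor world and the iid-uniform prior (standing in for a max-entropy distribution over egocentric utility functions) are my own illustrative assumptions; it just ranks states by the k-step reachability proxy and by the expected best outcome achievable under a random utility:

```python
import numpy as np

# Rough numerical comparison between a k-step reachability ("empowerment")
# proxy and the expected best outcome under a max-entropy prior over
# egocentric utility functions (here: iid uniform rewards over states).
# The corridor world and the iid-uniform prior are illustrative assumptions.

N, TRAP = 7, 0
rng = np.random.default_rng(0)

def step(s, a):
    return s if s == TRAP else max(0, min(N - 1, s + a))

def reachable(s, k=3):
    out = {s}
    for _ in range(k):
        out |= {step(x, a) for x in out for a in (-1, 1)}
    return sorted(out)

def empowerment_proxy(s, k=3):
    return np.log(len(reachable(s, k)))

def expected_best_outcome(s, k=3, samples=2000):
    """E_u[ max over reachable states of u(state) ] with u ~ iid Uniform(0,1)."""
    idx = reachable(s, k)
    u = rng.uniform(size=(samples, N))
    return u[:, idx].max(axis=1).mean()

for s in range(N):
    print(s, round(empowerment_proxy(s), 3), round(expected_best_outcome(s), 3))
```

With iid utilities the expected best reachable outcome is $n/(n+1)$ for $n$ reachable states, so both quantities are monotone in the size of the reachable set, which is the intuition behind the ‘equivalent or similar’ guess.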
You can see the absolute minimal statement of it here. To me, steps 2 and 3 are the big innovation.
I read a bit more about empowerment, and it’s unclear to me what it outputs, when the input is an agent without a clear utility function. (Other replies have touched on this.) An agent can have multiple goals which come to the fore under complicated conditions, an agent can even want contrary things at different times. We should look for thought-experiments which test whether empowerment really does resolve such conflicts in an acceptable way. It might also be interesting to apply the empowerment formalism to Yann LeCun’s “energy-based models”, because they seem to be an important concrete example of an agent architecture that isn’t a utility maximizer.
I read a bit more about empowerment, and it’s unclear to me what it outputs, when the input is an agent without a clear utility function.
I realize this probably isn’t quite responding to what you meant, but empowerment doesn’t require or use a utility function, so you can estimate it for any ‘agent’ which has some conceptual output action channel. You could even compute it for output channels which aren’t really controlled by agents, and it would still compute empowerment as if that output channel were controlled by an agent. However, the more nuanced versions one would probably need for human-level agents also need to consider the agent’s actual planning ability, for reasons mentioned in the cartesian objections section.
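To make that concrete, here’s a minimal sketch of the standard channel-capacity formulation of empowerment: the maximum over action distributions of the mutual information between the action channel and the resulting state, computed with the Blahut-Arimoto iteration. No utility function appears anywhere; the example channel matrix is an illustrative assumption:

```python
import numpy as np

# Minimal sketch: empowerment as the channel capacity
#   max_{p(a)} I(A ; S')
# between an (action-sequence) channel input and the resulting state, computed
# via the Blahut-Arimoto iteration.  No utility function appears anywhere.
# The example channel matrix below is an illustrative assumption.

def empowerment(P, iters=200):
    """P[a, s'] = probability of ending in state s' given action sequence a."""
    n_a = P.shape[0]
    p = np.full(n_a, 1.0 / n_a)                  # distribution over actions, start uniform
    for _ in range(iters):
        q = p @ P                                # marginal distribution over outcomes
        with np.errstate(divide="ignore", invalid="ignore"):
            d = np.where(P > 0, P * np.log(P / q), 0.0).sum(axis=1)   # D(P[a,:] || q)
        p = p * np.exp(d)
        p /= p.sum()
    q = p @ P
    with np.errstate(divide="ignore", invalid="ignore"):
        d = np.where(P > 0, P * np.log(P / q), 0.0).sum(axis=1)
    return float(p @ d)                          # capacity in nats

# Three action sequences, three outcome states; the first two actions have
# nearly identical effects, so empowerment is well below log(3) ≈ 1.1 nats.
P = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9]])
print(empowerment(P))   # roughly 0.6 nats
```

The more nuanced, planning-ability-aware versions mentioned above would replace the exhaustive channel matrix with something the agent can actually compute and achieve, but the basic quantity stays utility-free.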
An agent can have multiple goals which come to the fore under complicated conditions, an agent can even want contrary things at different times. We should look for thought-experiments which test whether empowerment really does resolve such conflicts in an acceptable way.
Yeah, a human brain clearly consists of sub-modules which one could consider sub-agents to some degree. For example, the decision to splurge a few hundred dollars on an expensive meal is largely a tradeoff between immediate hedonic utility and long-term optionality, and it does seem to be implemented as two neural sub-populations competitively ‘bidding’ for the different decisions, arbitrated in the basal ganglia.
Empowerment always favors the long-term optionality. So it’s clearly not a fully general, tight approximation of human values in practice, but it is a reasonable approximation of the long-term component, which seems to be where most of the difficulty in value learning lies.
External empowerment is the first/only reasonably simple, theoretically computable utility function that seems to not only keep humans alive, but would also plausibly lead its optimizer to step down and hand over control to posthumans (with the key caveat that it may want to change/influence posthuman designs in ways we would dislike).