AUP avoided side effects by penalizing shifts in the ability to achieve randomly generated goals.
Does this correspond to making the agent preserve general optionality (in the more colloquial sense, in case it is a term of art here)?
Does that mean that some specification of random goals would serve as an approximation of optionality?
It occurs to me that preserving the ability to pursue randomly generated goals doesn’t necessarily preserve the ability of other agents to pursue their goals. If I recall, that is roughly the theme of the instrumental power paper; as a concrete example of how the two would combine, it feels like:
Add value to get money to advance goal X.
Don’t destroy your ability to get money to advance goal X a little faster, in case you want to pursue randomly generated goal Y.
This preserves the ability to pursue goal Y (or Z, A, B, ...), but it does not imply that other agents should be allowed to add value and get money.
How closely does this map, I wonder? It feels like including other agents in the randomly generated goals somehow would help, but that only covers the agents themselves and not the agents’ goals.
Does a tuple of [goal(preserve agent), goal(preserve object of agent’s goal)] do a good job of preserving the other agent’s ability to pursue that goal? Can that be generalized? (One possible encoding is sketched below.)

...now to take a crack at the paper.
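For concreteness, a minimal sketch of one way that tuple might be encoded (purely illustrative; the state representation and helper names below are hypothetical, not from the paper): each other agent contributes a pair of auxiliary reward functions, one rewarding that agent’s continued existence and one rewarding the survival of the object its goal is about, and these pairs get mixed into the randomly generated goal distribution.

```python
from typing import Callable, Dict, List, Set

# Hypothetical state: which agents and goal-relevant objects still exist.
State = Dict[str, Set[str]]          # e.g. {"agents": {...}, "objects": {...}}
AuxGoal = Callable[[State], float]   # an auxiliary reward function

def preserve_agent(agent_id: str) -> AuxGoal:
    # Reward 1 while the other agent still exists.
    return lambda s: float(agent_id in s["agents"])

def preserve_goal_object(obj_id: str) -> AuxGoal:
    # Reward 1 while the object of that agent's goal still exists.
    return lambda s: float(obj_id in s["objects"])

def goal_pair(agent_id: str, obj_id: str) -> List[AuxGoal]:
    # The proposed tuple, generalized to any (agent, goal-object) pair;
    # these would be added to the randomly generated auxiliary goals.
    return [preserve_agent(agent_id), preserve_goal_object(obj_id)]
```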
Does this correspond to making the agent preserve general optionality (in the more colloquial sense, in case it is a term of art here)?
I think that intuitively, preserving value for a high-entropy distribution over reward functions should indeed look like preserving optionality. This assumes away a lot of the messiness that comes with deep non-tabular RL, however, and so I don’t have a theorem linking the two yet.
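To make that intuition concrete, here is a toy sketch (an illustration, not code from the paper): sample a high-entropy distribution of reward functions over a small tabular MDP and treat the profile of attainable utilities, i.e. the optimal values under each sampled reward, as a rough proxy for “optionality” at a state. The MDP, the uniform reward distribution, and the `optionality` helper are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP: deterministic transitions chosen at random, for illustration only.
n_states, n_actions, gamma = 6, 3, 0.95
next_state = rng.integers(0, n_states, size=(n_states, n_actions))

def attainable_utility(reward, start, iters=200):
    """Optimal discounted return achievable from `start` under a state-based
    `reward`, computed by value iteration on the toy MDP above."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = reward[:, None] + gamma * V[next_state]  # shape (n_states, n_actions)
        V = Q.max(axis=1)
    return V[start]

# "High-entropy distribution over reward functions": i.i.d. uniform state rewards.
aux_rewards = [rng.uniform(0.0, 1.0, size=n_states) for _ in range(32)]

def optionality(state):
    # Profile of attainable utilities across the sampled auxiliary rewards;
    # losing options shows up as entries of this vector dropping.
    return np.array([attainable_utility(r, state) for r in aux_rewards])

drop = optionality(0) - optionality(3)  # e.g. compare states before/after an action
print("mean change in attainable utility:", drop.mean())
```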
Does that mean that some specification of random goals would serve as an approximation of optionality?
Yes, you’re basically letting reward functions vote on how “big of a deal” an action is, where “big of a deal” inherits the meaning established by the attainable utility theory of impact.
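A sketch of that voting step (this mirrors the form of the AUP penalty, comparing each auxiliary Q-value for the action against doing nothing, but the exact normalization and the names below are illustrative):

```python
import numpy as np

def impact_votes(q_aux, state, action, noop=0):
    """Each auxiliary reward function's Q-values cast a 'vote' on how big a
    deal `action` is: the absolute change in attainable utility relative to
    the no-op. q_aux has shape (n_aux, n_states, n_actions)."""
    return np.abs(q_aux[:, state, action] - q_aux[:, state, noop])

def shaped_reward(task_reward, q_aux, state, action, lam=0.1):
    # Task reward minus the scaled, aggregated impact vote (AUP-style shaping).
    return task_reward - lam * impact_votes(q_aux, state, action).mean()

# Tiny usage with made-up Q-values: 32 auxiliary reward functions, 6 states, 3 actions.
rng = np.random.default_rng(0)
q_aux = rng.uniform(size=(32, 6, 3))
print(shaped_reward(task_reward=1.0, q_aux=q_aux, state=0, action=2))
```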
It occurs to me that preserving the ability to pursue randomly generated goals doesn’t necessarily preserve the ability of other agents to pursue their goals.
Yup, that’s very much true. I see this as the motivation for corrigibility: if the agent preserves its own option value and freely lets us wield it to extend our own influence over the world, then that should look like preserving our option value.