Does this correspond to making the agent preserve general optionality (in the more colloquial sense, in case it is a term of art here)?
I think that intuitively, preserving value for a high-entropy distribution over reward functions should indeed look like preserving optionality. This assumes away a lot of the messiness that comes with deep non-tabular RL, however, and so I don’t have a theorem linking the two yet.
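To gesture at the shape such a theorem might take (just a sketch, writing $\mathcal{D}$ for the high-entropy distribution over reward functions and $V^*_R$ for the optimal value function of reward $R$), it would relate something like

$$
\underbrace{\mathbb{E}_{R \sim \mathcal{D}}\big[V^*_R(s)\big]}_{\text{expected attainable utility at } s}
\quad\text{to}\quad
\underbrace{\Pr_{R \sim \mathcal{D}}\big[V^*_R(s) \ge \tau\big]}_{\text{fraction of goals still achievable from } s}
$$

for some threshold $\tau$: the first quantity should be large roughly when many goals remain reachable from $s$, which is the colloquial sense of optionality.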
Does that mean that some specification of random goals would serve as an approximation of optionality?
Yes, you’re basically letting reward functions vote on how “big of a deal” an action is, where “big of a deal” inherits the meaning established by the attainable utility theory of impact.
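Here's a minimal sketch of that voting in a toy tabular setting (the Q-tables standing in for the sampled reward functions are made up here; in practice they'd be learned, and the actual penalty also gets scaled by a normalization term):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 5 states, 3 actions, action 0 is "do nothing".
N_STATES, N_ACTIONS, NOOP = 5, 3, 0

# Stand-ins for the attainable-utility Q-functions of sampled reward
# functions R_1..R_n; in practice these would be learned, not random.
aux_q_tables = [rng.uniform(size=(N_STATES, N_ACTIONS)) for _ in range(10)]

def impact_vote(aux_q_tables, state, action, noop=NOOP):
    """Each sampled reward function 'votes' on how big a deal the action is:
    its vote is how much the action shifts its attainable utility relative
    to doing nothing. The average vote is an AUP-style penalty term."""
    votes = [abs(q[state, action] - q[state, noop]) for q in aux_q_tables]
    return float(np.mean(votes))

print(impact_vote(aux_q_tables, state=2, action=1))
```

Each reward function's vote is the change in its attainable utility relative to doing nothing, so actions that close off (or unlock) lots of options get flagged as a big deal.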
It occurs to me that preserving the agent's own ability to pursue randomly generated goals doesn't necessarily preserve the ability of other agents to pursue their goals.
Yup, that’s very much true. I see this as the motivation for corrigibility: if the agent preserves its own option value and freely lets us wield it to extend our own influence over the world, then that should look like preserving our option value.