Quick paper review of Measuring Goal-Directedness from the causal incentives group.
tl;dr: goal-directedness of a policy wrt a utility function is measured by its minimum distance to one of the policies implied by the utility function, in the spirit of the intentional stance: that one should model a system as an agent insofar as doing so is useful.
Details
how is “policies implied by the utility function” operationalized? given an attainable expected utility value u, we define the set of policies that have maximum entropy (of the decision variable, given its parents in the causal Bayes net) among those policies that attain expected utility u.
then we union these sets over all achievable values of u to get the “wide set of maxent policies,” and define the goal-directedness of a policy π wrt a utility function U as the maximum, over elements of this set, of the negative cross-entropy between π and that element (equivalently, the minimum cross-entropy). (actually we get the same result if we restrict the maximization to just the maxent policies achieving the same expected utility as π.)
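spelled out in symbols (my notation, glossing over the paper’s exact normalization), the construction as I read it is

$$\Pi^{\mathrm{me}}_U \;=\; \bigcup_{u}\; \operatorname*{arg\,max}_{\pi' :\, \mathbb{E}_{\pi'}[U] = u} H_{\pi'}\!\left(D \mid \mathrm{Pa}_D\right), \qquad \mathrm{GD}_U(\pi) \;=\; \max_{\pi^{\mathrm{me}} \in \Pi^{\mathrm{me}}_U} \; -\,\mathrm{CE}\!\left(\pi, \pi^{\mathrm{me}}\right),$$

where the union ranges over the attainable expected utilities u, and the parenthetical above says the outer max can equivalently be restricted to the slice of maxent policies whose expected utility equals that of π.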
Intuition
intuitively, this is measuring: “how close is my policy π to being ‘deterministic,’ while ‘optimizing U at the competence level u(π)’ and not doing anything else ‘deliberately’?”
“close” / “deterministic” ~ a large negative cross-entropy means a small CE(π, π_maxent) = H(π) + KL(π || π_maxent), i.e. both π’s own entropy and its divergence from the maxent policy are small
“not doing anything else ‘deliberately’” ~ because we’re maximizing over maxent policies: the reference policy is maximally uninformative/uncertain, i.e. it doesn’t take any ‘deliberate’ (low-entropy) action, etc.
“at the competence level u(π)” ~ … under the constraint that it is identically competent to π
and you get the nice property of the measure being invariant to translation / scaling of U.
obviously so, because a policy is maxent among all policies achieving u on U iff that same policy is maxent among all policies achieving au+b on aU+b, so these two utilities have the same “wide set of maxent policies.”
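to make the intuition (and the invariance claim) concrete, here is a tiny numerical sketch of my reading of the construction, for a single decision with no parents; in that setting the maxent policies at interior utility levels are exactly Boltzmann policies, so sweeping the inverse temperature sweeps the “wide set.” this is my own toy reconstruction, not code from the paper:

```python
import numpy as np

# Toy sketch of the construction described above: one decision variable D with
# no parents and three actions, utility U(d) attached directly to the action.
# In this setting the maxent policy at each interior expected-utility level is
# exactly a Boltzmann policy pi_beta(d) proportional to exp(beta * U(d)),
# so sweeping beta (approximately) sweeps the "wide set of maxent policies".

U = np.array([0.0, 1.0, 3.0])            # utility of each of the three actions
betas = np.linspace(-30.0, 30.0, 20001)  # grid of inverse temperatures

def boltzmann(beta, utils):
    """Maxent policy at the expected-utility level picked out by beta."""
    logits = beta * utils
    logits = logits - logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def goal_directedness(pi, utils):
    """Max over the (approximate) maxent family of -CE(pi, pi_maxent)."""
    scores = []
    for beta in betas:
        q = boltzmann(beta, utils)
        scores.append(np.sum(pi * np.log(q)))  # -CE(pi, q) = sum_d pi(d) log q(d)
    best = int(np.argmax(scores))
    return scores[best], betas[best]

pi = np.array([0.1, 0.2, 0.7])           # some stochastic policy to evaluate

score, beta_star = goal_directedness(pi, U)
q_star = boltzmann(beta_star, U)

print("goal-directedness-style score:", score)
# the optimizing maxent policy sits at pi's own competence level:
print("E_pi[U]    =", float(pi @ U))
print("E_qstar[U] =", float(q_star @ U))
# invariance to translation / scaling of U (same score up to grid error):
print("rescaled U =", goal_directedness(pi, 2.5 * U + 7.0)[0])
```

the last two prints illustrate, respectively, the parenthetical in the Details section (the optimizing maxent policy sits at π’s own competence level) and the invariance to translation/scaling of U.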
Critiques
I find this measure problematic in many places, and am confused whether this is conceptually correct.
one claimed property is that the measure is maximized by a uniquely optimal / anti-optimal policy.
it’s interesting that this measure of goal-directedness isn’t simply an increasing function of u(π), and I think that makes sense: I want my measure of goal-directedness, when evaluated relative to human values, to return a large number for both an aligned ASI and a signflip ASI.
… except, going through the proof, one finds that this maximality property relies heavily on the “uniqueness” of the policy.
My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn’t clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Relatedly, the maximization effectively only ends up comparing π against maxent policies at its own competence level, which feels problematic.
another claimed property is that the measure is minimized by the uniformly random policy (this would’ve been a good property, but unless I’m mistaken the proof for the lower bound is incorrect, because negative cross-entropy is not bounded below.)
honestly, the maxent motivation isn’t super clear to me.
not causal. the reason you need causal interventions is that you want to rule out accidental agency/goal-directedness, like a rock that happens to be the perfect size to seal a water bottle: does your rock adapt when I intervene to change the size of the hole? Discovering Agents is excellent in this regard.
Thanks for the feedback!
Yeah, uniqueness definitely doesn’t always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you’re following the unique optimal policy for some utility function, that’s a lot of evidence for goal-directedness. If you’re following one of many optimal policies, that’s a bit less evidence—there’s a greater chance that it’s an accident. In the most extreme case (for the constant utility function) every policy is optimal—and we definitely don’t want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with utility greater than that of the uniform policy, and decreasing for policies with utility less than that of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzmann policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it’s the case.
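(To sketch why, in the single-decision case: writing q(d) = E[U | D = d], the maxent policy at an interior utility level is π_β(d) ∝ exp(β q(d)), and d/dβ E_{π_β}[U] = Var_{π_β}(q) ≥ 0, so expected utility is monotone in the inverse temperature β, running from anti-optimal at β → −∞ through uniform at β = 0 to optimal at β → +∞. The general causal setting presumably needs more care.)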
Thanks for this. The proof of the lower bound is indeed nonsense, but I think the proposition is still true. I’ve corrected it to this.
Reminds me a little bit of this idea from Vanessa Kosoy.
This link doesn’t work for me:
Thanks, it seems like the link got updated. Fixed!
Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there’s a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn’t fit in a concrete way right now, maybe there’s room to extend/modify things to combine things in a fruitful way? Any thoughts?
Here’s my current take, I wrote it as a separate shortform because it got too long. Thanks for prompting me to think about this :)