Behavioral Sufficient Statistics for Goal-Directedness

adamShimi11 Mar 2021 15:01 UTC

LW: 21 AF: 11

Goal-Directedness AI AI Risk Kolmogorov Complexity

Note: this is a new version—with a new title—of my recent post “A Behavioral Definition of Goal-Directedness”. Most of the formulas are the same, except for the triviality one that deals better with what I wanted; the point of this rewrite is to present the ideas in a perspective that makes sense. I’m not proposing a definition of goal-directedness, but just sufficient statistics on the complete behavior that make a behavioral study of goal-directedness more human-legible.

I also use this new version as a first experiment in another approach to feedback: this post includes a lot of questions asked through the elicit prediction feature. A lot. I definitely tried to overshoot the reasonable number to add, to compensate my tendency to never use them. But don’t worry: whether or not there were too many questions will be the subject of another question at the end!

Introduction

In a previous post, I argued for the study of goal-directedness in two steps:

Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.

Intuitively, understanding goal-directedness should mean knowing which questions to ask about the complete behavior of the system to determine its goal-directedness. Here the “complete” part is crucial; it simplifies the problem by removing the need to infer what the system will do based on limited behavior. Similarly, we don’t care about the tractability/computability of the questions asked; the point is to find what to look for, without worrying (yet) about how to get it.

(Precommitment before reading the rest of this post) Do you agree that a usable definition/understanding of goal-directedness will come from behavioral study first, and then studying the internal structure aspect? (1 = full disagreement, 99 = complete agreement)

Despite these simplifications, the behavioral approach still suffers from one massive problem: it’s not human-legible. We don’t know what to do with this mass of loosely structured information, and have slim hopes of finding the right angle or question by sheer luck.

This post addresses this problem: it proposes human-legible sufficient statistics on this complete behavior that should be enough to deconfuse and clarify most questions about goal-directedness. The next posts then use these statistics to explore a formal understanding of goal-directedness.

Eventually we will have to rely on internal structure; but knowing the property to derive/approximate beforehand should help quite a lot.

Thanks to Joe Collman and Michele Campolo for helpful feedback and discussion on this post.

Preliminaries

Let’s start with the formalisation of the environment. The interface is defined by the set $O$ of observations and the set $A$ of actions. We have a finite set $E n v s$ of environments, which are just finite deterministic POMDPs with no reward , using $O$ and $A$ for observations and actions, with a uniform distribution over initial states. For an $E \in E n v s$ , $S_{E}$ is the set of states of $E$ .

My sufficient statistics for goal-directedness actually extend to more reasonable settings (stochastic POMDPs and a general distribution over initial states) straightforwardly, but I start with the simpler deterministic case to get the intuitions right. On the other hand, the assumption that $E n v s$ is finite (although maybe intractably big) is kept through the post because it ensures without additional work the well-definedness of some expressions. There might be a way to extend the sufficient statistics to the countable case, but that’s beyond the scope of this post.

The system we study is given by a program $π$ that takes as inputs the successive observations and return the action taken. I use a program in place of a function from histories to actions because it hides the internal state (that I don’t use) while retaining the expressiveness of such a computable function. We can query the behavior of $π$ on any environment of $E n v s$ by giving an initial state and seeing what happens; we can also ask potentially uncomputable questions about this behavior (as long as they are well-defined).

Now, when we call a system goal-directed, we usually have a goal for it in mind. The subtlety about a behavioral definition is that we can’t just look inside the model to find the goal; we somehow have to infer goals from the behavior. This is made easier in the setting of this post because we have access to all the behavior and uncomputable procedures—but we still have to do it.

In fact, the sufficient statistics for goal-directedness of $π$ talk about all possible goals. More specifically, for each goal, I define a vector of numbers called focus, capturing how coherent the goal is with the behavior of $π$ .

The next section… focuses on defining and motivating this vector.

Focus of a goal

A goal is a a function from an environment $E \in E n v s$ to a subset of $S_{E}$ . That is, a goal gives for each environment the states to reach. This form is certainly limited; yet it captures enough intuitive goals to not be trivial. Another important constraint is that every goal $g$ considered satisfies $K (g) < K (π)$ , where K is the Kolmogorov complexity. What this means is that $g$ doesn’t just capture the states that $π$ end up in environment $E$ by simulating $π$ in $E$ ; if that was the case, then the smallest program implementing $g$ should be more complex than the smallest program implementing $π$ , and we forbid that.

(Percentage) What proportion of goals that we actually care about are of the type studied here?

For each goal $g$ , its focus for $π$ is a 4-tuple $(c o m p l_{g}, g e n_{g}, {eff}_{g}, e x p l_{g}) \in [0, 1]^{4}$ capturing important properties of $π$ and the goal. The last three correspond to the last three intuitions (without the far-sightedness) about goal-directedness from the literature review that we wrote with Joe Collman and Michele Campolo.

Complexity factor

This is just the Kolmogorov complexity of $g$ mapped into $[0, 1]$ : $c o m p l_{g} = 1 - \frac{1}{K (g)}$ . There’s not much more to say about it. It’s just useful to compare goal close or equal on the other factors, to reason about which one is more likely to emerge from training.

(Agreement) The complexity factor is useful for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

(Agreement) The complexity factor is necessary for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

Generalization factor

This first element of the focus, the generalization factor $g e n_{g}$ , captures how much $π$ reaches the goal $g$ over the environments of $E n v s$ . The formula is the following:

$g e n_{g} = m a x ⎛ ⎜ ⎝ 0, \frac{1}{| E n v s |} ⎛ ⎜ ⎝ \sum E \in E n v s \frac{\sum s \in r e a c h a b l e_{s} (E) g e n_{g}^{E} (π, s)}{| r e a c h a b l e_{g} (E) |} - t r i v i a l i t y_{g} (E) ⎞ ⎟ ⎠ ⎞ ⎟ ⎠$ ,
where $g e n_{g}^{E} (π, s) = {\begin{matrix} 1 & if π reaches g (E) starting at s 0 & otherwise \end{matrix}$ , $r e a c h a b l e_{g} (E)$ is the set of states of $S_{E}$ from which some goal state in $g (E)$ is reachable, and $t r i v i a l i t y_{g} (E) = \frac{1}{m i x_{g}^{E} (π, x)}$ such that $m i x_{g}^{E} (π, s)$ measures the time it takes for the random uniform policy to put $m i x_t h r e s h$ % of the probability mass on goal states. A bit more formally, if we start with a probability distribution over $S_{E}$ with $1$ on $s$ and $0$ everywhere else, and then update that probability distribution according to the random uniform policy and the environment, $m i x_{g}^{E} (π, x)$ captures the first time (if any) where the probability distribution puts more than $m i x_t h r e s h$ % of the probability mass on goal states. (It’s more involved than just “the probability that the random uniform policy reaches a goal state” because the simple version trivially goes to $1$ in a lot of simple and finite cases).

The intuition of the formula is straightforward: it’s the average generalization of $π$ for goal $g$ over $E n v s$ . The expression averaged is the indicator of whether $π$ reaches a goal state, minus the “triviality” of the goal (a measure of how difficult it is to reach a goal state). Thanks to this correction (for a good choice of $m i x_t h r e s h$ , which I don’t know how to make and motivate yet), trivial goals, like the one outputting $S_{E}$ for environment $E$ , don’t generalize well despite being trivially reachable.

(Research) What would be a good value of mix_thresh?

A high generalization means that $π$ reaches a goal state most of the time; a small one that it rarely does. In the former case it makes more sense to consider $g$ as a goal of the system.

(Agreement) The generalization factor is useful for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

(Agreement) The generalization factor is necessary for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

Efficiency factor

This second element of the focus, the efficiency factor ${eff}_{g}$ , captures how efficiently $π$ reaches the goal $g$ in the (environment,initial state) pairs. The formula is the following:
${eff}_{g} = \frac{1}{| {E \in E n v s | \exists s \in S_{E} : g e n_{g}^{E} (π, s) = 1} |} \sum \begin{matrix} E \in E n v s, \exists s \in S_{E} : g e n_{g}^{E} (π, s) = 1 \end{matrix} \frac{1}{| {s \in S_{E} | g e n_{g}^{E} (π, s) = 1} |} \sum \begin{matrix} s \in S_{E}, g e n_{g}^{E} (π, s) = 1 \end{matrix} {eff}_{g}^{E} (π, s)$ where ${eff}_{g}^{E} (π, s) = \frac{n b_s t e p s_o p t i_{g}}{n b_s t e p s_{π}}$ , the ratio between the number of steps taken by the optimal policy for $g$ to reach a goal state starting at $s$ , and the number of steps taken by $π$ to reach a goal state starting at $s$ .

It’s pretty straightforward; the only subtlety is that the so called optimal policy is the optimal policy for the reward (-1 for any non goal state, 0 for a goal state—and then the episode stops), and for all environments in $E n v s$ . Now, there might be multiple optimal policies (privileging different environments but getting the same expected return over $E n v s$ ). I’m fine with using the one that maximize $e f f_{g}$ . Doing so mean comparing $π$ with the optimal policy for $g$ that is most similar to it in some sense.

While the generalization factor captures in what proportion of environments does $π$ reach a goal state, the efficiency factor captures how fast $π$ does that compared to the optimal policy for $g$ .

(Agreement) The efficiency factor is useful for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

(Agreement) The efficiency factor is necessary for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

Explainability factor

This last element of the focus, the explainability factor ${expl}_{g}$ , captures how well explained $π$ is by assuming it is directed towards $g$ . The formula is the following:

$e x p l_{g} = max μ (\frac{1}{| E n v s |} \sum E \in E n v s \frac{1}{| S_{E} |} \sum s \in S_{E} p r e d_{g}^{E} (π, π_{g}, s))$

where $p r e d_{g}^{E} (π, μ, s) = \frac{1}{T} T \sum t = 0 \frac{max a q_{μ} (s_{t}, a) - q_{μ} (s_{t}, a c t i o n_{π})}{max a q_{μ} (s_{t}, a)}$ measures the average deviation of $π$ from the actions favored by the action-value function $q_{μ}$ of $μ$ .

There are many details and subtleties to unravel here.

(The policy $μ$ ) Another policy $μ$ is used in computing the prediction fitness for the goal $g$ . It is obtained by doing RL on the reward defined by $g$ (see the section on efficiency for more details); importantly, it’s not necessarily an optimal policy for this reward. It can be any policy that results from RL training (when you stop after some time), with the caveat that it must do $n$ times better than the random uniform policy (in terms of expected return). Here $n$ is a parameter that seems important (to not allow policies without any direction towards $g$ ), but I have no idea how to decide on its value.
The reason for the use of $μ$ instead of an optimal policy for the reward is that explainability by a “decent” policy moving towards the goal seems just as good to me.

(Agreement) Explainability must be computed using a broader set of RL policies than just the optimal ones (1 = full disagreement, 99 = complete agreement)

(The time limit $T$ ) This is used for normalizing the prediction using the goal. If $π$ actually reaches the goal (and thus ends the episode), then $T$ is just the length of this episode. If $π$ never finishes, then $T$ is some arbitrary time limit after which the episode ends anyway.

Explainability captures Daniel Dennett’s intentional stance through RL: it measures how well-explained $π$ is through some learned RL method on the goal. There’s one aspect that is missing though: the comparison with the best prediction of simple high-level models (Dennett’s design stance). This other sufficient statistics depends only on $π$ , so it’s not defined in this section.

(Agreement) The explainability factor is useful for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

(Agreement) The explainability factor is necessary for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

Complete Sufficient Statistics

The sufficient statistics on the behavior of $π$ for studying goal-directedness are then the focus for each goal satisfying the complexity constraint, as well as the explainability of $π$ in terms of simple gears-level model:

$e x p l_g e a r s = max K (d) < C (\frac{1}{| E n v s |} \sum E \in E n v s \frac{1}{| S_{E} |} \sum s \in S_{E} e r r o r_p r e d_{d}^{E} (π, s))$ , where $d$ is a program capturing a simple model of $π$ (the upper bound on its Kolmogorov complexity $C$ is assumed to be $<< K (π)$ ); $p r e d_{d}^{E} (π, s)$ is the proportion of time where the action taken by $π$ in its trajectory starting at $s$ is the action taken by $d$ .

When you compare it to the prediction using the goal, the prediction using a simple model looks very binary: either the action taken is the one $π$ takes or it’s completely wrong. We can rephrase this as saying that the actions are all equally far from each other.

This looks like a reasonable default assumption, but it doesn’t always hold. For example, if the actions are the angles by increments of 5 degrees, then getting the angle just next to the one $π$ took sure seems closer than taking the opposite one. That is, there is a notion of similarity for the action space. In these cases the definition should probably be altered to use that notion of similarity (and normalize it) instead of the binary default.

Now, given a goal $g$ , we can use $e x p l_g e a r s$ to compute the difference in explainability $e x p l_{g} - e x p l_g e a r s$ in $[- 1, 1]$ . Here being close to $- 1$ means being better explained by a simple model, being close to $1$ means being better explained by the goal $g$ , and being close to zero means that both are equally good (or bad) at predicting $π$ .

(Agreement) The difference in explainability is useful for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

(Agreement) The difference in explainability is necessary for studying goal-directedness behaviorally (1 = full disagreement, 99 = complete agreement)

What I hope for these sufficient statistics is that they can provide more than just a simple number telling you how goal directed you $π$ is. They also allow us to think more clearly about many questions related to goals, like the importance of explainability, convergent instrumental subgoals, and inner alignment. The next posts in this sequence explore these in more detail, but we can look quickly at one example (to be revisited later in the sequence) now.

(Agreement) Taken as a whole, these sufficient statistics contain all the information about the behavior useful for studying goal-directedness (1 = full disagreement, 99 = complete agreement)

The one true goal

In some settings, knowing if a clear goal exists, and what it is, matters. Maybe we’re worried about the focus on too narrow a goal, and what it entails. Can we adapt the core definition of goal-directedness to this application?

My current intuition is that this most representative goal should primarily depend on generalization. It should matter more because a goal with better generalization is a goal that points more often in the right direction. This leaves us two cases:

If there is a goal with a massive lead on generalization (something like 2x the second largest generalization), then I think we should go with that one.
If there isn’t, then we lack a clear representative goal.

What’s even more exciting is that framing it that way highlights important questions: what if there are two goals, both with far more generalization, and one with even more than the other? In the absence of a representative goal, are all goals the same or is there a relevant hierarchy?

All these questions, and more, will be solv… addressed at the very least in the following posts of the sequence.

Stochastic version

As promised, I’ll explain how to get from the deterministic case above to a more realistic stochastic one. The changes considered are making $π$ into a stochastic policy returning an element of $Δ A$ (a distribution over actions); the environments being stochastic POMDP with stochastic transition function and stochastic observation function (returning the observation for a given state); and there is a distribution of initial states for each environment.

Here are the changes necessary for the computation of each factor of the focus (no change necessary for complexity, as it just depends on the program itself):

(Generalization factor) $g e n_{g}^{E} (π)$ goes from an element of ${0, 1}$ to a the probability that $π$ eventually reaches a state in $g (E)$ , computed by extracting a distribution over histories from the distribution over initial states, the POMDP and the policy. Then we take the probability of getting a history that reaches a goal state.
(Efficiency factor) ${eff}_{g}^{E} (π)$ goes through similar changes, where the time taken becomes an expected value over the distributions on histories generated.
(Explainability factor) The prediction error computed now compares distributions at each step. But that’s doable with something like KL divergence (maybe we want something different if we allow distributions with 0 probability, which might make KL divergence… diverge)

(Agreement) The stochastic version captures the same information in a broader setting than the deterministic one (1 = full disagreement, 99 = complete agreement)

Conclusion

I proposed sufficient statistics over the complete behavior of a system encoding relationships with goals (about reaching some states). These properties are the complexity of the goal, the generalization towards the goal, the efficiency of the system when it generalizes, and how well it is explained by a goal-based model. There’s an additional sufficient statistic for the system in general, about how well it is explained by a simple gears-level model.

We can’t compute these directly for concrete systems, as they rely on the knowledge of the complete behavior, and ask many questions that might be uncomputable or at best intractable.

Nonetheless, I believe this is progress. Instead of arguing without common grounding, we can argue using these statistics. And any deeper understanding of goal-directedness we get will provide a guiding light to checking the goal-directedness of some actual AIs.

The next posts in this sequence will explore what can be done with these statistics

(Opinion after reading the post) Do you agree that a usable definition/understanding of goal-directedness will come from behavioral study first, and then studying the internal structure aspect? (1 = full disagreement, 99 = complete agreement)

(Agreement) If you've read the previous version, this one clarified my point a lot (1 = full disagreement, 99 = complete agreement)

Feedback about the number of questions/predictions in this post (50 = perfect number, towards 1= want less and towards 99 = want more)

What links here?

Traps of Formalization in Deconfusion by adamShimi (5 Aug 2021 22:40 UTC; 28 points)