This post proposes such a behavioral definition of goal-directedness. If it survives the artillery fire of feedback and criticism, it will provide a more formal grounding for goal-directedness, […]
I guess you are looking for critical comments. I’ll bite.
Technical comment on the above post
So if I understand this correctly, then exp_g is a metric of goal-directedness. However, I am somewhat puzzled because exp_g only measures directedness towards the single goal g.
But to get close to the concept of goal-directedness introduced by Rohin, don’t you then need to do an operation over all possible values of g?
More general comments on goal-directedness
Reading the earlier posts in this sequence and several of the linked articles, I see a whole bunch of problems.
I think you are being inspired by the Misspecified Goal Argument. From Rohin’s introductory post on goal-directedness:
“The Misspecified Goal Argument for AI Risk: Very intelligent AI systems will be able to make long-term plans in order to achieve their goals, and if their goals are even slightly misspecified then the AI system will become adversarial and work against us.”
Rohin then speculates that if we remove the ‘goal’ from the above argument, we can make the AI safer. He then comes up with a metric of ‘goal-directedness’ where an agent can have zero goal-directedness even though one can model it as a system that is maximizing a utility function. Also, in Rohin’s terminology, an agent gets safer if it is less goal-directed.
Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics: you are looking at observable behavior, not at agent internals.
Where things completely move off the main sequence is in Rohin’s next step in developing his intuitive notion of goal-directedness:
“This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.”
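In other words, and just to make sure I am reading it right, here is a rough sketch of that prediction recipe (the function and parameter names are mine, purely illustrative, not anything defined in the sequence or in this post):

```python
def predict_behavior(goal_achievement, candidate_behaviors, new_circumstance):
    """Predict what the agent does in a new circumstance by picking the candidate
    behavior that best achieves the hypothesized goal (my reading of Rohin's quote)."""
    return max(candidate_behaviors,
               key=lambda behavior: goal_achievement(behavior, new_circumstance))
```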
So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin’s intuition. But I can’t really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.
This all starts to look like a linguistic form of Goodharting: the meaning of the term ‘goal-directed’ collapses completely because too much pressure is placed on it for control purposes.
To state my own terminology preference: I am perfectly happy to call any possible AI agent a goal-directed agent. This is because people build AI agents to help them pursue some goals they have, which naturally makes these agents goal-directed. Identifying a sub-class of agents which we then call non-goal-directed looks like a pretty strange program to me, which can only cause confusion (and an artillery fire of feedback and criticism).
To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.
Is your idea that a lower number on a metric implies more safety? This seems to be Rohin’s original idea.
Are these metrics supposed to have any directly obvious correlation to safety, or to the particular failure scenario of ‘will become adversarial and work against us’ at all? If so, I am not seeing the correlation.
So if I understand this correctly, then exp_g is a metric of goal-directedness. However, I am somewhat puzzled because exp_g only measures directedness towards the single goal g.
But to get close to the concept of goal-directedness introduced by Rohin, don’t you then need to do an operation over all possible values of g?
Thanks for taking the time to give feedback!
That’s not what I had in mind, but it’s probably on me for not explaining it clearly enough.
First, for a fixed goal g, the whole focus matters. That is, we also care about gen_g and eff_g. I plan on writing a post defending why we need all three of them, but basically there are situations where using only one of them would make us order things weirdly.
You’re right that we need to consider all goals. That’s why the goal-directedness of the system π is defined as a function that sends each goal (satisfying the nice conditions) to a focus, the vector of these three numbers. So the goal-directedness of π contains the focus for every goal, and each focus captures the coherence of π with its goal.
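To make that structure concrete, here is a minimal sketch in Python (the names Focus, focus_of and goal_directedness are placeholders I am making up here, not notation from the post):

```python
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass(frozen=True)
class Focus:
    """The focus of a system for one goal g: the triple (exp_g, gen_g, eff_g)."""
    explainability: float   # exp_g: how well g explains the system's behavior
    generalization: float   # gen_g: how well the behavior generalizes towards g
    efficiency: float       # eff_g: how efficiently the system accomplishes g


def focus_of(system, goal) -> Focus:
    # Placeholder: computing exp_g, gen_g and eff_g is the subject of the post.
    raise NotImplementedError


def goal_directedness(system, goals) -> Dict[Hashable, Focus]:
    """The goal-directedness of `system`: its focus for every admissible goal."""
    return {g: focus_of(system, g) for g in goals}
```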
Rohin then speculates that if we remove the ‘goal’ from the above argument, we can make the AI safer. He then comes up with a metric of ‘goal-directedness’ where an agent can have zero goal-directedness even though one can model it as a system that is maximizing a utility function. Also, in Rohin’s terminology, an agent gets safer if it is less goal-directed.
This doesn’t feel like a good summary of what Rohin says in his sequence.
He says that many scenarios used to argue for AI risk implicitly assume systems following goals, and thus that building AIs without goals might make these scenarios go away. But he doesn’t say that new problems can’t emerge.
He doesn’t propose a metric of goal-directedness. He just argues that every system can be modeled as maximizing a utility function, and so this isn’t the way to differentiate goal-directed from non-goal-directed systems. The point of this argument is also to say that the reasons to believe that AGIs should maximize expected utility are not enough to say that such an AGI must necessarily be goal-directed.
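For intuition, here is the usual construction behind that argument, sketched in Python (my own toy illustration, not code from Rohin’s post):

```python
def trivial_utility(observed_behavior):
    """Given a system's behavior (a mapping from histories to actions), build a utility
    function over trajectories that the system maximizes by construction."""
    def utility(trajectory) -> float:
        # Reward exactly the trajectories consistent with what the system actually does.
        consistent = all(observed_behavior(history) == action
                         for history, action in trajectory)
        return 1.0 if consistent else 0.0
    return utility


# Even a table-driven agent maximizes such a utility function, which is why
# "maximizes a utility function" cannot by itself separate goal-directed
# from non-goal-directed systems.
table = {("ping",): "pong"}


def table_agent(history):
    return table.get(history, "noop")


u = trivial_utility(table_agent)
assert u(((("ping",), "pong"),)) == 1.0   # a trajectory the agent would produce
assert u(((("ping",), "quit"),)) == 0.0   # a trajectory it would never produce
```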
Rohin then proposes that intuitively, a table-driven agent is not goal-directed. I think you are not going there with your metrics: you are looking at observable behavior, not at agent internals.
Where things completely move off the main sequence is in Rohin’s next step in developing his intuitive notion of goal-directedness:
“This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal.”
So what I am reading here is that if an agent behaves more unpredictably off-distribution, it becomes less goal-directed in Rohin’s intuition. But I can’t really make sense of this anymore, as Rohin also associates less goal-directedness with more safety.
This all starts to look like a linguistic form of Goodharting: the meaning of the term ‘goal-directed’ collapses completely because too much pressure is placed on it for control purposes.
My previous answer mostly addresses this issue, but let’s spell it out: Rohin doesn’t say that non-goal-directed systems are automatically safe. What he defends is that:
Non-goal-directed (or low-goal-directed) systems wouldn’t be unsafe in many of the ways we study, because these failure modes depend on having a goal (convergent instrumental subgoals, for example).
Non-goal-directed competent agents are not a mathematical impossibility, even if every competent agent must maximize expected utility.
Since removing goal-directedness apparently gets rid of many big problems with aligning AI, and we don’t have an argument for why building a competent non-goal-directed system is impossible, we should look into non-goal-directed approaches.
Basically, the intuition of “less goal-directed means safer” makes sense when “safer” means “less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shut down”, not when it means “less probability that the AI takes an unexpected and counterproductive action”.
Another way to put it is that Rohin argues that removing goal-directedness (if possible) seems to remove many of the specific issues we worry about in AI Alignment—and leaves mostly the near-term “my automated car is running over people because it thinks they are parts of the road” kind of problems.
To bring this back to the post above, this leaves me wondering how the metrics you define above relate to safety, and how far along you are in your program of relating them to safety.
Is your idea that a lower number on a metric implies more safety? This seems to be Rohin’s original idea.
Are these metrics supposed to have any directly obvious correlation to safety, or to the particular failure scenario of ‘will become adversarial and work against us’ at all? If so, I am not seeing the correlation.
That’s a very good and fair question. My reason for not using a single metric is that I think the whole structure of focuses for many goals can tell us many important things (for safety) when looked at from different perspectives. That’s definitely something I’m working on, and I think I have nice links with explainability (and probably others coming). But to take an example from the post, it seems that a system with one goal that generalizes far better than any other is more at risk of the kind of safety problems Rohin related to goal-directedness.
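As a toy illustration of that last heuristic (the threshold and names are made up for the example, reusing the Focus sketch from above):

```python
def has_dominant_goal(goal_directedness: dict, margin: float = 2.0) -> bool:
    """Flag a system whose focus on one goal generalizes much better than on any other,
    i.e. the situation suggested above as closer to the risky kind of goal-directedness."""
    gens = sorted((focus.generalization for focus in goal_directedness.values()),
                  reverse=True)
    return len(gens) >= 2 and gens[0] > 0 and gens[0] >= margin * gens[1]
```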
This doesn’t feel like a good summary of what Rohin says in his sequence.
I was not trying to summarize the entire sequence, only my impressions of some things he said in its first post. Those impressions are that Rohin was developing his intuitive notion of goal-directedness in a very different direction from the one you have been taking, given the examples he provides.
Which would be fine, but it does lead to questions of how much your approach differs. My gut feeling is that the difference in directions might be much larger than can be expressed by the mere adjective ‘behavioral’.
On a more technical note, if your goal is to search for metrics related to “less probability that the AI steals all my money to buy hardware and goons to ensure that it can never be shut down”, then the metrics that have been most productive in my opinion are, first, ‘indifference’, in the sense where it is synonymous with ‘not having a control incentive’. Other very relevant metrics are ‘myopia’ or ‘short planning horizons’ (see for example here) and ‘power’ (see my discussion in the post Creating AGI Safety Interlocks).
(My paper counterfactual planning has a definition of ‘indifference’ which I designed to be more accessible than the ‘not having a control incentive’ definition, i.e. more accessible for people not familiar with Pearl’s math.)
None of the above metrics look very much like ‘non-goal-directedness’ to me, with the possible exception of myopia.