Rohin Shah comments on Grokking the Intentional Stance

Rohin Shah 9 Sep 2021 8:25 UTC
I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept “agency = intentional stance”, then you need to think “well, I guess AI risk wasn’t actually about agency”.
A fundamental part of the argument for AI risk is that an AI system will behave in a novel manner when it is deployed out in the world, and that this novel behavior then leads to our extinction. The obvious question: why should it behave in this novel manner? Typically, we say something like “because it will be agentic / be goal-directed with the wrong goal”.
If you then deconfuse agency as “its behavior is reliably predictable by the intentional strategy”, I then have the same question: “why is its behavior reliably predictable by the intentional strategy?” Sure, its behavior in the set of circumstances we’ve observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is “causes human extinction”?
Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.
Some previous discussion:
- Section “Our understanding of the behavior” in Intuitions about goal-directed behavior
- Discussion on this post
- jbkjr 16 Sep 2021 14:45 UTC
  
  > If you then deconfuse agency as “its behavior is reliably predictable by the intentional strategy”, I then have the same question: “why is its behavior reliably predictable by the intentional strategy?” Sure, its behavior in the set of circumstances we’ve observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is “causes human extinction”?
  >
  > Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.
  
  The requirement for its behavior being “reliably predictable” by the intentional strategy doesn’t necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system’s behavior to generalize out of distribution (OOD). Obviously, to build such a model that generalizes well, you’ll want it to mirror the actual causal dynamics producing the agent’s behavior as closely as possible, so you need to make further assumptions about the agent’s cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer the question of why an intentional stance model will generalize OOD; they do not replace the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we’re worried that AI systems will fail catastrophically in ways that look agentic and goal-directed… to us.
  
  You are correct that having only the intentional stance is insufficient to make the case for AI risk from “goal-directed” prosaic systems, but having it as the foundation of what we mean by “agent” clarifies what more is needed to make a sufficient case: what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?
  - Rohin Shah 17 Sep 2021 7:43 UTC
    Yeah, I agree with all of that.