Hmmm, it doesn’t seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you’re only using intuitions about human goal-directedness, then you’ll probably miss out on a bunch of important ideas. Whereas if you’re using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.
Meanwhile, I agree that the backward approach will fail if we try to find “the fundamental property that the forward approach is trying to formalise”. But this seems like bad philosophy. We shouldn’t expect there to be a formal or fundamental definition of agency, just like there’s no formal definition of tables or democracy (or knowledge, or morality, or any of the other complex concepts philosophers have spent centuries trying to formalise). Instead, the best way to understand complex concepts is often to treat them as a nebulous cluster of traits, analyse which traits it’s most useful to include and how they interact, and then do the same for each of the component traits. On this approach, identifying convergent instrumental goals is one valuable step in fleshing out agency; and another valuable step is saying “what cognition leads to the pursuit of convergent instrumental goals”; and another valuable step is saying “what ways of building minds lead to that cognition”; and once we understand all this stuff in detail, then we will have a very thorough understanding of agency. Note that even academic philosophy is steering towards this approach, under the heading of “conceptual engineering”.
So I count my approach as a backwards one, consisting of the following steps:
1. It’s possible to build AGIs which are dangerous in a way that intuitively involves something like “agency”.
2. Broadly speaking, the class of dangerous agentic AGIs has certain cognition in common, such as making long-term plans and pursuing convergent instrumental goals (many of which will also be shared by dangerous agentic humans).
3. By thinking about the cognition that agentic AGIs would need to carry out to be dangerous, we can identify some of the traits which contribute a lot to danger but contribute little to capabilities.
4. We can then try to design training processes which prevent some of those traits from arising.
(Another way of putting this: the backwards approach works when you use it to analyse concepts as being like network 1, not network 2.)
If you’re still keen to find a “fundamental property”, then it feels like you’ll need to address a bunch of issues in embedded agency.
Thanks for both your careful response and the pointer to Conceptual Engineering!
I believe I am usually thinking in terms of defining properties for their use, but it’s important to keep that in mind. The post on Conceptual Engineering led me to this follow-up interview, which contains a great formulation of my position:
Livengood: Yes. The best example I can give is work by Joseph Halpern, a computer scientist at Cornell. He’s got a couple really interesting books, one on knowledge, one on causation, and big parts of what he’s doing are informed by the long history of conceptual analysis. He’ll go through the puzzles, show a formalization, but then does a further thing, which philosophers need to take very seriously and should do more often. He says, look, I have this core idea, but to deploy it I need to know the problem domain. The shape of the problem domain may put additional constraints on the mathematical, precise version of the concept. I might need to tweak the core idea in a way that makes it look unusual, relative to ordinary language, so that it can excel in the problem domain. And you can see how he’s making use of this long history of case-based, conceptual analysis-friendly approach, and also the pragmatist twist: that you need to be thinking relative to a problem, you need to have a constraint which you can optimize for, and this tells you what it means to have a right or wrong answer to a question. It’s not so much free-form fitting of intuitions, built from ordinary language, but the solving of a specific problem.
So my take is that there is probably a core/basic concept of goal-directedness, which can be altered and fitted to different uses. What we actually want here is the version fitted to AI Alignment. So we could focus on that specific version from the beginning; yet I believe that looking for the core/basic version and then fitting it to the problem is more efficient. That might be one source of our disagreement.
(By the way, Joe Halpern is indeed awesome. I studied a lot of his work related to distributed systems, and it’s always the perfect intersection of a philosophical concept and problem with a computer science treatment and analysis.)
Hmmm, it doesn’t seem like these two approaches are actually that distinct. Consider: in the forward approach, which intuitions about goal-directedness are you using? If you’re only using intuitions about human goal-directedness, then you’ll probably miss out on a bunch of important ideas. Whereas if you’re using intuitions about extreme cases, like superintelligences, then this is not so different to the backwards approach.
I resolve the apparent paradox that you raise by saying that the intuitions are about the core/basic idea which is close to human goal-directedness; but that it should then be fitted and adapted to our specific application of AI Alignment.
Meanwhile, I agree that the backward approach will fail if we try to find “the fundamental property that the forward approach is trying to formalise”. But this seems like bad philosophy. We shouldn’t expect there to be a formal or fundamental definition of agency, just like there’s no formal definition of tables or democracy (or knowledge, or morality, or any of the other complex concepts philosophers have spent centuries trying to formalise). Instead, the best way to understand complex concepts is often to treat them as a nebulous cluster of traits, analyse which traits it’s most useful to include and how they interact, and then do the same for each of the component traits. On this approach, identifying convergent instrumental goals is one valuable step in fleshing out agency; and another valuable step is saying “what cognition leads to the pursuit of convergent instrumental goals”; and another valuable step is saying “what ways of building minds lead to that cognition”; and once we understand all this stuff in detail, then we will have a very thorough understanding of agency. Note that even academic philosophy is steering towards this approach, under the heading of “conceptual engineering”.
Agreed. My distinction between forward and backward had been feeling shakier by the day, and your point finally puts it out of its misery.
So I count my approach as a backwards one, consisting of the following steps:
1. It’s possible to build AGIs which are dangerous in a way that intuitively involves something like “agency”.
2. Broadly speaking, the class of dangerous agentic AGIs has certain cognition in common, such as making long-term plans and pursuing convergent instrumental goals (many of which will also be shared by dangerous agentic humans).
3. By thinking about the cognition that agentic AGIs would need to carry out to be dangerous, we can identify some of the traits which contribute a lot to danger but contribute little to capabilities.
4. We can then try to design training processes which prevent some of those traits from arising.
My take on your approach is that we’re still at step 3, and we don’t yet have a good enough understanding of those traits/properties to manage step 4. As for how to solve step 3, I reiterate that finding a core/basic version of goal-directedness and adapting it to the use case seems like the way to go to me.