Richard_Ngo comments on Literature Review on Goal-Directedness

Richard_Ngo 20 Jan 2021 16:26 UTC
LW: 4 AF: 2
Hmm, okay, I think there’s still some sort of disagreement here, but it doesn’t seem particularly important. I agree that my distinction doesn’t sufficiently capture the middle ground of interpretability analysis (although the intentional stance doesn’t make use of that, so I think my argument still applies against it).
- adamShimi 26 Jan 2021 9:51 UTC
  LW: 2 AF: 1
  I think the remaining disagreement is whether we should first find a definition of goal-directedness and then study how it appears through training (my position), or whether we should instead define goal-directedness according to the kind of training processes that generate similar properties and risks (what I take to be your position).
  Does that make sense to you?
  - Richard_Ngo 26 Jan 2021 14:35 UTC
    LW: 4 AF: 2
    Kinda, but I think both of these approaches are incomplete. In practice, finding a definition and studying examples of it need to be interwoven: you’ll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like Blockhead or the Chinese Room), it seems like what I’m proposing has aspects of both of the approaches you suggest.
    My guess is that it’s more productive to continue discussing this under my response to your other post, where I make this argument more comprehensively.