Hmm, okay, I think there’s still some sort of disagreement here, but it doesn’t seem particularly important. I agree that my distinction doesn’t sufficiently capture the middle ground of interpretability analysis (although the intentional stance doesn’t make use of that, so I think my argument still applies against it).
I think the remaining disagreement is whether we should first find a definition of goal-directedness and then study how it appears through training (my position), or whether we should instead define goal-directedness according to the kind of training processes that generate similar properties and risks (what I take to be your position).
Does that make sense to you?
Kinda, but I think both of these approaches are incomplete. In practice, finding a definition and studying examples of it need to be interwoven, and you’ll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like Blockhead or the Chinese room), it seems like what I’m proposing has aspects of both of the approaches you suggest.
My guess is that it’s more productive to continue discussing this on my response to your other post, where I make this argument in a more comprehensive way.