Really, the only issue with this definition for our purposes is that it focuses on how goal-directedness emerges instead of what it entails for a system. Hence it gives less of a handle for predicting the behavior of a system than, for example, Dennett’s intentional stance.
Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven’t observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.
I claim that the former is more important for our current purposes, for three reasons. Firstly, we don’t have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.
Secondly, because of the possibility of deceptive alignment, it doesn’t seem like focusing on observed behaviour is sufficient for analysing goal-directedness.
Thirdly, suppose that we build a system that’s goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn’t happen again.
Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven’t observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.
I actually don’t think we should make this distinction. It’s true that Dennett’s intentional stance falls in the first category, for example, but that’s not the reason why I’m interested in it. Explainability seems to me like a way to find a definition of goal-directedness that we can check through interpretability and verification, and which tells us something about the behavior of the system with regard to AI risk. Yet that doesn’t mean it only applies to the observed behavior of systems.
The biggest difference between your definition and the intuitions is that you focus on how goal-directedness appears through training. I agree that this is a fundamental problem; I just think that this is something we can only solve after having a definition of goal-directedness that we can check concretely in a system and that allows the prediction of behavior.
Firstly, we don’t have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.
As mentioned above, I think a definition of goal-directedness should allow us to predict what an AGI will broadly do based on its level of goal-directedness. Training for me is only relevant for understanding which levels of goal-directedness are possible or probable. That seems like the crux of the disagreement here.
Secondly, because of the possibility of deceptive alignment, it doesn’t seem like focusing on observed behaviour is sufficient for analysing goal-directedness.
I agree, but I definitely don’t think the intuitions are limiting themselves to the observed behavior. With a definition you can check through interpretability and verification, you might be able to steer clear of deception during training. That’s a use of (low) goal-directedness similar to the one Evan has in mind for myopia.
Thirdly, suppose that we build a system that’s goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn’t happen again.
For that one, understanding how goal-directedness emerges is definitely crucial.
Hmm, okay, I think there’s still some sort of disagreement here, but it doesn’t seem particularly important. I agree that my distinction doesn’t sufficiently capture the middle ground of interpretability analysis (although the intentional stance doesn’t make use of that, so I think my argument still applies against it).
I think the remaining disagreement is whether we should first find a definition of goal-directedness and then study how it appears through training (my position), or whether we should instead define goal-directedness according to the kinds of training processes that generate similar properties and risks (what I take to be your position).
Kinda, but I think both of these approaches are incomplete. In practice, finding a definition and studying examples of it need to be interwoven, and you’ll have a gradual process where you start with a tentative definition, identify examples and counterexamples, adjust the definition, and so on. And insofar as our examples should focus on things which are actually possible to build (rather than weird thought experiments like Blockhead or the Chinese room), it seems like what I’m proposing has aspects of both of the approaches you suggest.
My guess is that it’s more productive to continue discussing this on my response to your other post, where I make this argument in a more comprehensive way.
Does that make sense to you?