Notice that both questions above are about predicting properties of the system based on its goal-directedness. These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn’t, then two systems with the same behavior (and thus the same safety and competitiveness properties) could have different goal-directedness, breaking the pattern of prediction.
I have two large objections to this.
First, the two questions considered are both questions about goal-directed AI. As I see it, the most important reason to think about goal-directedness is not that AI might be goal-directed, but that humans might be goal-directed. The whole point of alignment is to build AI which does what humans want; the entire concept of “what humans want” has goal-directedness built into it. We need a model in which it makes sense for humans to want things, in order to even formulate the question “will this AI do what humans want?”. That’s why goal-directedness matters.
If we think about goal-directedness in terms of figuring out what humans want, then it’s much less clear that it should be behaviorally defined.
Second, think about the implied logic in these two sentences:
These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system.
Here’s an analogous argument, to make the problem more obvious: I want to predict whether a system is foo based on whether it is bar. Foo-ness depends only on how big the system is, not on how red it is. Thus it makes sense to consider that bar-ness should also only depend on how big the system is, not on how red it is.
If I were to sketch out a causal graph for the implied model behind this argument, it would have an arrow/path Big-ness → Foo-ness, with no other inputs to foo-ness. The claim “therefore bar-ness should also depend only on how big the system is” effectively assumes that bar-ness is on the path between big-ness and foo-ness. Assuming bar-ness is on that path, it shouldn’t have a side input from red-ness, because then red-ness would be upstream of foo-ness. But that’s not the only possibility; in the goal-directedness case it makes more sense for bar-ness to be upstream of big-ness, i.e. goal-directedness determines behavior, not the other way around.
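To make the contrast concrete, here is a minimal sketch of the two causal structures as plain Python adjacency lists. This is my own illustration, not part of the original argument; the node names are just the placeholders from the analogy above (big-ness ~ behavior, red-ness ~ internal structure, bar-ness ~ goal-directedness, foo-ness ~ the predicted properties).

```python
# Two toy causal graphs, written as adjacency lists (parent -> children).

# Structure implicitly assumed by the quoted argument: bar-ness sits on the
# path from big-ness to foo-ness, and red-ness feeds into nothing.
assumed = {
    "big-ness": ["bar-ness"],
    "bar-ness": ["foo-ness"],
    "red-ness": [],
}

# Alternative structure: bar-ness is upstream of big-ness (goal-directedness
# causes behavior), so it can take input from red-ness (internal structure)
# while foo-ness still only depends on behavior directly.
alternative = {
    "red-ness": ["bar-ness"],
    "bar-ness": ["big-ness"],
    "big-ness": ["foo-ness"],
}

def upstream_of(graph, target):
    """Return every node with a directed path into `target`."""
    nodes = set(graph) | {child for children in graph.values() for child in children}
    parents = {n: [p for p in graph if n in graph[p]] for n in nodes}
    seen, stack = set(), list(parents[target])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen

print(upstream_of(assumed, "foo-ness"))      # {'big-ness', 'bar-ness'}: no red-ness
print(upstream_of(alternative, "foo-ness"))  # includes 'red-ness', but only via 'big-ness'
```

In the first graph, giving bar-ness a side input from red-ness would put red-ness upstream of foo-ness, contradicting the premise. In the second, red-ness can influence bar-ness and still only reach foo-ness through big-ness, so foo-ness remains a function of behavior alone.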
Anyway, moving on...
Actually, this assumes that our predictor is injective: it sends different “levels” of goal-directedness to different values of the properties. I agree with this intuition, given how much performance and safety issues seem to vary according to goal-directedness.
I disagree with this. See Alignment as Translation: goal-directedness is a sufficient condition for a misaligned AI to be dangerous, not a necessary condition. AI can be dangerous in exactly the same way as nukes: it can make big irreversible changes too quickly to stop. This relates to the previous objection as well: it’s the behavior that makes AI dangerous, and goal-directedness is one possible cause of dangerous behavior, not the only possible cause. Goal-directedness causes behavior, not vice-versa.
Overall, I’m quite open to the notion that goal-directedness must be defined behaviorally, but the arguments in this post do not lend any significant support to that notion.
Thanks for the comment, and sorry for taking so long to answer; I had my hands full with my application to the LTFF.
Except for your first one (which I go into below), I agree with all of your criticisms of my argument. I also realized that the position opposite mine is not that we care about something other than the behavior, but that specifying what matters in the behavior might require thinking about the system’s internals. I still disagree, but I don’t think I have conclusive arguments for that debate. The best I can do is try it and see if I fail.
About your first point:
First, the two questions considered are both questions about goal-directed AI. As I see it, the most important reason to think about goal-directedness is not that AI might be goal-directed, but that humans might be goal-directed. The whole point of alignment is to build AI which does what humans want; the entire concept of “what humans want” has goal-directedness built into it. We need a model in which it makes sense for humans to want things, in order to even formulate the question “will this AI do what humans want?”. That’s why goal-directedness matters.
Well, the questions I care about (and the ones Rohin asked) are actually about goal-directed AI: whether it must be goal-directed, and whether making it less goal-directed (or not goal-directed at all) improves its safety. So I’m clearly not considering “what humans want” first, even if it would be a nice consequence.
Yeah, I definitely see that you’re trying to do a useful thing here, and the fact that you’re not doing some other useful thing doesn’t make the current efforts any less useful.
That said, I would suggest that, if you’re thinking about a notion of “goal-directedness” which isn’t even intended to capture many of the things people often call “goal-directedness”, then finding a better name for the thing you want to formalize might be a useful step. It feels like what you’re trying to formalize is not actually goal-directedness per se, and pinning down what it is would likely be a big step toward finding the best way to formalize it and working out what properties it’s likely to have.
(Alternatively, if you really do want a general theory of goal-directedness, then I strongly recommend brainstorming many use-cases/examples and figuring out what unifies all of them.)
Drawing an analogy to my current work: if I want to formulate a general notion of abstraction, then that project is about making it work on as many abstraction-use-cases as possible. On the other hand, if I just want a formulation of abstraction to solve one or two particular problems, then a solution to that might not need to be a general formulation of abstraction—and figuring out what it does need to be would probably help me avoid the hard work of building a fully general theory.
You make a good point. Actually, I think I answered a bit too fast, maybe because I was on the defensive (given the content of your comment). We probably are trying to capture the intuitive notion of goal-directedness, in the sense that many of our examples, use-cases, intuitions, and counter-examples draw on humans.
What I reacted against is a focus solely on humans. I do think that goal-directedness should capture/explain humans, but I also believe that studying simpler settings/systems will provide many insights that would be lost in the complexity of humans. It’s in that sense that I think the bulk of the formalization/abstraction work should focus less on humans than you implied.
There is also the fact that we want to answer some of the questions goal-directedness raises for AI safety. So even if the complete picture is lacking, having a theory that captures this aspect would already be significant progress.