Thanks for the comment, and sorry for taking so long to answer; I had my hands full with the application for the LTFF.
Except for your first one (I go into that below), I agree with all your criticisms of my argument. I also realized that the position opposite to mine is not that we care about something other than the behavior, but that specifying what matters in the behavior might require thinking about the insides. I still disagree, but I don’t think I have conclusive arguments for that debate. The best I can do is try to do it and see if I fail.
About your first point:
First, the two questions considered are both questions about goal-directed AI. As I see it, the most important reason to think about goal-directedness is not that AI might be goal directed, but that humans might be goal directed. The whole point of alignment is to build AI which does what humans want; the entire concept of “what humans want” has goal directedness built into it. We need a model in which it makes sense for humans to want things, in order to even formulate the question “will this AI do what humans want?”. That’s why goal directedness matters.
Well, the questions I care about (and the ones Rohin asked) are actually about goal-directed AI. It’s about whether it must be goal-directed, and whether making it not/less goal-directed improves its safety. So I’m clearly not considering “what humans want” first, even if it would be a nice consequence.
Yeah, I definitely see that you’re trying to do a useful thing here, and the fact that you’re not doing some other useful thing doesn’t make the current efforts any less useful.
That said, I would suggest that, if you’re thinking about a notion of “goal-directedness” which isn’t even intended to capture many of the things people often call “goal-directedness”, then maybe finding a better name for the thing you want to formalize would be a useful step. It feels like the thing you’re trying to formalize is not actually goal-directedness per se, and figuring out what it is would likely be a big step forward in terms of figuring out the best ways to formalize it and what properties it’s likely to have.
(Alternatively, if you really do want a general theory of goal-directedness, then I strongly recommend brainstorming many use-cases/examples and figuring out what unifies all of them.)
Drawing an analogy to my current work: if I want to formulate a general notion of abstraction, then that project is about making it work on as many abstraction-use-cases as possible. On the other hand, if I just want a formulation of abstraction to solve one or two particular problems, then a solution to that might not need to be a general formulation of abstraction—and figuring out what it does need to be would probably help me avoid the hard work of building a fully general theory.
You make a good point. Actually, I think I answered a bit too fast, maybe because I was on the defensive (given the content of your comment). We probably are actually trying to capture the intuitive notion of goal-directedness, in the sense that many of our examples, use-cases, intuitions and counter-examples draw on humans.
What I reacted against is a focus solely on humans. I do think that goal-directedness should capture/explain humans, but I also believe that studying simpler settings/systems will provide many insights that would be lost in the complexity of humans. It’s in that sense that I think the bulk of the formalization/abstraction work should focus less on humans than you implied.
There is also the fact that we want to answer some of the questions raised by goal-directedness for AI safety. Thus, even if the complete picture is lacking, having a theory that captures this aspect would already be significant progress.