Well, the questions I care about (and the ones Rohin asked) are actually about goal-directed AI: whether it must be goal-directed, and whether making it not/less goal-directed improves its safety. So I'm clearly not considering "what humans want" first, even if it would be a nice consequence.
Yeah, I definitely see that you’re trying to do a useful thing here, and the fact that you’re not doing some other useful thing doesn’t make the current efforts any less useful.
That said, I would suggest that, if you’re thinking about a notion of “goal-directedness” which isn’t even intended to capture many of the things people often call “goal-directedness”, then maybe finding a better name for the thing you want to formalize would be a useful step. It feels like the thing you’re trying to formalize is not actually goal-directedness per se, and figuring out what it is would likely be a big step forward in terms of figuring out the best ways to formalize it and what properties it’s likely to have.
(Alternatively, if you really do want a general theory of goal-directedness, then I strongly recommend brainstorming many use-cases/examples and figuring out what unifies all of them.)
Drawing an analogy to my current work: if I want to formulate a general notion of abstraction, then that project is about making it work on as many abstraction-use-cases as possible. On the other hand, if I just want a formulation of abstraction to solve one or two particular problems, then a solution to that might not need to be a general formulation of abstraction—and figuring out what it does need to be would probably help me avoid the hard work of building a fully general theory.
You make a good point. Actually, I think I answered a bit too fast, maybe because I was on the defensive (given the content of your comment). We probably are actually trying to capture the intuitive goal-directedness, in the sense that many of our examples, use-cases, intuitions and counter-examples draw on humans.
What I reacted against is a focus solely on humans. I do think that goal-directedness should capture/explain humans, but I also believe that studying simpler settings/systems will provide many insights that would be lost in the complexity of humans. It's in that sense that I think the bulk of the formalization/abstraction work should focus less on humans than you implied.
We also want to answer some of the questions that goal-directedness raises for AI safety. So even if the complete picture is lacking, a theory capturing this aspect would already be significant progress.