My only critique so far is that I’m not really on board yet with your methodology of making desiderata by looking at what people seem to be saying in the literature. I’d prefer a methodology like “We are looking for a definition of goal-directedness such that the standard arguments about AI risk that invoke goal-directedness make sense. If there is no such definition, great! Those arguments are wrong then.”
I agree with you that the end goal of this research is to make sense of the arguments about AI risk that invoke goal-directedness, and of the proposed alternatives. The thing is, even if no such property exists, proving that there is no property making these arguments work looks extremely hard. I have very little hope that the question can be settled head-on one way or the other.
On the other hand, when people invoke goal-directedness, they seem to reference a cluster of similar concepts. And if we manage to formalize this cluster in a way that satisfies most people, then we can check whether these (now formal) concepts make the arguments for AI risk work. If they do, then problem solved. If the arguments fail with this definition, I still believe that is strong evidence for the arguments not working in general. You could say that I’m taking the bet that “the behavior of AI risk arguments with inputs from the cluster of intuitions in the literature is representative of the behavior of AI risk arguments with any definition of goal-directedness”. Rohin, for one, seems less convinced by this bet (for example with regard to the importance of explainability).
My personal prediction is that the arguments for AI risk do work for a definition of goal-directedness close to this cluster of concepts. My big uncertainty is what constitutes a non-goal-directed (or less goal-directed) system, and whether such systems are competitive with goal-directed ones.
(Note that I’m not saying that all the intuitions in the lit review should be part of a definition of goal-directedness; just that they probably need to be addressed, and that most of them capture an important aspect of the cluster.)
I also have a suggestion or naive question: Why isn’t the obvious/naive definition discussed here? The obvious/naive definition, at least to me, is something like:
“The paradigmatic goal-directed system has within it some explicit representation of a way the world could be in the future—the goal—and then the system’s behavior results from following some plan, which itself resulted from some internal reasoning process in which a range of plans are proposed and considered on the basis of how effective they seem to be at achieving the goal. When we say a system is goal-directed, we mean it is relevantly similar to the paradigmatic goal-directed system.”
I feel like this is how I (and probably everyone else?) thought about goal-directedness before attempting to theorize about it. Moreover, I feel like it’s a pretty good way to begin one’s theorizing, on independent grounds: it puts the emphasis on “relevantly similar” and thus raises the question “Why do we care? For what purpose are we asking whether X is goal-directed?”
Your definition looks like Dennett’s intentional stance to me. In the intentional stance, the “paradigmatic goal-directed system” is the purely rational system that tries to achieve its desires based on its beliefs, and whether something counts as an intentional system/goal-directed depends on how similar it is to this system in terms of prediction.
On the other hand, for most internal-structure-based definitions (like Richard’s, or mesa-optimizers), a goal-directed system is exactly a paradigmatic goal-directed system.
But I might have misunderstood your naive definition.
Thanks for the reply! I find your responses fairly satisfying. Now that you point it out, yeah, Dennett’s intentional stance is pretty similar to the naive definition. It’s a bit different because it’s committed to a particular kind of “relevantly similar”: if-we-think-of-it-as-this-then-we-can-predict-its-behavior-well. So it rules out, e.g., caring about the internal structure. As for the methodology thing, I would counter that there’s an even better way to provide evidence that there is no possible concept that would make the arguments work: actually try to find such a concept. If instead you are just trying to find the concept that, according to you, other people seem to be talking about, that seems like evidence, but not as strong. Or maybe it’s just as strong, but complementary: even if the concept you construct renders the arguments invalid, I for one probably won’t be fully satisfied until someone (maybe me) tries hard to make the arguments work.
Thanks for the feedback!
Glad my comment clarified some things.
About the methodology, I just published a post clarifying my thinking about it.