This theory of goal-directedness has the virtue of being closely tied to what we care about:
--If a system is goal-directed according to this definition, then (probably) it is the sort of thing that might behave as if it has convergent instrumental goals. It might, for example, deceive us and then turn on us later. If, on the other hand, a system is not goal-directed according to this definition, then absent further information we have no reason to expect those behaviors.
--Obviously we want to model things efficiently, so we are independently interested in what the most efficient way to model something is. This definition therefore doesn’t make us go far out of our way, computationally speaking: the question it asks is one we’d want to answer anyway.
On the other hand, I think this definition is not completely satisfying, because it doesn’t help much with the most important questions:
--Given a proposal for an AGI architecture, is it the sort of thing that might deceive us and then turn on us later? Your definition answers: “Well, is it the sort of thing that can be most efficiently modelled as an EU-maximizer? If yes, then yes; if no, then no.” The problem with this answer is that checking whether we can model the system as an EU-maximizer involves calculating out the system’s behavior and comparing it to what an EU-maximizer would do (or worse, to what a range of EU-maximizers with various relatively simple or salient utility and credence functions would do), and if we are doing that, we can probably just answer the will-it-deceive-us question directly. Alternatively, perhaps we could look at the structure of the system (the architecture) and say “see, this here is similar to the EU-max algorithm.” But if we are doing that, then again, maybe we don’t need this extra step in the middle; maybe we can jump straight from looking at the structure of the system to inferring whether or not it will act like it has convergent instrumental goals.
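To make that worry concrete, here is a rough sketch, in Python, of what the “can this system be modelled as an EU-maximizer?” check involves. This is my own toy construction, not anything from the post; all the names (system_act, candidate_utilities, and so on) are hypothetical, and the “most efficiently” part is dropped entirely. The point it illustrates is just that step 1, rolling out the system’s actual behavior, is already most of the work we would need to answer the will-it-deceive-us question directly.

```python
from itertools import product

def eu_best_action(situation, actions, outcomes, utility, credence):
    """Return the action with the highest expected utility under (utility, credence)."""
    return max(actions, key=lambda a: sum(
        credence(o, situation, a) * utility(o) for o in outcomes))

def fits_some_eu_model(system_act, situations, actions, outcomes,
                       candidate_utilities, candidate_credences):
    # Step 1: roll out what the system actually does in each situation.
    observed = {s: system_act(s) for s in situations}
    # Step 2: check whether any candidate (utility, credence) pair reproduces it.
    return any(
        all(observed[s] == eu_best_action(s, actions, outcomes, u, c)
            for s in situations)
        for u, c in product(candidate_utilities, candidate_credences))

# Toy usage: a "system" that always cooperates, checked against one
# hypothetical utility function and a uniform credence function.
if __name__ == "__main__":
    situations = ["observed", "unobserved"]
    actions = ["cooperate", "defect"]
    outcomes = ["reward", "punishment"]
    always_cooperate = lambda s: "cooperate"
    utilities = [lambda o: 1.0 if o == "reward" else 0.0]
    credences = [lambda o, s, a: 0.5]  # outcome probabilities, here uniform
    print(fits_some_eu_model(always_cooperate, situations, actions, outcomes,
                             utilities, credences))
```

The brute-force loop over candidate utility and credence functions is only there to mirror the “range of EU-maximizers with various relatively simple or salient utility and credence functions” phrasing above; it isn’t meant as a serious procedure.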