I’d say you passed my intellectual Turing test, but that seems like an understatement. More like… if you were a successor AI, I would be comfortable deferring to you on this topic. (Not literally true, but the analogy seems to convey something of the right spirit.) You fully understand my points and have made further novel observations about them; in particular, the analogy to the Sylow theorems is perfect, and you’re clearly asking the right questions.
Regarding instrumental convergence as a foundation for coherence theorems...
I touched on this a bit in this review of Coherent Decisions Imply Consistent Utilities. The main issue is that coherence theorems generally need some kind of “yardstick” to measure utility against, something which agents are assumed to generally want more of; the flavor text around the theorem usually calls it “money”. It need not be something that agents want as a terminal value, just something that we assume agents can always use more of in order to get more utility. We then recognize “incoherent decisions” by an agent “throwing away” the yardstick-resource unnecessarily—i.e. taking a path which expends strictly more of the resource than is necessary to reach the end-state.
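To make the "throwing away the yardstick" test concrete, here is a minimal toy sketch. It assumes states are nodes in a small graph, each trade between states has a fixed non-negative price in the yardstick-resource, and a decision sequence counts as incoherent exactly when it spends strictly more than the cheapest route to the same end-state. The names `TRADES`, `min_cost`, `path_cost` and `wasted_resource` are all hypothetical, made up for illustration; none of this comes from the coherence theorems themselves.

```python
import heapq

# Hypothetical toy world: TRADES[a][b] is the amount of yardstick-resource
# ("money") spent moving from state a to state b. Prices are assumed >= 0.
TRADES = {
    "A": {"B": 1, "C": 1},
    "B": {"C": 1},
    "C": {"A": 1},
}

def min_cost(graph, start, goal):
    """Cheapest total spend to get from start to goal (Dijkstra's algorithm)."""
    best = {start: 0}
    frontier = [(0, start)]
    while frontier:
        cost, state = heapq.heappop(frontier)
        if state == goal:
            return cost
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, price in graph.get(state, {}).items():
            new_cost = cost + price
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt))
    return None  # goal is unreachable from start

def path_cost(graph, path):
    """Total spend along the sequence of states the agent actually visited."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))

def wasted_resource(graph, path):
    """Spend beyond what was needed to reach the same end-state.

    A strictly positive value marks the path as 'incoherent' in the toy sense.
    """
    return path_cost(graph, path) - min_cost(graph, path[0], path[-1])

# A money-pump-style cycle: the agent ends where it started, 3 units poorer.
print(wasted_resource(TRADES, ["A", "B", "C", "A"]))
# A direct trade to the same end-state as the detour A -> B -> C wastes nothing.
print(wasted_resource(TRADES, ["A", "C"]))
```

Note that the sketch takes the resource as given. It says nothing about where the yardstick comes from, which is exactly the open question.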
But what if our universe doesn’t have some built-in, ontologically-basic yardstick against which to measure decision-coherence? How can we derive the yardstick from first principles?
That’s the question I think instrumental convergence could potentially answer. If broad classes of mind designs in a certain universe “want similar things” (as non-terminal goals), then those things might make a good yardstick. To give full force to this argument, we need to ground “want similar things” in a way which doesn’t talk about “wanting”, since we’re trying to derive utility from first principles. That’s where something like “nearby subsystems can only influence far away subsystems via <small set of variables>” comes in. That small set of variables acts like a natural yardstick to measure coherence of nearby decisions: throwing away control over those variables implies that the agent is strictly suboptimal for controlling (almost) anything far away. In some sense, it’s coherence of nearby decisions, as viewed from a distance.