This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it’s worth saying up-front what the post does well: the post proposes a basically-correct notion of “power” for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.
I see two (related) central problems, from which various other symptoms follow:
POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
Unstructured MDPs are a bad model in which to formulate instrumental convergence. In particular, they are bad for building a gears-level understanding of what features of the environment give rise to convergence.
Some things I’ve thought a lot about over the past year seem particularly well-suited to address these problems, so I have a fair bit to say about them.

Why Unstructured MDPs Are A Bad Model For Instrumental Convergence
The basic problem with unstructured MDPs is that the entire world-state is a single, monolithic object. Some symptoms of this problem:
it’s hard to talk about “resources”, which seem fairly central to instrumental convergence
it’s hard to talk about multiple agents competing for the same resources
it’s hard to talk about which parts of the world an agent controls/doesn’t control
it’s hard to talk about which parts of the world agents do/don’t care about
… indeed, it’s hard to talk about the world having “parts” at all
it’s hard to talk about agents not competing, since there’s only one monolithic world-state to control
any action which changes the world at all changes the entire world-state; there’s no built-in way to change a “small part” of the world
More generally, unstructured MDPs are problematic for most kinds of gears-level understanding: the point of gears is to talk about the structure of the world, and the point of unstructured MDPs is to use one black-box world-state without any internal structure. (The one exception to this is time-structure, which MDPs do have.)
My go-to model would instead be a circuit/Bayes net, with some decision nodes. There are alternatives to this, but it’s probably the most general option in which the world has structure/parts.
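To make that concrete, here’s a minimal toy sketch of the kind of structured model I have in mind: a handful of named variables, each updated by a local rule that reads only its listed parents, plus an action feeding into one of them as a decision node. Everything here (variable names, update rules) is purely illustrative, not a proposed formalism.

```python
# Minimal sketch of a factored world model: each variable has named parents
# and a local update rule, and an action (decision node) feeds into one part
# of the world. Variable names and rules are purely illustrative.

world = {  # variable -> (parents, local update rule)
    "ore":     (["ore", "mine_action"], lambda ore, act: ore - 1 if act == "spend" else ore),
    "factory": (["factory", "ore"],     lambda built, ore: built or ore > 0),
    "weather": (["weather"],            lambda w: w),  # a "part" the agent never touches
}

def step(state, action):
    """Advance each variable via its local rule; untouched parts stay put."""
    inputs = {**state, "mine_action": action}
    return {name: rule(*[inputs[p] for p in parents])
            for name, (parents, rule) in world.items()}

state = {"ore": 3, "factory": False, "weather": "sunny"}
print(step(state, "spend"))  # only ore/factory respond; weather is unchanged
```

In a model like this it’s easy to point at resources (“ore”), at which parts an agent does or doesn’t control, and at actions that change only a small part of the world, which is exactly what a single monolithic world-state makes awkward.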
What Would A Gearsier Model Of Instrumental Convergence Look Like?
Intuitive example: in a real-time strategy game, units and buildings and so forth can be created, destroyed, and generally moved around given sufficient time. Over long time scales, the main thing which matters to the world-state is resources—creating or destroying anything else costs resources. So, even though there’s a high-dimensional game-world, it’s mainly a few (low-dimensional) resource counts which impact the long term state space. Any agents hoping to control anything in the long term will therefore compete to control those few resources.
More generally: of all the many “nearby” variables an agent can control, only a handful (or summary) are relevant to anything “far away”. Any “nearby” agents trying to control things “far away” will therefore compete to control the same handful of variables.
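Here’s a toy rendering of that picture (again purely illustrative, not the actual formalism): many nearby variables, but the far-away outcome distribution reads only a low-dimensional resource summary of them.

```python
import random

# Toy sketch: the far-away outcome depends on the high-dimensional "nearby"
# state only through a low-dimensional summary f(nearby). Purely illustrative.

def summary(nearby):
    """f(X): the only information about the nearby state that reaches far away."""
    return nearby["minerals"] + nearby["gas"]  # one resource count

def far_future(nearby, rng):
    """Distribution over far-away outcomes, a function of summary(nearby) alone."""
    budget = summary(nearby)
    return {"bases_built": rng.randint(0, budget)}

a = {"minerals": 5, "gas": 3, "unit_positions": [(1, 2), (4, 4)]}
b = {"minerals": 6, "gas": 2, "unit_positions": [(9, 9)]}  # different details, same summary
print(far_future(a, random.Random(0)) == far_future(b, random.Random(0)))  # True
```

Two nearby states that differ in lots of detail but agree on the summary yield the same far-away prospects, so any agent whose goals live far away only has reason to fight over the summary.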
Main thing to notice: this intuition talks directly about a feature of the world—i.e. “far away” variables depending only on a handful of “nearby” variables. That, according to me, is the main feature which makes or breaks instrumental convergence in any given universe. We can talk about that feature entirely independent of agents or agency. Indeed, we could potentially use this intuition to derive agency, via some kind of coherence theorem; this notion of instrumental convergence is more fundamental than utility functions.
I would still expect something like this to agree with the POWER notion of instrumental convergence. But something along these lines would provide a more gears-level picture, to complement the more functional/black-boxy picture provided by POWER. Ideally, the two would turn out to fully agree, providing a strong characterization of instrumental convergence.
To quote Eliezer (who was originally talking to Benja Fallenstein; edits italicized):
Well, that’s a very intelligent review, John Wentworth. But I have a crushing reply to your review, such that, once I deliver it, you will at once give up further debate with me on this particular point: You’re right.
Here are some more thoughts.
POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
I agree, and I’d like to elaborate my take on the black-boxiness.
To me, these theorems (and the further basic MDP theorems I’ve developed but not yet made available) feel analogous to the Sylow theorems in group theory. Indeed, a CHAI researcher once remarked to me that my theorems seem to apply the spirit of abstract algebra to MDPs.
The Sylow theorems tell you that if you know the cardinality |G| of a group G, you can constrain its internal structure in useful ways, and sometimes even guarantee it has normal subgroups of given cardinalities. But maybe we don’t know |G|. What’s that world like, where we don’t have easy ways of knowing the group cardinality and deriving its prime factorization, but we still have the Sylow theorems?
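For reference, the statements I have in mind, paraphrased from memory (double-check a textbook):

```latex
% Sylow statements, paraphrased from memory; for a finite group $G$ with
% $|G| = p^{n} m$ and $p \nmid m$:
\begin{itemize}
  \item $G$ has a subgroup of order $p^{n}$ (a Sylow $p$-subgroup).
  \item Any two Sylow $p$-subgroups of $G$ are conjugate.
  \item The number $n_p$ of Sylow $p$-subgroups satisfies $n_p \mid m$ and
        $n_p \equiv 1 \pmod{p}$; in particular, $n_p = 1$ forces that
        subgroup to be normal.
\end{itemize}
```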
My theorems say that if you know certain summary information of the graphical properties of an MDP, you can conclude POWER-seeking. But maybe we don’t know that summary information, because we don’t know exactly what the MDP looks like. What’s that world like, where we don’t have easy ways of knowing the MDP model and deriving high-level graphical properties, but we still have the POWER-seeking theorems?
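To make the black-box quantity concrete, here is a rough sketch of the flavor of computation involved: average optimal value at a state across randomly drawn reward functions, in a made-up toy MDP. This is a paraphrase for intuition, not the post’s actual definition or normalization.

```python
import random

# Rough sketch of a POWER-flavored quantity: average optimal value at a state
# over randomly sampled reward functions, in a made-up deterministic toy MDP.
# This is for intuition only, not the post's actual definition/normalization.

TRANSITIONS = {  # state -> {action: next state}
    "start":    {"left": "dead_end", "right": "hub"},
    "hub":      {"wait": "hub", "a": "dead_end", "b": "start"},
    "dead_end": {"wait": "dead_end"},
}
GAMMA = 0.9

def optimal_value(reward, state, iters=200):
    """Plain value iteration for one fixed state-based reward function."""
    v = {s: 0.0 for s in TRANSITIONS}
    for _ in range(iters):
        v = {s: reward[s] + GAMMA * max(v[nxt] for nxt in TRANSITIONS[s].values())
             for s in TRANSITIONS}
    return v[state]

def power_like(state, samples=300, seed=0):
    """Average optimal value at `state` over uniformly random rewards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        reward = {s: rng.random() for s in TRANSITIONS}
        total += optimal_value(reward, state)
    return total / samples

print(power_like("hub"), power_like("dead_end"))  # more reachable options -> larger average
```

The sketch only illustrates why option-rich states come out on top under this kind of averaging; the theorems are about when that can be read off from high-level graphical properties of the environment.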
I think that you still want to know both sets of theorems, even though you might not have recourse to a constructive explanation for the actual (groups / MDPs) you care about understanding.
But you also care about the cardinalities, and you also care about what kinds of things will tend to be robustly instrumental, what kinds of things tend to give you POWER / “resources”, and I think that the kind of theory you propose could take an important step in that direction.
I also think that it’s aesthetically pleasing to have a notion of POWER-seeking which doesn’t depend on the state featurization, but only on the environmental dynamics; however, more granular theories probably should depend on that.
~~~~

In Abstraction, Evolution and Gears, you write:

if an agent’s goals do not explicitly involve things close to X, then the agent cares only about controlling f(X).
This, I think, is too strong: not only do some agents not care about [exact voltages on a CPU], some agents aren’t even incentivized to care about [the summary information f(X) of these voltages]. For example, an agent with a constant utility function cares neither about X nor about f(X), and I imagine there are less trivial utility functions which are indifferent to large classes of outcomes and share this property.
The main point here isn’t to nitpick the implication, but to shift the emphasis towards a direction I think might be productive (towards * below). So, I would say:
if an agent’s goals do not explicitly involve things close to X, then the agent cares only about controlling f(X), if it cares at all.
Crucially, X is of type thing-to-care-about, and f(X) is of type thing-to-condition-on. By definition, one cares about X itself for terminal reasons, but cares about f(X) for instrumental predictive benefit—because of what it implies about the other things one does care about.
Then you might wonder, in a given environment:
1. why “agents” are incentivized to develop “good abstractions”;
2. given f(X), what kinds of query classes tend to make f(X) a good abstraction;
   (There’s a corresponding question of “given variables X, what kinds of utility functions will terminally care about X?”, but this isn’t as relevant. This question also seems easier, since X is presumably in the domain of the utility functions.)
3. given an abstraction f(X), what kinds of goals will incentivize policies which care about controlling f(X).
* The answers to 2. and 3. would point to “caring about thing Y means you care about summary information Z, and if you care about summary information Z, you’ll tend to try to control features A, B, and C.” This could then not only say how many goals tend to seek POWER, but which kinds of goals seek which kinds of control and resources and flexible influence. *
We can talk about that feature entirely independent of agents or agency. Indeed, we could potentially use this intuition to derive agency, via some kind of coherence theorem; this notion of instrumental convergence is more fundamental than utility functions.
Can you expand on this? I’m mostly confused about what kind of agency might be derived, exactly.
I’d say you passed my intellectual Turing test, but that seems like an understatement. More like… if you were a successor AI, I would be comfortable deferring to you on this topic. (Not literally true, but the analogy seems to convey something of the right spirit.) You fully understand my points and have made further novel observations about them; in particular, the analogy to the Sylow theorems is perfect, and you’re clearly asking the right questions.
Regarding instrumental convergence as a foundation for coherence theorems...
I touched on this a bit in this review of Coherent Decisions Imply Consistent Utilities. The main issue is that coherence theorems generally need some kind of “yardstick” to measure utility against, something which agents are assumed to generally want more of; the flavor text around the theorem usually calls it “money”. It need not be something that agents want as a terminal value, just something that we assume agents can always use more of in order to get more utility. We then recognize “incoherent decisions” by an agent “throwing away” the yardstick-resource unnecessarily—i.e. taking a path which expends strictly more of the resource than is necessary to reach the end-state.
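As a toy illustration of that last criterion (my own made-up graph and numbers): flag a decision path as wasteful whenever it spends strictly more of the yardstick-resource than the cheapest way of reaching the same end-state.

```python
import heapq

# Toy sketch of the "throwing away the yardstick" test: a path is flagged if it
# spends strictly more resource than the minimum needed to reach the same
# end-state. Graph and costs are made up.

COSTS = {  # state -> {next state: resource spent on that move}
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}

def min_cost(start, goal):
    """Cheapest resource expenditure from start to goal (Dijkstra)."""
    frontier, seen = [(0, start)], set()
    while frontier:
        cost, s = heapq.heappop(frontier)
        if s == goal:
            return cost
        if s in seen:
            continue
        seen.add(s)
        for nxt, c in COSTS[s].items():
            heapq.heappush(frontier, (cost + c, nxt))
    return float("inf")

def wasted(path):
    """Resource thrown away relative to the cheapest route to the same end-state."""
    spent = sum(COSTS[a][b] for a, b in zip(path, path[1:]))
    return spent - min_cost(path[0], path[-1])

print(wasted(["A", "B", "C", "D"]))  # 0: no waste detected
print(wasted(["A", "C", "D"]))       # 2: strictly more resource than necessary
```

The same check works for any resource agents can always trade for utility; the question below is where such a resource comes from in the first place.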
But what if our universe doesn’t have some built-in, ontologically-basic yardstick against which to measure decision-coherence? How can we derive the yardstick from first principles?
That’s the question I think instrumental convergence could potentially answer. If broad classes of mind designs in a certain universe “want similar things” (as non-terminal goals), then those things might make a good yardstick. In order to give full force to this argument, we need to ground “want similar things” in a way which doesn’t talk about “wanting”, since we’re trying to derive utility from first principles. That’s where something like “nearby subsystems can only influence far away subsystems via <small set of variables>” comes in. That small set of variables acts like a natural yardstick to measure coherence of nearby decisions: throwing away control over those variables implies that the agent is strictly suboptimal for controlling (almost) anything far away. In some sense, it’s coherence of nearby decisions, as viewed from a distance.
Review of Review for 2019 Review

I’m going to reply more fully later (I’m taking a break right now), but I want to say: as an author, I always hope to receive reviews like this for my academic papers. Rare enough are reviewers who demonstrate a clear grasp of the work; rarer still are reviewers whose critiques are so well-placed that my gut reaction is excitement instead of defensiveness. I’ve received several such reviews this year, and I’ve appreciated every one.