This is going to be a somewhat-scattered summary of my own current understanding. My understanding of this question has evolved over time, and is therefore likely to continue evolving.
Classic Theorems
First, there’s all the classic coherence theorems—think Complete Class or Savage or Dutch books or any of the other arguments you’d find in the Stanford Encyclopedia of Philosophy. The general pattern of these is:
Assume some arguably-intuitively-reasonable properties of an agent’s decisions (think e.g. lack of circular preferences).
Show that these imply that the agent’s decisions maximize some expected utility function.
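As a concrete toy instance of the Dutch book flavor of argument: an agent with circular preferences can be money-pumped. The little simulation below is just my own illustration of that pattern (the items, fee, and lap count are made up), not a construction from any particular theorem.

```python
# Toy money-pump: an agent with circular preferences (prefers B over A, C over B,
# and A over C) will pay a small fee for each "upgrade" around the cycle,
# ending up where it started but strictly poorer.

def run_money_pump(trades, fee=0.01, laps=3):
    """trades: list of (give, get) pairs the agent is happy to pay `fee` for."""
    holding = trades[0][0]
    spent = 0.0
    for _ in range(laps):
        for give, get in trades:
            assert holding == give  # the agent currently holds `give`...
            holding = get           # ...and pays the fee to swap it for `get`
            spent += fee
    return holding, spent

cycle = [("A", "B"), ("B", "C"), ("C", "A")]  # made-up goods; only the cyclic structure matters
final_item, total_paid = run_money_pump(cycle)
print(final_item, round(total_paid, 2))  # back to "A", ~0.09 paid for nothing
```

Roughly speaking, the classic theorems run this kind of argument in reverse: if an agent’s decisions can’t be exploited like this, then they must be representable as maximizing some (expected) utility function.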
I would group objections to this sort of theorem into three broad classes:
1. Argue that some of the arguably-intuitively-reasonable properties are not actually necessary for powerful agents.
2. Be confused about something, and accidentally argue against something which is either not really what the theorem says or assumes a particular way of applying the theorem which is not the only way of applying the theorem.
3. Argue that all systems can be modeled as expected utility maximizers (i.e. just pick a utility function which is maximized by whatever the system in fact does) and therefore the theorems don’t say anything useful.
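To spell out the construction behind that third objection (this is just the standard triviality move, not a claim from any particular theorem): for any system whose actual behavior is $\pi^*$, define

$$u(\pi) \;=\; \begin{cases} 1 & \text{if } \pi = \pi^* \\ 0 & \text{otherwise,} \end{cases}$$

and the system trivially “maximizes” $u$.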
For an old answer to (3), see the discussion under my mini-essay comment on Coherent Decisions Imply Consistent Utilities. (We’ll also talk about (3) some more below.) Other than that particularly common confusion, there’s a whole variety of other confusions; a few common types include:
Only pay attention to the VNM theorem, which is relatively incomplete as coherence theorems go.
Attempt to rely on some notion of preferences which is not revealed preference.
Lose track of which things the theorems say an agent has utility and/or uncertainty over, i.e. what the inputs to the utility and/or probability functions are.
How To Talk About “Powerful Agents” Directly
While I think EJT’s arguments specifically are not quite right in a few ways, there is an importantly correct claim close to the one he makes: none of the classic coherence theorems say “powerful agent → EU maximizer (in a nontrivial sense)”. They instead say “<list of properties which are not obviously implied by powerful agency> → EU maximizer”. In order to even start to make a theorem of the form “powerful agent → EU maximizer (in a nontrivial sense)”, we’d first need a clean, intuitively-correct mathematical operationalization of what “powerful agent” even means.
Currently, the best method I know of for making the connection between “powerful agency” and utility maximization is in Utility Maximization = Description Length Minimization. There, the notion of “powerful agency” is tied to optimization, in the sense of pushing the world into a relatively small number of states. That, in turn, is equivalent (the post argues) to expected utility maximization. That said, that approach doesn’t explicitly talk about “an agent” at all; I see it less as a coherence theorem and more as a likely-useful piece of some future coherence theorem.
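As a rough sketch of that equivalence (my loose paraphrase of the linked post; the world-state $X$, model $M_2$, and choice variable $\pi$ are just illustrative notation):

$$\arg\max_{\pi}\ \mathbb{E}\big[u(X)\mid \pi\big] \;=\; \arg\min_{\pi}\ \mathbb{E}\big[-\log_2 P(X\mid M_2)\mid \pi\big] \qquad \text{when } u(X) = \log_2 P(X\mid M_2).$$

The right-hand side is the expected description length (in bits) of the world-state $X$ under an optimal code for the model $M_2$, so “pushing the world into a small set of states” (states which are cheap to describe under $M_2$) is the same operation as maximizing expected utility for a suitably chosen $u$. The post argues the correspondence also runs the other way for suitably bounded utility functions, roughly by taking $P(X\mid M_2) \propto 2^{u(X)}$.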
What would the rest of such a future coherence theorem look like? Here’s my current best guess:
We start from the idea of an agent optimizing stuff “far away” in spacetime. Coherence of Caches and Agents hints at why this is necessary: standard coherence constraints are only substantive when the utility/“reward” is not given for the immediate effects of local actions, but rather for some long-term outcome. Intuitively, coherence is inherently substantive for long-range optimizers, not myopic agents. (A toy illustration of this point is sketched right after this list.)
We invoke the Utility Maximization = Description Length Minimization equivalence to say that optimization of the far-away parts of the world will be equivalent to maximization of some utility function over the far-away parts of the world.
We then use arguments basically similar to those in Coherence of Caches and Agents, but generalized to operate over spacetime (rather than just states-over-time with no spatial structure) and to allow for uncertainty.
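As a toy illustration of the Coherence of Caches and Agents point above (my own sketch, not the construction from that post; the states, transitions, and rewards are made up): when “reward” is only granted for the final outcome, an agent’s cached values have to satisfy a Bellman-style consistency condition, and that is the substantive coherence constraint. If we were instead free to posit arbitrary immediate rewards for each local action, any cached values could be rationalized and the constraint would have no bite.

```python
# Toy check of a "cache coherence" constraint in a deterministic, finite setting.
# Reward is only granted at terminal states; a coherent value cache must then
# satisfy a Bellman-style consistency condition at every non-terminal state.

def cache_is_coherent(values, successors, terminal_reward, tol=1e-9):
    """values: dict state -> cached value
    successors: dict state -> list of successor states ([] if terminal)
    terminal_reward: dict terminal state -> reward granted at the end"""
    for state, succs in successors.items():
        if not succs:
            # Terminal state: cached value must equal the long-term reward actually received.
            if abs(values[state] - terminal_reward[state]) > tol:
                return False
        else:
            # Non-terminal state: cached value must equal the best achievable successor value.
            if abs(values[state] - max(values[s] for s in succs)) > tol:
                return False
    return True

# Made-up example: one choice at "start", reward only at the end.
successors = {"start": ["left", "right"], "left": [], "right": []}
terminal_reward = {"left": 0.0, "right": 1.0}

print(cache_is_coherent({"start": 1.0, "left": 0.0, "right": 1.0},
                        successors, terminal_reward))  # True: coherent cache
print(cache_is_coherent({"start": 0.3, "left": 0.0, "right": 1.0},
                        successors, terminal_reward))  # False: incoherent cache
```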
Pareto-Optimality/Dominated Strategies
There are various claims along the lines of “agent behaves like <X>, or else it’s executing a pareto-suboptimal/dominated strategy”.
Some of these are very easy to prove; here’s my favorite example. An agent has a fixed utility function and performs pareto-optimally on that utility function across multiple worlds (so “utility in each world” is the set of objectives). Then there’s a normal vector (or family of normal vectors) to the pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector’s components will all be nonnegative (because pareto surface), and the vector is defined only up to normalization, so we can interpret that normal vector as a probability distribution. That also makes sense intuitively: larger components of that vector (i.e. higher probabilities) indicate that the agent is “optimizing relatively harder” for utility in those worlds. This says nothing at all about how the agent will update, and we’d need another couple of sentences to argue that the agent maximizes expected utility under the distribution, but it does give the prototypical mental picture behind the “pareto-optimal → probabilities” idea.
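To put a bit of notation on that picture (a sketch only; the clean version needs e.g. a convex set of achievable payoff profiles, which allowing mixed strategies typically provides): suppose the achievable payoff profiles form a set $F \subset \mathbb{R}^W$, with one coordinate $u_w$ per world $w$, and the agent achieves a pareto-optimal point $u^* \in F$. The standard supporting-hyperplane/scalarization result then gives a nonzero $\lambda \ge 0$ with

$$\lambda \cdot u^* \;\ge\; \lambda \cdot u \quad \text{for all } u \in F.$$

Normalizing to $p_w := \lambda_w / \sum_{w'} \lambda_{w'}$ makes $p$ a probability distribution over worlds, and the inequality then says that the agent’s achieved point maximizes $\mathbb{E}_{w \sim p}[u_w]$ over everything it could have achieved, which is roughly the “another couple of sentences” alluded to above.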
The most fundamental and general problem with pareto-optimality-based claims is that “pareto-suboptimal” implies that we already had a set of quantitative objectives in mind (or in some cases a “measuring stick of utility”, like e.g. money). But then some people will say “ok, but what if a powerful agent just isn’t pareto-optimal with respect to any resources at all, for instance because it just produces craptons of resources and then uses them inefficiently?”.
(Aside: “‘pareto-suboptimal’ implies we already had a set of quantitative objectives in mind” is also usually the answer to claims that all systems can be represented as expected utility maximizers. Sure, any system can be represented as an expected utility maximizer which is pareto-optimal with respect to some made-up objectives/resources which we picked specifically for this system. That does not mean all systems are pareto-optimal with respect to money, or energy, or other resources which we actually care about. Or, if using Utility Maximization = Description Length Minimization to ground out the quantitative objectives: not all systems are pareto-optimal with respect to optimization of some stuff far away in the world. That’s where the nontrivial content of most coherence theorems comes from: the quantitative objectives with respect to which the agent is pareto-optimal need to be things we care about for some reason.)
Approximate Coherence
What if a powerful agent just isn’t pareto-optimal with respect to any resources or far-away optimization targets at all? Or: even if you do expect powerful agents to be pareto-optimal, presumably they will be approximately pareto-optimal, not exactly pareto-optimal. What can we say about coherence then?
To date, I know of no theorems saying anything at all about approximate coherence. That said, this looks more like a case of “nobody’s done the legwork yet” than “people tried and failed”. It’s on my to-do list.
My guess is that there’s a way to come at the problem with a thermodynamics-esque flavor, which would yield global bounds, for instance of roughly the form “in order for the system to apply n bits of optimization more than it could achieve with outputs independent of its inputs, it must observe at least m bits and approximate coherence to within m-n bits” (though to be clear I don’t yet know the right ways to operationalize all the parts of that sentence). The simplest version of a theorem of that form doesn’t work, but David and I have played with some variations and have some promising threads.