The actual result here looks right to me, but kinda surfaces a lot of my confusion about how people in this space use coherence theorems/reinforces my sense they get misused
You say:
This ties to a common criticism: that any system can be well-modeled as a utility maximizer, by simply choosing the utility function which rewards whatever the system in fact does. As far as I can tell, that criticism usually reflects ignorance of what coherence says
My sense of how this conversation goes is as follows:
“Utility maximisers are scary, and here are some theorems that show that anything sufficiently smart/rational (i.e. a superintelligence) will be a utility maximiser. That’s scary”
“Literally anything can be modelled as a utility maximiser. It’s not the case that literally everything is scary, so something’s wrong here”
“Well sure, you can model anything as a utility maximiser technically, but the resource w.r.t which it’s being optimal/the way its preferences are carving up state-space will be incredibly awkward/garbled/unnatural (in the extreme, they could just be utility-maximizing over entire universe-histories). But these are unnatural/trivial. If we add constraints over the kind of resources it’s caring about/kinds of outcomes it can have preferences over, we constrain the set of what can be a utility-maximiser a lot. And if we constrain it to smth like the set of resources that we think in terms of, the resulting set of possible utility-maximisers do look scary”
Does this seem accurate-ish? If so it feels like this last response is true but also kind of vacuously so, and kind of undercuts the scariness of the coherence theorems in the first place. As in, it seems much more plausible that a utility-maximiser drawn from this constrained set will be scary, but then where’s the argument we’re sampling from this subset when we make a superintelligence? It feels like there’s this weird motte-and-bailey going on where people flip-flop between the very unobjectionable “it’s representable as a utility-maximiser” implied by the theorems and “it’ll look like a utility-maximiser “internally”, or relative to some constrained set of possible resources s.t. it seems scary to us” which feels murky and un-argued for.
Also on the actual theorem you outline here—it looks right, but isn’t assuming utilities assigned to outcomes s.t. the agent is trying to maximise over them kind of begging most of the question that coherence theorems are after? i.e. the starting data is usually a set of preferences, with the actual work being proving that this along with some assumptions yields a utility function over outcomes. This also seems why you don’t have to use anything like dutch-book arguments etc as you point out—but only because you’ve kind of skipped over the step where they’re used
Well sure, you can model anything as a utility maximiser technically, but the resource w.r.t which it’s being optimal/the way its preferences are carving up state-space will be incredibly awkward/garbled/unnatural (in the extreme, they could just be utility-maximizing over entire universe-histories). But these are unnatural/trivial. If we add constraints over the kind of resources it’s caring about/kinds of outcomes it can have preferences over, we constrain the set of what can be a utility-maximiser a lot. And if we constrain it to smth like the set of resources that we think in terms of, the resulting set of possible utility-maximisers do look scary.
I would guess that response is memetically largely downstream of my own old take. It’s not wrong, and it’s pretty easy to argue that future systems will in fact behave efficiently with respect to the resources we care about: we’ll design/train the system to behave efficiently with respect to those resources precisely because we care about those resources and resource-usage is very legible/measurable. But over the past year or so I’ve moved away from that frame, and part of the point of this post is to emphasize the frame I usually use now instead.
In that new frame, here’s what I would say instead: “Well sure, you can model anything as a utility maximizer technically, but usually any utility function compatible with the system’s behavior is very myopic—it mostly just cares about some details of the world “close to” (in time/space) the system itself, and doesn’t involve much optimization pressure against most of the world. If a system is to apply much optimization pressure to parts of the world far away from itself—like e.g. make & execute long-term plans—then the system must be a(n approximate) utility maximizer in a much less trivial sense. It must behave like it’s maximizing a utility function specifically over stuff far away.”
(… actually that’s not a thing I’d say, because right from the start I would have said that I’m using utility maximization mainly because it makes it easy to illustrate various problems. Those problems usually remain even when we don’t assume utility maximization, they’re just a lot less legible without a mathematical framework. But, y’know, for purposes of this discussion...)
Also on the actual theorem you outline here—it looks right, but isn’t assuming utilities assigned to outcomes s.t. the agent is trying to maximise over them kind of begging most of the question that coherence theorems are after?
In my head, an important complement to this post is Utility Maximization = Description Length Minimization, which basically argues that “optimization” in the usual Flint/Yudkowsky sense is synonymous with optimizing some utility function over the part of the world being optimized. However, that post doesn’t involve an optimizer; it just talks about stuff “being optimized” in a way which may or may not involve a separate thing which “does the optimization”.
This post adds the optimizer to that picture. We start from utility maximization over some “far away” stuff, in order to express optimization occurring over that far away stuff. Then we can ask “but what’s being adjusted to do that optimization?”, i.e. in the problem $\max_x u(x)$, what’s $x$? And if $x$ is the “policy” of some system, such that the whole setup is an MDP, then we find that there’s a nontrivial sense in which the system can be or not be a (long-range) utility maximizer, i.e. an optimizer.
Thanks, I feel like I understand your perspective a bit better now.
Re: your “old” frame: I agree that the fact we’re training an AI to be useful from our perspective will certainly constrain its preferences a lot, such that it’ll look like it has preferences over resources we think in terms of/won’t just be representable as a maximally random utility function. I think there’s a huge step from that though to “it’s an optimizer with respect to those resources”, i.e. there are a lot of partial orderings you can put over states where it broadly has preference orderings we like w.r.t. resources without looking like a maximizer over those resources, and I don’t think that’s necessarily scary. I think some of this disagreement may be downstream of how much you think a superintelligence will “iron out wrinkles” like preference gaps internally though, which is another can of worms
Re: your new frame: I think I agree that looking like a long-term/distance planner is much scarier. Obviously implicitly assuming we’re restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa. But this is going round in circles a bit. Typing this out, I think the main crux here for me is what I said in the previous point: I think there’s too much of a leap from “looks like it has preferences over this resource and long-term plans” to “is a hardcore optimizer of said resource”. Maybe this is just a separate issue though, not sure I have any local disagreements here
Re: your last point, thanks, I don’t think I have a problem with this, I think I was just misunderstanding the intended scope of the post
Obviously implicitly assuming we’re restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa.
This part I think is false. The theorem in this post does not need any notion of resources, and neither does Utility Maximization = Description Length Minimization. We do need a notion of spacetime (in order to talk about stuff far away in space/time), but that’s a much weaker ontological assumption.
I think what I’m getting at is more general than specifically talking about resources, I’m more getting at the degree of freedom in the problem description that lets you frame anything as technically optimizing something at a distance i.e. in ‘Utility Maximization = Description Length Minimization’ you can take any system, find its long-term and long-distance effects on some other region of space-time, and find a coding-scheme where those particular states have the shortest descriptions. The description length of the universe will by construction get minimized. Obviously this just corresponds to one of those (to us) very unnatural-looking “utility functions” over universe-histories or w/e
If we’re first fixing the coding scheme then this seems to me to be equivalent to constraining the kinds of properties we’re allowing as viable targets of optimization
I guess one way of looking at it is I don’t think it makes sense to talk about a system as being an optimizer/not an optimizer intrinsically. It’s a property of a system relative to a coding scheme/set of interesting properties/resources, everything is an optimizer relative to some encoding scheme. And all of the actual, empirical scariness of AI comes from how close the encoding scheme that by-definition makes it an optimizer is to our native encoding scheme—as you point out they’ll probably have some overlap but I don’t think that itself is scary
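To make the "everything is an optimizer relative to some encoding scheme" point concrete, here is a minimal sketch. It assumes a toy world with four possible far-away states and uses Shannon code lengths (-log2 of model probability) as the description length; the names `tailored_model`, `FIXED` and the 0.9 probability mass are illustrative choices of mine, not anything from the linked post.

```python
import math

def code_lengths(model):
    """Shannon code lengths in bits: -log2 of each state's model probability."""
    return {x: -math.log2(p) for x, p in model.items()}

def tailored_model(states, observed, mass=0.9):
    """Hypothetical model picked after the fact to favour whatever was observed."""
    rest = (1.0 - mass) / (len(states) - 1)
    return {x: (mass if x == observed else rest) for x in states}

def shortest(model):
    """State with the shortest description under the given model."""
    lengths = code_lengths(model)
    return min(lengths, key=lengths.get)

STATES = ["a", "b", "c", "d"]

# If the coding scheme may be chosen after seeing the outcome, every outcome
# counts as "description length minimized".
always_optimizer = all(shortest(tailored_model(STATES, s)) == s for s in STATES)

# If the coding scheme is fixed in advance, only some outcomes qualify.
FIXED = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
optimizers_under_fixed = [s for s in STATES if shortest(FIXED) == s]

print(always_optimizer, optimizers_under_fixed)
```

Under these assumptions, the free choice of model makes the optimizer label vacuous, while fixing the model first is what gives it content. That is the same degree of freedom as choosing which properties count as viable targets of optimization.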
All possible encoding schemes / universal priors differ from each other by at most a finite prefix. You might think this doesn’t achieve much, since the length of the prefix can be in principle unbounded; but in practice, the length of the prefix (or rather, the prior itself) is constrained by a system’s physical implementation. There are some encoding schemes which neither you nor any other physical entity will ever be able to implement, and so for the purposes of description length minimization these are off the table. And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to “natural” versus “unnatural” optimization criteria.
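For reference, the standard result behind "differ from each other by at most a finite prefix" is the invariance theorem for Kolmogorov complexity. The notation below ($U$, $V$, $K_U$, $c_{U,V}$) is introduced here for illustration and does not appear in the comment above.

```latex
% Invariance theorem: for any two universal prefix machines U and V there is
% a constant c_{U,V}, independent of the string x, such that
\[
  \bigl| K_U(x) - K_V(x) \bigr| \;\le\; c_{U,V} \qquad \text{for all } x ,
\]
% where K_U(x) is the length of the shortest U-program that outputs x.
```

The constant $c_{U,V}$ is roughly the length of a program that lets one machine simulate the other. It is finite but has no universal bound, which is why the comment appeals to physical implementability to limit it in practice.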
“Utility maximisers are scary, and here are some theorems that show that anything sufficiently smart/rational (i.e. a superintelligence) will be a utility maximiser. That’s scary”
I would say “systems that act according to preferences about the state of the world in the distant future are scary”, and then that can hopefully lead to a productive and substantive discussion about whether people are likely to build such systems. (See e.g. here where I argue that someone is being too pessimistic about that, & section 1 here where I argue that someone else is being too optimistic.)
Thanks, I think that’s a good distinction—I guess I have like 3 issues if we roll with that though
I don’t think a system acting according to preferences over future states entails it is EV-maximising w.r.t. some property/resource of those future states. If it’s not doing the latter it seems like it’s not necessarily scary, and if it is then I think we’re back at the issue that we’re making an unjustified leap, this time from “it’s a utility maximizer + it has preferences over future-states” (i.e. having preferences over properties of future states is compatible w/ also having preferences over world-histories/all sorts of weird stuff)
It’s not clear to me that specifying “preferences over future states” actually restricts things much—if I have some preferences over the path I take through lotteries, then whether I take path A or path B to reach outcome X will show up as some difference in the final state, so it feels like we can cast a lot (Most? All?) types of preferences as “preferences over future states”. I think the implicit response here is that we’re categorizing future states by a subset of “interesting-to-us” properties, and the differences in future-states yielded by taking Path A or Path B don’t matter to us (in other words, implicitly whenever we talk about these kinds of preferences over states we’re taking some equivalence class over actual micro-states relative to some subset of properties). But then again I think the issue recurs that a system having preferences over future states w.r.t. this subset of properties is a stronger claim
I’m more and more convinced that, even if a system does have preferences over future-states in the scariest sense here, there’s not really an overriding normative force for it to update towards being a utility-maximiser. But I think this is maybe a kind of orthogonal issue about the force of exploitability arguments rather than coherence theorems here
I think you’ve said something along the lines of one or two of these points in your links, sorry! Not expecting this to be super novel to you, half just helpful for me to get my own thoughts down explicitly
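The move of recasting path preferences as preferences over future states can be sketched by putting the history into the state. This is a toy illustration under my own assumptions: `DIAMOND` is a hypothetical deterministic MDP with two routes to each end-state, and `avoids_b` is a made-up path preference.

```python
def unroll(graph, start):
    """History-augmented MDP: each state is the tuple of states visited so far."""
    tree = {}
    stack = [(start,)]
    while stack:
        path = stack.pop()
        tree[path] = [path + (nxt,) for nxt in graph[path[-1]]]
        stack.extend(tree[path])
    return tree

# Hypothetical toy MDP: end-states X and Y are each reachable via A or via B.
DIAMOND = {"S": ["A", "B"], "A": ["X", "Y"], "B": ["X", "Y"], "X": [], "Y": []}

unrolled = unroll(DIAMOND, "S")
end_states = sorted(p for p, nxt in unrolled.items() if not nxt)

def avoids_b(path):
    """Made-up path preference: utility 1 if the route never visits B, else 0."""
    return 0 if "B" in path else 1

# After unrolling, the path preference is an ordinary utility over end-states.
utilities = {p: avoids_b(p) for p in end_states}
print(utilities)
```

In the original graph the routes S→A→X and S→B→X end in the same state, so no utility over end-states can tell them apart. In the unrolled graph they are different end-states, which is exactly why "preferences over future states" restricts very little unless the states are coarse-grained by some subset of properties.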
It’s not clear to me that specifying “preferences over future states” actually restricts things much—if I have some preferences over the path I take through lotteries, then whether I take path A or path B to reach outcome X will show up as some difference in the final state, so it feels like we can cast a lot (Most? All?) types of preferences as “preferences over future states”.
In terms of the OP toy model, I think the OP omitted another condition under which the coherence theorem is trivial / doesn’t apply, which is that you always start the MDP in the same place and the MDP graph is a directed tree or directed forest. (i.e., there are no cycles even if you ignore the arrow-heads … I hope I’m getting the graph theory terminology right). In those cases, for any possible end-state, there’s at most one way to get from the start to the end-state; and conversely, for any possible path through the MDP, that’s the path that would result from wanting to get to that end-state. Therefore, you can rationalize any path through the MDP as the optimal way to get to whatever end-state it actually gets to. Right? (cc @johnswentworth @David Lorell)
OK, so what about the real world? The laws of physics are unitary, so it is technically true that if I have some non-distant-future-related preferences (e.g. “I prefer to never tell a lie”, “I prefer to never use my pinky finger”, etc.), this preference can be cast as some inscrutably complicated preference about the state of the world on January 1 2050, assuming omniscient knowledge of the state of the world right now and infinite computational power. For example, “a preference to never use my pinky finger starting right now” might be equivalent to something kinda like “On January 1 2050, IF {air molecule 9834705982347598 has speed between 34.2894583000000 and 34.2894583000001 AND air molecule 8934637823747621 has … [etc. for a googolplex more lines of text]”
This is kind of an irrelevant technicality, I think. The real world MDP in fact is full of (undirected) cycles—i.e. different ways to get to the same endpoint—…as far as anyone can measure it. For example, let’s say that I care about the state of a history ledger on January 1 2050. Then it’s possible for me to do whatever for 25 years … and then hack into the ledger and change it!
However, if the history ledger is completely unbreachable (haha), then I think we should say that this isn’t really a preference about the state of the world in the distant future, but rather an implementation method for making an agent with preferences about trajectories.
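The tree-versus-cycles point can be checked by brute force on toy graphs. The sketch below uses my own simplified formalization rather than the OP's exact theorem: the MDP is deterministic, a policy is "rationalizable" if some tie-free utility over end-states makes it optimal from every state, and choosing end-state x when y was also reachable counts as revealing x strictly preferred to y. (If ties are allowed, a constant utility rationalizes everything.) `TREE` and `DIAMOND` are hypothetical examples.

```python
from itertools import product

def reachable_terminals(graph, state):
    """End-states reachable from `state` (graph must have no directed cycles)."""
    if not graph[state]:
        return {state}
    return set().union(*(reachable_terminals(graph, n) for n in graph[state]))

def outcome(graph, policy, state):
    """Follow the policy from `state` until an end-state is reached."""
    while graph[state]:
        state = policy[state]
    return state

def has_cycle(edges):
    """Depth-first search for a directed cycle in a dict of adjacency sets."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def visit(u):
        color[u] = GREY
        for v in edges.get(u, ()):
            c = color.get(v, WHITE)
            if c == GREY or (c == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(edges))

def is_rationalizable(graph, policy):
    """True if the strict revealed preferences over end-states are acyclic."""
    prefers = {}
    for state, successors in graph.items():
        if successors:
            chosen = outcome(graph, policy, state)
            rejected = reachable_terminals(graph, state) - {chosen}
            prefers.setdefault(chosen, set()).update(rejected)
    return not has_cycle(prefers)

def count_rationalizable(graph):
    """Number of deterministic policies that pass the check above."""
    inner = [s for s in graph if graph[s]]
    policies = (dict(zip(inner, c)) for c in product(*(graph[s] for s in inner)))
    return sum(is_rationalizable(graph, p) for p in policies)

# Directed tree: exactly one route to each end-state.
TREE = {"S": ["A", "B"], "A": ["X", "Y"], "B": ["Z", "W"],
        "X": [], "Y": [], "Z": [], "W": []}
# Undirected cycle: X and Y are each reachable via A and via B.
DIAMOND = {"S": ["A", "B"], "A": ["X", "Y"], "B": ["X", "Y"], "X": [], "Y": []}

print(count_rationalizable(TREE), "of 8 tree policies are rationalizable")
print(count_rationalizable(DIAMOND), "of 8 diamond policies are rationalizable")
```

In the tree every policy passes, so the check says nothing. In the diamond, a policy that heads for X from A but for Y from B reveals both X over Y and Y over X, so it fails. Note that this check looks at choices from every state, which is stricter than the "always start in the same place" setup described above.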
In terms of the OP toy model, I think the OP omitted another condition under which the coherence theorem is trivial / doesn’t apply, which is that you always start the MDP in the same place and the MDP graph is a directed tree or directed forest. (i.e., there are no cycles even if you ignore the arrow-heads … I hope I’m getting the graph theory terminology right). In those cases, for any possible end-state, there’s at most one way to get from the start to the end-state; and conversely, for any possible path through the MDP, that’s the path that would result from wanting to get to that end-state. Therefore, you can rationalize any path through the MDP as the optimal way to get to whatever end-state it actually gets to. Right?
Technically correct.
I’d emphasize here that this toy theorem is assuming an MDP, which specifically means that the “agent” must be able to observe the entire state at every timestep. If you start thinking about low-level physics and microscopic reversibility, then the entire state is definitely not observable by real agents. In order to properly handle that sort of thing, we’d mostly need to add uncertainty, i.e. shift to POMDPs.
different ways to get to the same endpoint—…as far as anyone can measure it
I would say the territory has no cycles but any map of it does. You can have a butterfly effect where a small nudge is amplified to some measurable difference but you cannot predict the result of that measurement. So the agent’s revealed preferences can only be modeled as a graph where some states are reachable through multiple paths.
In addition to what Steve Byrnes has said, you should also read the answers and associated commentary on my recent question about coherence theorems and agentic behavior.