Well sure, you can model anything as a utility maximiser technically, but the resource w.r.t. which it’s being optimal/the way its preferences carve up state-space will be incredibly awkward/garbled/unnatural (in the extreme, it could just be utility-maximizing over entire universe-histories). But those representations are trivial. If we add constraints on the kinds of resources it cares about/the kinds of outcomes it can have preferences over, we constrain the set of possible utility-maximisers a lot. And if we constrain it to something like the set of resources that we think in terms of, the resulting set of possible utility-maximisers does look scary.
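(To spell out the trivial extreme with a quick sketch, just for illustration: for any deterministic system with policy $\pi$, define a utility over entire universe-histories by

$$ u_\pi(h) = \begin{cases} 1 & \text{if } h \text{ is the history the system actually produces under } \pi, \\ 0 & \text{otherwise,} \end{cases} $$

and the system is vacuously “maximizing” $u_\pi$. The utility function just memorizes the behaviour; it carves up state-space in a way nobody would naturally use.)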
I would guess that response is memetically largely downstream of my own old take. It’s not wrong, and it’s pretty easy to argue that future systems will in fact behave efficiently with respect to the resources we care about: we’ll design/train the system to behave efficiently with respect to those resources precisely because we care about those resources and resource-usage is very legible/measurable. But over the past year or so I’ve moved away from that frame, and part of the point of this post is to emphasize the frame I usually use now instead.
In that new frame, here’s what I would say instead: “Well sure, you can model anything as a utility maximizer technically, but usually any utility function compatible with the system’s behavior is very myopic—it mostly just cares about some details of the world “close to” (in time/space) the system itself, and doesn’t involve much optimization pressure against most of the world. If a system is to apply much optimization pressure to parts of the world far away from itself—like e.g. make & execute long-term plans—then the system must be a(n approximate) utility maximizer in a much less trivial sense. It must behave like it’s maximizing a utility function specifically over stuff far away.”
(… actually that’s not a thing I’d say, because right from the start I would have said that I’m using utility maximization mainly because it makes it easy to illustrate various problems. Those problems usually remain even when we don’t assume utility maximization, they’re just a lot less legible without a mathematical framework. But, y’know, for purposes of this discussion...)
Also, on the actual theorem you outline here—it looks right, but isn’t assuming that there are utilities assigned to outcomes, such that the agent is trying to maximise over them, kind of begging most of the question that coherence theorems are after?
In my head, an important complement to this post is Utility Maximization = Description Length Minimization, which basically argues that “optimization” in the usual Flint/Yudkowsky sense is synonymous with optimizing some utility function over the part of the world being optimized. However, that post doesn’t involve an optimizer; it just talks about stuff “being optimized” in a way which may or may not involve a separate thing which “does the optimization”.
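One compact way to see that correspondence (my own rendering here, not the post’s exact statement): if outcomes are encoded using a distribution $p(x) \propto 2^{u(x)}$, then

$$ -\log_2 p(x) = -u(x) + \log_2 Z, \qquad Z = \sum_{x'} 2^{u(x')}, $$

so minimizing expected description length under that code is the same as maximizing expected utility, up to the constant $\log_2 Z$.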
This post adds the optimizer to that picture. We start from utility maximization over some “far away” stuff, in order to express optimization occurring over that far away stuff. Then we can ask “but what’s being adjusted to do that optimization?”, i.e. in the problem $\max_x u(x)$, what’s $x$? And if $x$ is the “policy” of some system, such that the whole setup is an MDP, then we find that there’s a nontrivial sense in which the system can be or not be a (long-range) utility maximizer—i.e. an optimizer.
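A minimal toy sketch of that setup in code (my own illustration, not anything from the post; the states, horizon, and utility are all made up):

```python
# Toy sketch of the MDP framing (purely illustrative): the thing being
# adjusted, the x in max_x u(x), is the policy, while the utility u only
# looks at the state a few steps later, i.e. at stuff "far away" in time
# from each individual decision.
from itertools import product

STATES = range(4)      # positions 0..3 on a line; the system starts at 0
ACTIONS = (-1, +1)     # step left or right
HORIZON = 3            # steps taken before we evaluate the far-away state

def rollout(policy):
    """Deterministic dynamics; policy maps (state, time) -> action."""
    s = 0
    for t in range(HORIZON):
        s = min(max(s + policy[(s, t)], 0), len(STATES) - 1)
    return s

def u(final_state):
    """Utility over the far-away stuff only: being at the right end."""
    return float(final_state == len(STATES) - 1)

# Brute-force search over all policies (feasible only because everything is tiny).
keys = list(product(STATES, range(HORIZON)))
best_value = max(u(rollout(dict(zip(keys, choice))))
                 for choice in product(ACTIONS, repeat=len(keys)))
print("best achievable utility over the far-away state:", best_value)  # 1.0
```

The point is just that the optimization variable is the policy, while the utility is evaluated on the terminal state, so “being a long-range utility maximizer” is a fact about how the policy relates to far-away outcomes.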
Thanks, I feel like I understand your perspective a bit better now.
Re: your “old” frame: I agree that the fact we’re training an AI to be useful from our perspective will certainly constrain its preferences a lot, such that it’ll look like it has preferences over resources we think in terms of/won’t just be representable as a maximally random utility function. I think there’s a huge step from that, though, to “it’s an optimizer with respect to those resources”, i.e. there are a lot of partial orderings you can put over states where it broadly has preference orderings we like w.r.t. resources without looking like a maximizer over those resources, and I don’t think that’s necessarily scary. I think some of this disagreement may be downstream of how much you think a superintelligence will “iron out wrinkles” like preference gaps internally, though that’s another can of worms.
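To make the kind of partial ordering I have in mind concrete (a toy example, nothing more): over two resources $(m, t)$, take the Pareto ordering

$$ (m_1, t_1) \succeq (m_2, t_2) \iff m_1 \ge m_2 \ \text{and} \ t_1 \ge t_2. $$

It’s monotone in each resource we care about, but it’s incomplete (there’s a preference gap between, e.g., $(5,3)$ and $(3,5)$), so no single utility function over those resources represents it, and acting consistently with it needn’t look like hardcore maximization of either resource.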
Re: your new frame: I think I agree that looking like a long-term/distance planner is much scarier. Obviously implicitly assuming we’re restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa. But this is going round in circles a bit; typing this out, I think the main crux for me is what I said in the previous point: there’s too much of a leap from “looks like it has preferences over this resource and long-term plans” to “is a hardcore optimizer of said resource”. Maybe this is just a separate issue though; not sure I have any local disagreements here.
Re: your last point, thanks—I don’t think I have a problem with this; I think I was just misunderstanding the intended scope of the post.
Obviously implicitly assuming we’re restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa.
This part I think is false. The theorem in this post does not need any notion of resources, and neither does Utility Maximization = Description Length Minimization. We do need a notion of spacetime (in order to talk about stuff far away in space/time), but that’s a much weaker ontological assumption.
I think what I’m getting at is more general than specifically talking about resources. I’m more getting at the degree of freedom in the problem description that lets you frame anything as technically optimizing something at a distance, i.e. in ‘Utility Maximization = Description Length Minimization’ you can take any system, find its long-term and long-distance effects on some other region of space-time, and find a coding scheme where those particular states have the shortest descriptions. The description length of the universe will by construction get minimized. Obviously this just corresponds to one of those (to us) very unnatural-looking “utility functions” over universe-histories or whatever.
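Spelling that out with a toy sketch (purely illustrative; the states, probabilities, and function names are made up):

```python
# Toy illustration of the "rigged coding scheme" point: whatever far-away
# state a system actually produces, we can always pick a prior under which
# that exact state has the shortest code.
import math

FAR_AWAY_STATES = ["A", "B", "C", "D"]

def rigged_prior(observed_state, states, mass_on_observed=0.97):
    """A prior built *after* seeing the outcome, concentrating mass on it."""
    leftover = (1.0 - mass_on_observed) / (len(states) - 1)
    return {s: (mass_on_observed if s == observed_state else leftover)
            for s in states}

def description_length(state, prior):
    """Shannon code length in bits: -log2 p(state)."""
    return -math.log2(prior[state])

# Pretend the system's long-range effect on the far-away region was state "C".
observed = "C"
prior = rigged_prior(observed, FAR_AWAY_STATES)
for s in FAR_AWAY_STATES:
    print(s, round(description_length(s, prior), 2))
# By construction, "C" gets the shortest description length, so the system
# "minimizes description length" under this scheme no matter what it did.
```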
If we’re first fixing the coding scheme, then this seems to me to be equivalent to constraining the kinds of properties we’re allowing as viable targets of optimization.
I guess one way of looking at it is that I don’t think it makes sense to talk about a system as being an optimizer/not an optimizer intrinsically. It’s a property of a system relative to a coding scheme/set of interesting properties/resources; everything is an optimizer relative to some encoding scheme. And all of the actual, empirical scariness of AI comes from how close the encoding scheme that by definition makes it an optimizer is to our native encoding scheme—as you point out they’ll probably have some overlap, but I don’t think that itself is scary.
All possible encoding schemes / universal priors differ from each other by at most a finite prefix. You might think this doesn’t achieve much, since the length of the prefix can in principle be unbounded; but in practice, the length of the prefix (or rather, the prior itself) is constrained by a system’s physical implementation. There are some encoding schemes which neither you nor any other physical entity will ever be able to implement, and so for the purposes of description length minimization these are off the table. And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to “natural” versus “unnatural” optimization criteria.
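For reference, the formal fact this is leaning on (as I understand it) is the invariance theorem: for any two universal prefix machines $U$ and $V$ there is a constant $c_{U,V}$, depending only on the machines and not on the string being described, with

$$ |K_U(x) - K_V(x)| \le c_{U,V} \quad \text{for all } x, $$

so switching encoding schemes can only shift description lengths by a bounded amount.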