Jeremy Gillen answers Am I confused about the “malign universal prior” argument?

Jeremy Gillen 28 Aug 2024 14:59 UTC
15 points
0
(This reminded me of a couple of arguments I’ve had in the past, in person. I think you are missing something. But I’ve previously failed to communicate this thing. I hope I’m not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.)
For anything like the malignity argument to work, we need this kind of “gap” to exist—a gap between the power needed to actually use the UP (or the speed prior, or whatever), and the power needed to merely “understand them well enough for control purposes.”
Maybe such a gap is possible! It would be very interesting if so.
Such a gap is so common that I’m worried I’ve missed your point. There is ~always a range of algorithms that solve the same problem and have very different levels of efficiency. This is clearly true of algorithms that predict the next bit. You are correct that a malign hypothesis needs to be running an algorithm that is more efficient than the outer algorithm.
Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it’s difficult to imagine how you (just you) can reason about the whole search in such a way as to “steer it around” in your preferred direction.
I think this is easy to imagine. I’m an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative weaknesses of fellow experts. I can try to use this to manipulate the government into taking my opinion more seriously. I don’t need to create a clone government and hire 10 expert clones in order to do this.
If you are powerful enough to reason about the search (and do this well enough for steering), then in some sense the search is unnecessary—one could delete all the other elements of the search space, and just consult you about what the search might have done.
It’s true that if an induction algorithm is maximally compute efficient, then it shouldn’t have daemon problems. Because there is no way for a daemon to do better prediction than alternative hypotheses. But… actual algorithms that we build aren’t usually compute optimal, so there’s a risk they will find more compute efficient algorithms internally. (I’m not sure I understood what you’re saying here, so tell me if this is a non sequitur).
The argument is about the content of the “actual” UP, not the content of some computable approximation.

If the reasoning beings are considering—and trying to influence—some computable thing that isn’t the UP, we need to determine whether this thing has the right kind of relationship to the UP (whatever that means) for the influences upon it to “bubble up to” the UP itself.
You seem to be saying the behavior of a computable approximation is unrelated to the behavior of the idealization? Like, of course there are differences between reality and idealizations. Paul mentions this a few times in the original post, that the connection to real world algorithms and consequences isn’t clear. But I think you’re missing the whole point of doing theory work about idealized models. Theorizing about idealizations is easier (and possible). A common pattern for theorists is to work out the consequences of an idealized theory and then, to apply this theory to reality, approximately adjust for the most important differences between the theory and reality.
A good example is worst case runtime analysis. It’s very useful for predicting real world runtime. In some situations, it’s far too pessimistic (or optimistic!). But in those situations, there’s always a reason: some factor that the worst case analysis isn’t taking into account. And with some familiarity with the field, you get to know these factors, and how and when you can correct for them when transferring your knowledge to the real world.
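To make the runtime analogy concrete, here is a minimal sketch (not from the original discussion) that counts comparisons in a textbook quicksort with a first-element pivot. The function name `quicksort_comparisons`, the input size, and the seed are all illustrative assumptions. The worst-case bound is hit exactly on already-sorted input, while shuffled input does far better, and the "reason" for the gap is easy to name: the pivot rule interacts badly with sorted data.

```python
import random

def quicksort_comparisons(items):
    """Sort a copy of `items` with first-element-pivot quicksort.

    Returns (sorted_list, number_of_comparisons_against_a_pivot).
    """
    if len(items) <= 1:
        return list(items), 0
    pivot, rest = items[0], items[1:]
    smaller, larger = [], []
    for x in rest:  # exactly one comparison per remaining element
        (smaller if x < pivot else larger).append(x)
    left, c_left = quicksort_comparisons(smaller)
    right, c_right = quicksort_comparisons(larger)
    return left + [pivot] + right, len(rest) + c_left + c_right

n = 200  # hypothetical size, small enough to stay within Python's recursion limit
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

_, worst = quicksort_comparisons(already_sorted)
_, typical = quicksort_comparisons(shuffled)
print("sorted input:", worst, "comparisons; shuffled input:", typical)
```

On sorted input the count is exactly n(n-1)/2 = 19900, which is the worst case for this pivot rule. The shuffled count depends on the seed but lands far below that.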
Back to induction, specifically this line:
some computable thing that isn’t the UP, we need to determine whether this thing has the right kind of relationship to the UP
The reasoning beings inside the hypothesis are trying to make their computable approximation as similar to the UP as possible. Sure, they might make mistakes, or be limited in some way by what algorithms are possible. But, in our lack of knowledge about the details, we don’t have to get stuck in confusion about how exactly they might create an approximation. We can just assume they did a good job (as an idealization/approximation). This is the standard approximate way of predicting the consequences of competent agents. This is part of making an idealized theory.
If we later discover that all algorithms that are designed to predict the next bit in the real world have a particular property (i.e. are biased toward fast computations), then we can redo the malign prior theory in light of that knowledge. Maybe you think “biased toward fast computations” is obviously true of all induction algorithms? (it’s definitely not true, consider the runtime of theories in physics).
The “UP is malign” idea is the idea of optimization daemons applied directly to Solomonoff inductors. (Note the line: “When heavy optimization pressure on a system crystallizes it into an optimizer—especially one that’s powerful, or more powerful than the previous system, or misaligned with the previous system—we could term the crystallized optimizer a “daemon” of the previous system”).
If you wanted to see how optimization daemons could show up in more practical algorithms, you’ll probably end up at RFLO (although there are other situations where optimization daemons can show up that don’t quite fit the RFLO description).
- nostalgebraist 29 Aug 2024 19:35 UTC
  30 points
  10
  I hope I’m not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.
  I’m not sure you’re misinterpreting me per se, but there are some tacit premises in the background of my argument that you don’t seem to hold. Rather than responding point-by-point, I’ll just say some more stuff about where I’m coming from, and we’ll see if it clarifies things.
  You talk a lot about “idealized theories.” These can of course be useful. But not all idealizations are created equal. You have to actually check that your idealization is good enough, in the right ways, for the sorts of things you’re asking it to do.
  In physics and applied mathematics, one often finds oneself considering a system that looks like
  - some “base” system that’s well-understood and easy to analyze, plus
  - some additional nuance or dynamic that makes things much harder to analyze in general—but which we can safely assume has much smaller effects than the rest of the system
  We quantify the size of the additional nuance with a small parameter $ϵ$. If $ϵ$ is literally 0, that’s just the base system, but we want to go a step further: we want to understand what happens when the nuance is present, just very small. So, we do something like formulating the solution as a power series in $ϵ$, and truncating to first order. (This is perturbation theory, or more generally asymptotic analysis.)
  This sort of approximation gets better and better as $ϵ$ gets closer to 0, because this magnifies the difference in size between the truncated terms (of size $O(ϵ^2)$ and smaller) and the retained $O(ϵ)$ term. In some sense, we are studying the $ϵ \to 0$ limit.
  But we’re specifically interested in the behavior of the system given a nonzero, but arbitrarily small, value of $ϵ$. We want an approximation that works well if $ϵ = 10^{-6}$, and even better if $ϵ = 10^{-12}$, and so on. We don’t especially care about the literal $ϵ = 0$ case, except insofar as it sheds light on the very-small-but-nonzero behavior.
  Now, sometimes the $ϵ \to 0$ limit of the “very-small-but-nonzero behavior” simply is the $ϵ = 0$ behavior, the base system. That is, what you get at very small $ϵ$ looks just like the base system, plus some little $O(ϵ)$-sized wrinkle.
  But sometimes – in so-called “singular perturbation” problems – it doesn’t. Here the system has qualitatively different behavior from the base system given any nonzero $ϵ$, no matter how small.
  Typically what happens is that $ϵ$ ends up determining, not the magnitude of the deviation from the base system, but the “scale” of that deviation in space and/or time. So that in the limit, you get behavior with an $O(1)$-sized difference from the base system’s behavior that’s constrained to a tiny region of space and/or oscillating very quickly.
  Boundary layers in fluids are a classic example. Boundary layers are tiny pockets of distinct behavior, occurring only in small $ϵ$-sized regions and not in most of the fluid. But they make a big difference just by being present at all. Knowing that there’s a boundary layer around a human body, or an airplane wing, is crucial for predicting the thermal and mechanical interactions of those objects with the air around them, even though it takes up a tiny fraction of the available space, the rest of which is filled by the object and by non-boundary-layer air. (Meanwhile, the planetary boundary layer is tiny relative to the earth’s full atmosphere, but, uh, we live in it.)
  In the former case (“regular perturbation problems”), “idealized” reasoning about the $ϵ = 0$ case provides a reliable guide to the small-but-nonzero behavior. We want to go further, and understand the small-but-nonzero effects too, but we know they won’t make a qualitative difference.
  In the singular case, though, the $ϵ = 0$ “idealization” is qualitatively, catastrophically wrong. If you make an idealization that assumes away the possibility of boundary layers, then you’re going to be wrong about what happens in a fluid – even about the big, qualitative, $O(1)$ stuff.
  You need to know which kind of case you’re in. You need to know whether you’re assuming away irrelevant wrinkles, or whether you’re assuming away the mechanisms that determine the high-level, qualitative, $O(1)$ stuff.
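  A small numerical sketch of the singular case may help. The specific equation is my choice for illustration, not something from the comment: the textbook boundary-layer problem $ϵ y'' + y' = 0$ with $y(0) = 0$, $y(1) = 1$. Setting $ϵ = 0$ leaves $y' = 0$, so the "idealized" answer is $y = 1$ everywhere, which cannot satisfy $y(0) = 0$. The function names below are hypothetical.

  ```python
  import math

  def exact_solution(x, eps):
      """Exact solution of eps*y'' + y' = 0 with y(0) = 0, y(1) = 1, for eps > 0.

      Equal to (1 - exp(-x/eps)) / (1 - exp(-1/eps)), written with expm1 for accuracy.
      """
      return math.expm1(-x / eps) / math.expm1(-1.0 / eps)

  def idealized_solution(x):
      """The eps = 0 equation is y' = 0; with y(1) = 1 this gives y = 1 everywhere."""
      return 1.0

  for eps in (1e-2, 1e-4, 1e-6):
      gap_at_wall = abs(exact_solution(0.0, eps) - idealized_solution(0.0))
      gap_in_bulk = abs(exact_solution(0.5, eps) - idealized_solution(0.5))
      # Approximate x at which the exact solution has climbed to 0.99:
      # a rough proxy for the thickness of the boundary layer.
      layer_width = eps * math.log(100.0)
      print(f"eps={eps:g}  gap at x=0: {gap_at_wall:g}  gap at x=0.5: {gap_in_bulk:g}  layer width ~ {layer_width:g}")
  ```

  The gap at $x = 0$ is 1 for every $ϵ$ tried, while the region where the two solutions disagree shrinks in proportion to $ϵ$. So here $ϵ$ sets the scale of the deviation, not its magnitude.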
  Back to the situation at hand.
  In reality, TMs can only do computable stuff. But for simplicity, as an “idealization,” we are considering a model where we pretend they have a UP oracle, and can exactly compute the UP.
  We are justifying this by saying that the TMs will try to approximate the UP, and that this approximation will be very good. So, the approximation error is an $O(ϵ)$-sized “additional nuance” in the problem.
  Is this more like a regular perturbation problem, or more like a singular one? Singular, I think.
  The $ϵ = 0$ case, where the TMs can exactly compute the UP, is a problem involving self-reference. We have a UP containing TMs, which in turn contain the very same UP.
  Self-referential systems have a certain flavor, a certain “rigidity.” (I realize this is vague, sorry, I hope it’s clear enough what I mean.) If we have some possible behavior of the system $X$, most ways of modifying $X$ (even slightly) will not produce behaviors which are themselves possible. The effect of the modification as it “goes out” along the self-referential path has to precisely match the “incoming” difference that would be needed to cause exactly this modification in the first place.
  “Stable time loop”-style time travel in science fiction is an example of this; it’s difficult to write, in part because of this “rigidity.” (As I know from experience :)
  On the other hand, the situation with a small-but-nonzero $ϵ$ is quite different.
  With literal self-reference, one might say that “the loop only happens once”: we have to precisely match up the outgoing effects (“UP inside a TM”) with the incoming causes (“UP^[1] with TMs inside”), but then we’re done. There’s no need to dive inside the UP that happens within a TM and study it, because we’re already studying it, it’s the same UP we already have at the outermost layer.
  But if the UP inside a given TM is merely an approximation, then what happens inside it is not the same as the UP we have at the outermost layer. It does not contain the same TMs we already have.
  It contains some approximate thing, which (and this is the key point) might need to contain an even more coarsely approximated UP inside of its approximated TMs. (Our original argument for why approximation is needed might hold, again and equally well, at this level.) And the next level inside might be even more coarsely approximated, and so on.
  To determine the behavior of the outermost layer, we now need to understand the behavior of this whole series, because each layer determines what the next one up will observe.
  Does the series tend toward some asymptote? Does it reach a fixed point and then stay there? What do these asymptotes, or fixed points, actually look like? Can we avoid ever reaching a level of approximation that’s no longer $O(ϵ)$ but $O(1)$, even as we descend through an $O(1/ϵ)$ number of series iterations?
  I have no idea! I have not thought about it much. My point is simply that you have to consider the fact that approximation is involved in order to even ask the right questions, about asymptotes and fixed points and such. Once we acknowledge that approximation is involved, we get this series structure and care about its limiting behavior; this qualitative structure is not present at all in the idealized case where we imagine the TMs have UP oracles.
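  To illustrate why the limiting behavior of the series matters, here is a deliberately crude toy model. It is an assumption made purely for illustration, not a claim about how nested UP approximations actually behave. Let $e_k$ stand for the accumulated error after $k$ nested levels, and compare two hypothetical error dynamics:

  ```latex
  % Case 1 (hypothetical): each level adds a fresh error of size \epsilon.
  \begin{align*}
  e_{k+1} &= e_k + \epsilon, \quad e_0 = 0
  &&\Longrightarrow\quad e_n = n\,\epsilon,
  \quad\text{which is } O(1) \text{ once } n \sim 1/\epsilon .
  \end{align*}
  % Case 2 (hypothetical): each level also damps the inherited error by a factor r, 0 \le r < 1.
  \begin{align*}
  e_{k+1} &= r\,e_k + \epsilon, \quad e_0 = 0
  &&\Longrightarrow\quad e_n = \epsilon\,\frac{1 - r^{n}}{1 - r} \;\le\; \frac{\epsilon}{1 - r} = O(\epsilon)
  \quad\text{for all } n .
  \end{align*}
  ```

  In the first case the error becomes $O(1)$ after about $1/ϵ$ levels. In the second it stays $O(ϵ)$ at every depth. Which regime (if either) resembles the real situation is exactly the open question.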
  I also want to say something about the size of the approximations involved.
  Above, I casually described the approximation errors as $O(ϵ)$, and imagined an $ϵ \to 0$ limit.
  But in fact, we should not imagine that these errors can come as close to zero as we like. The UP is uncomputable, and involves running every TM at once^[2]. Why would we imagine that a single TM can approximate this arbitrarily well?^[3]
  Like the gap between the finite and the infinite, or between polynomial and exponential runtime, the gap between the uncomputable and the computable is not to be trifled with.
  Finally: the thing we get when we equip all the TMs with UP oracles isn’t the UP, it’s something else. (As far as I know, anyway.) That is, the self-referential quality of this system is itself only approximate (and it is by no means clear that the approximation error is small – why would it be?). If we have the UP at the bottom, inside the TMs, then we don’t have it at the outermost layer. Ignoring this distinction is, I guess, part of the “idealization,” but it is not clear to me why we should feel safe doing so.
  1. ^
    The thing outside the TMs here can’t really be the UP, but I’ll ignore this now and bring it up again at the end.
  2. ^
    In particular, running them all at once and actually using the outputs, at some (“finite”) time at which one needs the outputs for making a decision. It’s possible to run every TM inside of a single TM, but only by incurring slowdowns that grow without bound across the series of TMs; this approach won’t get you all the information you need, at once, at any finite time.
  3. ^
    There may be some result along these lines that I’m unaware of. I know there are results showing that the UP and SI perform well relative to the best computable prior/predictor, but that’s not the same thing. Any given computable prior/predictor won’t “know” whether or not it’s the best out of the multitude, or how to correct itself if it isn’t; that’s the value added by UP / SI.
  - Jeremy Gillen 30 Aug 2024 13:07 UTC
    8 points
    0
    Great explanation, you have found the crux. I didn’t know such problems were called singular perturbation problems.
    If I thought that reasoning about the UP was definitely a singular perturbation problem in the relevant sense, then I would agree with you (that the malign prior argument doesn’t really work). I think it’s probably not, but I’m not extremely confident.
    Your argument that it is a singular perturbation problem is that it involves self reference. I agree that self-reference is kinda special and can make it difficult to formally model things, but I will argue that it is often reasonable to just treat the inner approximations as exact.
    The reason is: Problems that involve self reference are often easy to approximate by using more coarse-grained models as you move deeper.
    One example as an intuition pump is an MCTS chess bot. In order to find a good move, it needs to think about its opponent thinking about itself, etc. We can’t compute this (because it’s exponential, not because it’s non-computable), but if we approximate the deeper layers by pretending they move randomly (!), it works quite well. Having a better move distribution works even better.
    Maybe you’ll object that this example isn’t precisely self-reference. But the same algorithm (usually) works for finding Nash equilibria in simultaneous-move games, which do involve infinitely deep self reference.
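    The "pretend the deeper layers move randomly" idea can be sketched on a much smaller game than chess. What follows is flat Monte Carlo move selection (a simplified relative of MCTS, not full MCTS) on a toy subtraction game chosen for illustration: players alternately take 1 to 3 stones, and whoever takes the last stone wins. The game, the function names, the rollout count, and the seed are all assumptions of this sketch.

    ```python
    import random

    def random_playout(stones, rng):
        """Both sides play uniformly random moves until the pile is empty.

        Returns True if the player who moves first in this playout takes the last stone.
        Requires stones >= 1.
        """
        mover_is_starter = True
        while True:
            stones -= rng.randint(1, min(3, stones))
            if stones == 0:
                return mover_is_starter
            mover_is_starter = not mover_is_starter

    def choose_move(stones, rollouts=2000, seed=0):
        """Pick the move whose random-playout win rate is highest."""
        rng = random.Random(seed)
        best_take, best_rate = None, -1.0
        for take in range(1, min(3, stones) + 1):
            remaining = stones - take
            if remaining == 0:
                rate = 1.0  # taking the last stone wins outright
            else:
                # The opponent moves first in the playout; we win when they lose it.
                wins = sum(not random_playout(remaining, rng) for _ in range(rollouts))
                rate = wins / rollouts
            if rate > best_rate:
                best_take, best_rate = take, rate
        return best_take

    for pile in (5, 6, 7):
        print(pile, "stones -> take", choose_move(pile))
    ```

    For piles of 5, 6, and 7 the random-opponent model picks the game-theoretically correct move (leave a multiple of 4), even though no layer of the search models a competent opponent. For larger piles the win-rate margin between moves narrows, which fits the remark that a better move distribution works better.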
    And another more general way of doing essentially the same thing is using a reflective oracle. Which I believe can also be used to describe a UP that can contain infinitely deep self-reference (see the last paragraph of the conclusion).^[1] I think the fact that Paul worked on this suggests that he did see the potential issues with self-reference and wanted better ways to reason formally about such systems.
    To be clear, I don’t think any of these examples tells us that the problem is definitely a regular perturbation problem. But I think these examples do suggest that assuming that it is regular is a very reasonable place to start, and probably tells us a lot about similar, more realistic, systems.
    On the gap between the computable and uncomputable: It’s not so bad to trifle a little. Diagonalization arguments can often be avoided with small changes to the setup, and a few of Paul’s papers are about doing exactly this.
    And the same argument works for a computable prior. E.g. we could make a prior over a finite set of total Turing machines, such that it still contained universes with clever agents.
    Why would we imagine that a single TM can approximate this arbitrarily well?
    If I remember correctly, a single TM definitely can’t approximate it arbitrarily well. But my argument doesn’t depend on this.
    1. ^
    Don’t trust me on this though, my understanding of reflective oracles is very limited.
    - LGS 30 Aug 2024 20:55 UTC
      5 points
      0
      Thanks for the link to reflective oracles!
      On the gap between the computable and uncomputable: It’s not so bad to trifle a little. Diagonalization arguments can often be avoided with small changes to the setup, and a few of Paul’s papers are about doing exactly this.
      
      I strongly disagree with this: diagonalization arguments often cannot be avoided at all, no matter how you change the setup. This is what vexed logicians in the early 20th century: no matter how you change your formal system, you won’t be able to avoid Gödel’s incompleteness theorems.
      There is a trick that reliably gets you out of such paradoxes, however: switch to probabilistic mixtures. This is easily seen in a game setting: in rock-paper-scissors, there is no deterministic Nash equilibrium. Switch to mixed strategies, however, and suddenly there is always a Nash equilibrium.
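      The rock-paper-scissors claim is easy to check by brute force. The sketch below (names are illustrative) enumerates all nine pure strategy profiles to confirm none is an equilibrium, then computes each move's expected payoff against an opponent mixing uniformly.

      ```python
      from fractions import Fraction
      from itertools import product

      MOVES = ("rock", "paper", "scissors")
      BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

      def payoff(mine, theirs):
          """Payoff to the player choosing `mine`: +1 win, -1 loss, 0 tie (zero-sum)."""
          if mine == theirs:
              return 0
          return 1 if BEATS[mine] == theirs else -1

      def is_pure_equilibrium(a, b):
          """True if neither player can gain by unilaterally switching moves."""
          a_ok = all(payoff(alt, b) <= payoff(a, b) for alt in MOVES)
          b_ok = all(payoff(alt, a) <= payoff(b, a) for alt in MOVES)
          return a_ok and b_ok

      pure_equilibria = [p for p in product(MOVES, repeat=2) if is_pure_equilibrium(*p)]

      uniform = {m: Fraction(1, 3) for m in MOVES}
      # Expected payoff of each pure move against a uniformly mixing opponent.
      against_uniform = {
          m: sum(prob * payoff(m, other) for other, prob in uniform.items())
          for m in MOVES
      }

      print("pure equilibria:", pure_equilibria)
      print("payoff vs uniform:", against_uniform)
      ```

      The list of pure equilibria comes out empty, and every move earns exactly 0 against the uniform mix. Since no deviation improves on 0, uniform-versus-uniform is an equilibrium.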
      This is the trick that Paul is using: he is switching from deterministic Turing machines to randomized ones. That’s fine as far as it goes, but it has some weird side effects. One of them is that if a civilization is trying to predict the universal prior that is simulating itself, and tries to send a message, then it is likely that with “reflective oracles” in place, the only message it can send is random noise. That is, Paul shows reflective oracles exist in the same way that Nash equilibria exist; but there is no control over what the reflective oracle actually is, and in paradoxical situations (like rock-paper-scissors) the Nash equilibrium is the boring “mix everything together uniformly”.
      The underlying issue is that a universe that can predict the universal prior, which in turn simulates the universe itself, can encounter a grandfather paradox. It can see its own future by looking at the simulation, and then it can do the opposite. The grandfather paradox is where the universe decides to kill the grandfather of a child that the simulation predicts.
      Paul solves this by only letting it see its own future using a “reflective oracle” which essentially finds a fixed point (which is a probability distribution). The fixed point of a grandfather paradox is something like “half the time the simulation shows the grandchild alive, causing the real universe to kill the grandfather; the other half the time, the simulation shows the grandfather dead and the grandchild not existing”. Such a fixed point exists even when the universe tries to do the opposite of the prediction.
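      The "half the time" figure follows from a one-line consistency condition, under the simplifying assumption (mine, for illustration) that the universe always does the exact opposite of what the simulation shows:

      ```latex
      % p = probability that the simulation shows the grandchild alive.
      % The universe does the opposite, so P(grandchild actually alive) = 1 - p.
      % A fixed point requires the shown distribution to match the actual one:
      p = 1 - p \quad\Longrightarrow\quad p = \tfrac{1}{2}
      ```

      This is only a caricature of how reflective oracles are defined, but it shows why a probabilistic fixed point exists where no deterministic one does.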
      The thing is, this fixed point is boring! Repeat this enough times, and it eventually just says “well my prediction about your future is random noise that doesn’t have to actually come true in your own future”. I suspect that if you tried to send a message through the universal prior in this setting, the message would consist of essentially uniformly random bits. This would depend on the details of the setup, I guess.
      - Jeremy Gillen 30 Aug 2024 21:08 UTC
        2 points
        0
        I strongly disagree with this: diagonalization arguments often cannot be avoided at all, no matter how you change the setup. …
        There is a trick that reliably gets you out of such paradoxes, however: switch to probabilistic mixtures.
        Fair enough, the probabilistic mixtures thing was what I was thinking of as a change of setup, but reasonable to not consider it such.
        the message would consist of essentially uniformly random bits
        I don’t see how this is implied. If a fact is consistent across levels, and determined in a non-paradoxical way, can’t this become a natural fixed point that can be “transmitted” across levels? And isn’t this kind of knowledge all that is required for the malign prior argument to work?
        LGS 30 Aug 2024 21:21 UTC
        3 points
        0
        The problem is that the act of leaving the message depends on the output of the oracle (otherwise you wouldn’t need the oracle at all, but you also would not know how to leave a message). If the behavior of the machine depends on the oracle’s actions, then we have to be careful with what the fixed point will be.
        
        For example, if we try to fight the oracle and do the opposite, we get the “noise” situation from the grandfather paradox.
        
        But if we try to cooperate with the oracle and do what it predicts, then there are many different fixed points and no telling which the oracle would choose (this is not specified in the setting).
        
        It would be great to see a formal model of the situation. I think any model in which such message transmission would work is likely to require some heroic assumptions which don’t correspond much to real life.
        Jeremy Gillen 30 Aug 2024 23:00 UTC
        2 points
        0
        If the only transmissible message is essentially uniformly random bits, then of what value is the oracle?
        I claim the message can contain lots of information. E.g. if there are 2^100 potential actions, but only 2 fixed points, then 99 bits have been transmitted (relative to uniform).
        The rock-paper-scissors example is relatively special, in that the oracle can’t narrow down the space of actions at all.
        The UP situation looks to me to be more like the first situation than the second.
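        The 99-bit figure above is the drop in uncertainty from a uniform distribution over $2^{100}$ actions to a uniform distribution over the 2 fixed points. Taking both distributions to be uniform is an assumption of this worked line:

        ```latex
        \log_2\!\left(2^{100}\right) - \log_2(2) = 100 - 1 = 99 \text{ bits}
        ```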
        LGS 31 Aug 2024 1:48 UTC
        3 points
        0
        It would help to have a more formal model, but as far as I can tell the oracle can only narrow down its predictions of the future to the extent that those predictions are independent of the oracle’s output. That is to say, if the people in the universe ignore what the oracle says, then the oracle can give an informative prediction.
        This would seem to exactly rule out any type of signal which depends on the oracle’s output, which is precisely the type of signal that nostalgebraist was concerned about.
        Jeremy Gillen 31 Aug 2024 18:08 UTC
        4 points
        2
        That can’t be right in general. Normal Nash equilibria can narrow down predictions of actions. E.g. a competition game. This is despite each player’s decision being dependent on the other player’s action.
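        The comment does not say which competition game is meant. As a stand-in, here is a sketch using a Cournot duopoly, with assumed linear demand $P = a - b(q_1 + q_2)$ and unit cost $c$ (all parameter values and function names are hypothetical). Each firm's best quantity depends on the other's, yet repeated best responses settle on a single prediction from very different starting points.

        ```python
        def best_response(q_other, a=10.0, b=1.0, c=1.0):
            """Profit-maximizing quantity given the rival's quantity.

            Profit is (a - b*(q + q_other) - c) * q; setting its derivative to zero
            gives q = (a - c - b*q_other) / (2*b), clipped at zero.
            """
            return max(0.0, (a - c - b * q_other) / (2.0 * b))

        def iterate_best_responses(q1, q2, rounds=60):
            """Both firms simultaneously re-optimize against the other's last quantity."""
            for _ in range(rounds):
                q1, q2 = best_response(q2), best_response(q1)
            return q1, q2

        # With a=10, b=1, c=1 the unique equilibrium is q1 = q2 = (a - c) / (3*b) = 3.
        for start in [(0.0, 0.0), (9.0, 0.5), (2.0, 7.0)]:
            print(start, "->", iterate_best_responses(*start))
        ```

        Every starting point converges to the same pair of quantities, so the equilibrium pins the prediction down despite the mutual dependence.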
        LGS 31 Aug 2024 19:17 UTC
        3 points
        0
        That’s fair, yeah
        We need a proper mathematical model to study this further. I expect it to be difficult to set up because the situation is so unrealistic/impossible as to be hard to model. But if you do have a model in mind I’ll take a look.
- faul_sname 29 Aug 2024 23:05 UTC
  4 points
  0
  
  Suppose that there is some search process that is looking through a collection of things, and you are an element of the collection. Then, in general, it’s difficult to imagine how you (just you) can reason about the whole search in such a way as to “steer it around” in your preferred direction.
  
  I think this is easy to imagine. I’m an expert who is among 10 experts recruited to advise some government on making a decision. I can guess some of the signals that the government will use to choose who among us to trust most. I can guess some of the relative weaknesses of fellow experts. I can try to use this to manipulate the government into taking my opinion more seriously. I don’t need to create a clone government and hire 10 expert clones in order to do this.
  
  The other 9 experts can also make guesses about which signals the government will use and what the relative weaknesses of their fellow experts are, and the other 9 experts can also act on those guesses. So in order to reason about what the outcome of the search will be, you have to reason about both yourself and also about the other 9 experts, unless you somehow know that you are much better than the other 9 experts at steering the outcome of the search as a whole. But in that case only you can steer the search. The other 9 experts would fail if they tried to use the same strategy you’re using.
  - Jeremy Gillen 30 Aug 2024 13:38 UTC
    4 points
    0
    unless you somehow know that you are much better than the other 9 experts at steering the outcome of the search as a whole. But in that case only you can steer the search. The other 9 experts would fail if they tried to use the same strategy you’re using.
    Okay if you accept this modified scenario where one expert knows they are much better than the other 9, then this is sufficient as a scenario that nostalgebraist claimed was difficult to imagine. So that’s enough to prove the point I was trying to make.
    But the original example works too. It’s just a simultaneous move game. It’ll be won by whichever player is best at playing the game. It’s clearly possible to play the game well, despite the self-reference involved with thinking about how to play better.
  - Thane Ruthenis 30 Aug 2024 3:04 UTC
    4 points
    2
    Consider a different problem: a group of people are posed some technical or mathematical challenge. Each individual person is given a different subset of the information about the problem, and each person knows what type of information every other participant gets.
    Trivial example: you’re supposed to find the volume of a pyramid, you (participant 1) are given its height and the apex angles for two triangular faces, participant 2 is given the radius of the sphere on which all of the pyramid’s vertices lie and all angles of the triangular faces, participant 3 is given the areas of all faces, et cetera.
    Given this setup, if you’re skilled at geometry, you can likely figure out which of the participants can solve the problem exactly, which can only put upper and lower bounds on the volume, and what those upper/lower bounds are for each participant. You don’t need to model your competitors’ mental states: all you need to do is reason about the object-level domain, plus take into account what information they have. No infinite recursion happens, because you can abstract out the particulars of how others’ minds work.
    This works assuming that everyone involved is perfectly skilled at geometry: that you don’t need to predict what mistakes the others would make (which would depend on the messy details of their minds).
    Speculatively, this would apply to deception as well. You don’t necessarily need to model others’ brain states directly. If they’re all perfectly skilled at deception, you can predict what deceptions they’d try to use and how effective they’d be based on purely objective information: the sociopolitical landscape, their individual skills and comparative advantages, et cetera. You can “skip to the end”: predict everyone playing their best-move-in-circumstances-where-everyone-else-plays-their-best-move-too.
    Objectively, the distribution of comparative advantages is likely very different, so even if everyone makes their best move, some would hopelessly lose. (E. g., imagine if one of the experts is a close friend of a government official and the other is a controversial figure who’d been previously judged guilty of fraud.)
    Speculatively, something similar works for the MUP stuff. You don’t actually need to model the individual details of other universes. You can just use abstract reasoning to figure out what kinds of universes are dense across Tegmark IV, figure out what (distributions over) entities inhabit them, figure out (distributions over) how they’d reason, and what (distributions over) simulations they’d run, and to what (distribution over the) output this process converges given the objective material constraints involved. Then take actions that skew said distribution-over-the-output in a way you want.
    Again, this is speculative: I don’t know that there are any math proofs that this is possible. But it seems plausible enough that something-like-this might work, and my understanding is that the MUP argument (and other kinds of acausal-trade setups) indeed uses this as a foundational assumption. (I. e., it assumes that the problem is isomorphic (in a relevant sense) to my pyramid challenge above.)
    (IIRC, the Acausal Normalcy post outlines some of the relevant insights, though I think it doesn’t precisely focus on the topic at hand.)