I still find myself unconvinced by all the arguments against the Solomonoff prior I have encountered. For this particular argument, as you say, there are still many ways the conjectured counterexample of adversaria could fail if you actually tried to sit down and formalise it. Since the counterexample is designed to break a formalism that looks and feels really natural and robust to me, my guess is that the formalisation will indeed fall to one of these obstacles, or a different one.
In a way, that makes perfect sense; Solomonoff induction really can’t run in our universe! Any robot we could build to “use Solomonoff induction” would have to use some approximation, which the malign prior argument may or may not apply to.
You can just reason about Solomonoff induction with cut-offs instead. If you render the induction computable by giving it a uniform prior over all programs of some finite length N[1] with runtime ≤ T, it still seems to behave sanely. As in, you can derive analogs of the key properties of normal Solomonoff induction for this cut-off induction. E.g. the induction will not make more than K(P∗) bits’ worth of prediction mistakes compared to any ‘efficient predictor’ program P∗ with runtime < T and K-complexity K(P∗) < N; it’s got a rough invariance to which Universal Turing Machine you run it on; etc.
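For concreteness, here is a minimal toy sketch of the kind of predictor I mean. The program class, interpreter, and parameter names are made up for illustration, and the step budget T is only threaded through rather than enforced, because the toy program class always halts; a real version would enumerate bitstrings as programs for some fixed UTM and cut them off at T steps. The point is just that everything is a finite, posterior-weighted loop:

```python
import itertools

def run_program(program_bits, history, max_steps):
    """Stand-in for a time-limited universal machine. Here a 'program' is just a
    lookup table from the last two observed bits to a prediction, so the example
    actually runs; a real version would interpret program_bits on some fixed UTM
    and return None whenever it exceeds max_steps or fails to output a prediction."""
    k = 2
    if len(program_bits) < 2 ** k:
        return None  # treat too-short bitstrings as malformed programs
    context = tuple(history[-k:]) if len(history) >= k else (0,) * k
    index = 2 * context[0] + context[1]
    return 0.9 if program_bits[index] == 1 else 0.1  # predicted P(next bit = 1)

def cutoff_induction_predict(history, N=6, T=10_000):
    """Toy cut-off induction: uniform prior over all programs of length <= N,
    each given a runtime budget of T steps, with predictions weighted by the
    likelihood each program assigns to the observed history."""
    numer = denom = 0.0
    for length in range(1, N + 1):
        for program in itertools.product([0, 1], repeat=length):
            weight = 1.0  # likelihood of the history under this program
            for t, bit in enumerate(history):
                p_one = run_program(program, history[:t], T)
                if p_one is None:
                    weight = 0.0
                    break
                weight *= p_one if bit == 1 else 1.0 - p_one
            if weight == 0.0:
                continue
            p_next = run_program(program, history, T)
            if p_next is not None:
                numer += weight * p_next
                denom += weight
    return numer / denom if denom > 0 else 0.5

# The posterior concentrates on lookup tables that fit the alternating pattern,
# so the prediction for the next bit comes out well below 0.5.
print(cutoff_induction_predict([0, 1, 0, 1, 0, 1]))
```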
Since finite, computable things are easier for me to reason about, I mostly use this cut-off induction in my mental toy models of AGI these days.
EDIT: Apparently, this exists in the literature under the name AIXI-tl. I didn’t know that. Neat.
I think I mostly agree with this, though I think things possibly get more complicated when you throw decision theory into the mix. In part, I think it unlikely that I’m being adversarially simulated. I could believe that such malign prior problems are actually decision theory problems much more than epistemic problems. E.g. “no, I am not going to do what the evil super-simple-simulators want me to do because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior)”.
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
Oh, my point wasn’t against Solomonoff in general. Maybe more crisply: my claim is that different decision theories will find different “pathologies” in the Solomonoff prior. In particular, for causal and evidential decision theorists I could totally buy the misaligned prior bit, and I could totally buy that, if formalized, the whole thing rests on the interaction between bad decision theory and Solomonoff.
But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn’t matter what the decision theory is.
My world model would have a loose model of myself in it, and this will change which worlds I’m more or less likely to be found in. For example, a logical decision theorist, trying to model Omega, will assign very low probability to Omega having predicted it will two-box.
no, I am not going to do what the evil super-simple-simulators want me to do because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior)
In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?
But I’ll expand: An agent doing that kind of game-theory reasoning needs to model the situation it’s in. And to do that modelling it needs a prior. Which might be malign.
Malign agents in the prior don’t feel like malign agents in the prior, from the perspective of the agent with the prior. They’re just beliefs about the way the world is. You need beliefs in order to choose actions. You can’t just decide to act in a way that is independent of your beliefs, because you’ve decided your beliefs are out to get you.
On top of this, how would you even decide that your beliefs are out to get you? Isn’t this also a belief?
Let M be an agent which can be instantiated in a much simpler world and has different goals from our limited Bayesian agent A. We say M is malign with respect to A if p(q|O) < p(q_{M,A}|O), where q is the “real” world and q_{M,A} is the world where M has decided to simulate all of A’s observations for the purpose of trying to invade A’s prior.
Now what influences p(q_{M,A}|O)? Well, M will only simulate all of A’s observations if it expects this will give it some influence over A. Let L_A be an unformalized logical counterfactual operation that A could make.
Then p(q_{M,A}|O, L_A) is maximal when L_A takes into account M’s simulation, and 0 when L_A doesn’t take into account M’s simulation. In particular, if L_{A,¬M} is a logical counterfactual which doesn’t take M’s simulation into account, then
p(q_{M,A} | O, L_{A,¬M}) = 0 < p(q | O, L_{A,¬M})
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operations, same as in causal decision theory, and same as in evidential decision theory.
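To spell out the Bayes step behind that last inequality (a sketch, relying only on the assumption above that M simulates A’s observations only if it expects this to give it influence over A):

```latex
% Conditioning on the counterfactual L_{A,\neg M} zeroes out the prior weight of
% the simulation world, and Bayes' rule does the rest:
\[
p(q_{M,A} \mid O, L_{A,\neg M})
  \;=\; \frac{p(O \mid q_{M,A}, L_{A,\neg M}) \, p(q_{M,A} \mid L_{A,\neg M})}{p(O \mid L_{A,\neg M})}
  \;=\; 0,
\qquad \text{since } p(q_{M,A} \mid L_{A,\neg M}) = 0 .
\]
```

The posterior mass that q_{M,A} would have taken then falls back on the remaining worlds consistent with O, including q, which gives the inequality.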
I’m not sure what the type signature of L_A is, or what it means to “not take into account M’s simulation”. When A makes decisions about which actions to take, it doesn’t have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to “not take it into account”?
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operations
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
I’m not sure what the type signature of L_A is, or what it means to “not take into account M’s simulation”
I know you know about logical decision theory, and I know you know it’s not formalized, and I’m not going to be able to formalize it in a LessWrong comment, so I’m not sure what you want me to say here. Do you reject the idea of logical counterfactuals? Do you not see how they could be used here?
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with Omega and knowing how Omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
This one. I’m confused about what the intuitive intended meaning of the symbol is. Sorry, I see why “type signature” was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe L_A is a boolean fact that is edited? But if so I don’t know which fact it is, and I’m confused by the way you described it.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10^10 utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to −10^10 utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it’s in a simulation.
If you make an agent by sticking together cut-off Solomonoff induction and e.g. causal decision theory, I do indeed buy that this agent will have problems. Because causal decision theory has problems.
But how serious will these problems be? What if you encrypt the agent’s thoughts, add pain sensors, and make a few other simple patches to deal with embeddedness?
I wouldn’t be comfortable handing the lightcone over to such a thing, but I don’t really expect it to fall over anytime soon.
What you describe is not actually equivalent to AIXI-tl, which conducts a proof search to justify policies. Your idea has more in common with Schmidhuber’s speed prior.
One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient.
So the “hypotheses” inside your inductor won’t actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it’s done a brute force search over a huge space of programs until it finds one that works. Plausibly it’ll just find a better efficient induction algorithm, with a sane prior.
That’s fine. I just want a computable predictor that works well. This one does.
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.
Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. General inductor program P1 running program P2 is pretty much never going to be the shortest implementation of P2.
You also want one that generalises well, and doesn’t do performative predictions, and doesn’t have goals of its own. If your hypotheses aren’t even intended to be reflections of reality, how do we know these properties hold?
Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.
When we compare theories, we don’t consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn’t quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn’t involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
Edit to respond to your edit: I don’t see your reasoning, and that isn’t my intuition. For moderately complex worlds, it’s easy for the description length of the world to be longer than the description length of many kinds of inductor.
You also want one that generalises well, and doesn’t do performative predictions, and doesn’t have goals of its own. If your hypotheses aren’t even intended to be reflections of reality, how do we know these properties hold?
Because we have the prediction error bounds.
When we compare theories, we don’t consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself.
E.g. the theory of evolution isn’t quite code for a costly simulation. But it can be viewed as a set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn’t involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.
To respond to your edit: I don’t see your reasoning, and that isn’t my intuition. For moderately complex worlds, it’s easy for the description length of the world to be longer than the description length of many kinds of inductor.
Because we have the prediction error bounds.
Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor is found first that has a better prior, it’ll stick with the mesa-inductor. And if it has goals, it can wait as long as it wants to make a false prediction that helps achieve its goals. (Or just make false predictions about counterfactuals that are unlikely to be chosen).
If I’m wrong then I’d be extremely interested in seeing your reasoning. I’d maybe pay $400 for a post explaining the reasoning behind why prediction error bounds rule out mesa-optimisers in the prior.
The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.
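For reference, here is the dominance argument that kind of bound comes from, sketched for a Bayes mixture ξ over the finite program class, with w(P∗) the prior weight the cut-off inductor gives to P∗:

```latex
% For every sequence x_{1:n}:  xi(x_{1:n}) = sum_P w(P) P(x_{1:n}) >= w(P*) P*(x_{1:n}),
% so taking -log_2 of both sides:
\[
-\log_2 \xi(x_{1:n}) \;\le\; -\log_2 P^*(x_{1:n}) \;+\; \log_2 \frac{1}{w(P^*)}
\qquad \text{for all } n .
\]
```

I.e. the mixture’s cumulative log-loss never exceeds P∗’s by more than log2(1/w(P∗)) bits, on any sequence; taking expectations under P∗ gives the cumulative cross-entropy form.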
Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.
Can also discuss on a call if you like.
Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on.
Yeah I know that bound, I’ve seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.
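To make ‘well below the bound’ concrete, here is a back-of-envelope sketch with made-up numbers, reusing the w(P∗) notation from the bound above:

```latex
% Suppose a predictor matches P* everywhere except on m chosen datapoints,
% where it assigns the true outcome probability 2^{-c}. Its total excess loss
% is then about m*c bits, which the lifetime bound permits whenever
\[
m \cdot c \;\le\; \log_2 \frac{1}{w(P^*)} .
\]
% E.g. a budget of 10,000 bits leaves room for ~500 such datapoints at c = 20.
```

The bound caps the total damage, but says nothing about which datapoints it gets spent on.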