Here’s my understanding of the whole thing:

“Malign universal prior” arguments basically assume a setup in which we have an agent with a big dumb hard-coded module whose goal is to find this agent’s location in Tegmark IV. (Or maybe perform some other important task that requires reasoning about Tegmark IV, but let’s run with that as the example.)
The agent might be generally intelligent, and the Solomonoff-induction-approximating module might be sophisticated in all kinds of ways, but it’s “dumb” or “naive” in an important sense: it just tries to generate the best-guess distribution over the universes the agent might be in, no matter their contents, and then blindly acts on it.
Importantly, this process doesn’t necessarily involve actually running any low-level simulations of other universes. Generally intelligent/abstract reasoning, some steps of which might literally replicate the reasoning steps of Paul’s post, would also fit the bill.
The MUP argument is that this is sufficient for alien consequentialists to take advantage. The agent is asking, “where am I most likely to be?”, and the alien consequentialists are skewing the distribution such that the most likely correct answer is “simulation-captured by acausal aliens” or whatever.
(And then the malign output is producing “predictions” about the future of the agent’s universe like “the false vacuum collapse is going to spontaneously trigger in the next five minutes unless you perform this specific sequence of actions that happen to rewrite your utility function in such-and-such ways”, and our big dumb agent is gormlessly buying this, and its “real” non-simulation-captured instance rewrites itself accordingly.)
Speed prior vs. complexity prior: a common guess regarding the structure of Tegmark IV is that it works like the complexity prior – it penalizes K-complexity, but doesn’t care how much memory/compute needs to be allocated to run a universe. If that is true, then any sufficiently good approximation of Solomonoff induction – any sufficiently good procedure for getting an answer to “where am I most likely to be?”, including abstract reasoning – would take this principle into account, and bump up the probability of being in low-complexity universes.
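(For concreteness, one common way to write down that contrast – a rough sketch in the spirit of Levin’s Kt / Schmidhuber’s speed prior, with nothing in the argument hinging on this exact formalization – is:

$$
m(x) \;\propto\; \sum_{p\,:\,U(p)=x} 2^{-\ell(p)}
\qquad\text{vs.}\qquad
S(x) \;\propto\; \sum_{p\,:\,U(p)=x} 2^{-\ell(p)\,-\,\log_2 t(p)},
$$

where $\ell(p)$ is the length of program $p$ on the universal machine $U$ and $t(p)$ is its runtime. Under the complexity prior only length matters; under the speed prior, every doubling of runtime costs roughly one extra bit.)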
This all seems to check out to me. Admittedly I didn’t actually confirm this with any proponents of the argument, though.
(Probably also worth stating that I don’t think the MUP is in any way relevant to real life. AI progress doesn’t seem to be on the track where it features AGIs that use big dumb “where am I?” modules. E. g., if an AGI is born of anything like an RL-trained LLM, it seems unlikely that its “where am I?” reasoning would be naive in the relevant sense. It’d be able to “manually” filter out universes with malign consequentialists, given good decision theory. You know, like we can.
The MUP specifically applies to highly abstract agent-foundations designs where we hand-code each piece, that currently don’t seem practically tractable at all.)
Thanks.

I admit I’m not closely familiar with Tegmark’s views, but I know he has considered two distinct things that might be called “the Level IV multiverse”:
a “mathematical universe” in which all mathematical constructs exist
a more restrictive “computable universe” in which only computable things exist

(I’m getting this from his paper here.)
In particular, Tegmark speculates that the computable universe is “distributed” following the UP (as you say in your final bullet point). This would mean e.g. that one shouldn’t be too surprised to find oneself living in a TM of any given K-complexity, despite the fact that “almost all” TMs have higher complexity (in the same sense that “almost all” natural numbers are greater than any given number n).
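(To spell out why counting and measure come apart here – a standard observation, not anything specific to Tegmark’s proposal: for a prefix-free universal machine, the Kraft inequality bounds the total weight of all programs,

$$
\sum_{p} 2^{-\ell(p)} \;\le\; 1,
$$

so the combined weight of all programs of length $\ge n$ is the tail of a convergent series and shrinks toward zero as $n$ grows, even though, by counting, “almost all” programs are at least that long.)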
When you say “Tegmark IV,” I assume you mean the computable version—right? That’s the thing which Tegmark says might be distributed like the UP. If we’re in some uncomputable world, the UP won’t help us “locate” ourselves, but if the world has to be computable then we’re good[1].
With that out of the way, here is why this argument feels off to me.
First, Tegmark IV is an ontological idea, about what exists at the “outermost layer of reality.” There’s no one outside of Tegmark IV who’s using it to predict something else; indeed, there’s no one outside of it at all; it is everything that exists, full stop.
“Okay,” you might say, “but wait—we are all somewhere inside Tegmark IV, and trying to figure out just which part of it we’re in. That is, we are all attempting to answer the question, ‘what happens when you update the UP on my own observations?’ So we are all effectively trying to ‘make decisions on the basis of the UP,’ and vulnerable to its weirdness, insofar as it is weird.”
Sure. But in this picture, “we” (the UP-using dupes) and “the consequentialists” are on an even footing: we are both living in some TM or other, and trying to figure out which one.
In which case we have to ask: why would such entities ever come to such a destructive, bad-for-everyone (acausal) agreement?
Presumably the consequentialists don’t want to be duped; they would prefer to be able to locate themselves in Tegmark IV, and make decisions accordingly, without facing such irritating complications.

But, by writing to “output channels”[2] in the malign manner, the consequentialists are simply making the multiverse the sort of place where those irritating complications happen to beings like them (beings in TMs trying to figure out which TM they’re in) – and what’s more, they’re expending time and scarce resources to “purchase” this undesirable state of affairs!
In order for malignity to be worth it, we need something to break the symmetry between “dupes” (UP users) and “con men” (consequentialists), separating the world into two classes, so that the would-be con men can plausibly reason, “I may act in a malign way without the consequences raining down directly on my head.”
We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a “UP-using dupe” somewhere, for some reason, and then proceeds to reason about the properties of the (potentially very different, not “UP-using”?) guys inside the TMs. A sort of struggle between conniving, computable mortals and overly-innocent, uncomputable angels. Here we might argue that things really will go wrong for the angels, that they will be the “dupes” of the mortals, who are not like them and who do not themselves get duped. (But I think this form of the argument has other problems, the ones I described in the OP.)
But if the reason we care about the UP is simply that we’re all in TMs, trying to find our location within Tegmark IV, then we’re all in this together. We can just notice that we’d all be better off if no one did the malign thing, and then no one will do it[3].
In other words, in your picture (and Paul’s), we are asked to imagine that the computable world abounds with malign, wised-up consequentialist con men, who’ve “read Paul’s post” (i.e. re-derived similar arguments) and who appreciate the implications. But if so, then where are the marks? If we’re not postulating some mysterious UP-using angel outside of the computable universe, then who is there to deceive? And if there’s no one to deceive, why go to the trouble?
[1] I don’t think this distinction actually matters for what’s below, I just mention it to make sure I’m following you.

[2] I’m picturing a sort of acausal I’m-thinking-about-you-thinking-about-me situation in which, although I might never actually read what’s written on those channels (after all, I am not “outside” Tegmark IV looking in), nonetheless I can reason about what someone might write there, and thus it matters what is actually written there. I’ll only conclude “yeah, that’s what I’d actually see if I looked” if the consequentialists convince me they’d really pull the trigger, even if they’re only pulling the trigger for the sake of convincing me, and we both know I’ll never really look.

[3] Note that, in the version of this picture that involves abstract generalized reasoning rather than simulation of specific worlds, defection is fruitless: if you are trying to manipulate someone who is just thinking about whether beings will do X as a general rule, you don’t get anything out of raising your hand and saying “well, in reality, I will!” No one will notice; they aren’t actually looking at you, ever, just at the general trend. And of course “they” know all this, which raises “their” confidence that no one will raise their hand; and “you” know that “they” know, which makes “you” less interested in raising that same hand; and so forth.
When you say “Tegmark IV,” I assume you mean the computable version—right?
Yep.
We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a “UP-using dupe” somewhere, for some reason
Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I. e., it doesn’t need to be uncomputable. The difference between it and the con men is just the naivety of the design. It generates guesses regarding what universes it’s most likely to be in (potentially using abstract reasoning), but then doesn’t “filter” these universes; doesn’t actually “look inside” and determine if it’s a good idea to use a specific universe as a model. It doesn’t consider the possibility of being manipulated through it; doesn’t consider the possibility that it contains daemons.
I. e.: the real difference is that the “dupe” is using causal decision theory, not functional decision theory.
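To make the “filtering” distinction concrete, here is a toy sketch (the names are invented for illustration; a real agent obviously wouldn’t be a ten-line Python function):

```python
# Toy contrast between a "naive" agent that acts on its raw posterior over
# which universe it is in, and a "filtering" agent that first vets each
# hypothesis for signs of adversarial optimization (e.g. embedded
# consequentialists trying to get selected). All names here are invented.

def naive_choose_model(hypotheses, posterior):
    # Take the a-posteriori most likely universe and blindly run with it.
    return max(hypotheses, key=lambda h: posterior[h])

def filtering_choose_model(hypotheses, posterior, contains_manipulators):
    # Same, but refuse to act on hypotheses that look like they contain
    # agents optimizing to be selected by this very procedure.
    vetted = [h for h in hypotheses if not contains_manipulators(h)]
    return max(vetted, key=lambda h: posterior[h]) if vetted else None
```

The dupe, in this framing, is whatever agent ships with the first function.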
We can just notice that we’d all be better off if no one did the malign thing, and then no one will do it
I think that’s plausible: that there aren’t actually that many “UP-using dupes” in existence, so the con men don’t actually care to stage these acausal attacks.
But: if that is the case, it’s because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they’re not just naively using the unfiltered (approximation of the) UP.
That is: yes, it seems likely that the equilibrium state of affairs here is “nobody is actually messing with the UP”. But it’s because everyone knows the UP could be messed with in this manner, so no-one is using it (nor its computationally tractable approximations).
It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien analogue of evolution?) doesn’t teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever “demands” are posed to them. (I. e., there might be entire “worlds of dupes” somewhere out there among the mathematically possible.)
That said, the “dupe” label actually does apply to a lot of humans, I think. I expect that a lot of people, if they ended up believing that they’re in a simulation and that the simulators would do bad things to them unless they do X, would do X. The acausal con men would only care to actually do it, however, if a given person is (1) in the position where they could do something with large-scale consequences, (2) smart enough to consider the possibility of simulation-capture, (3) not smart enough to ignore blackmail.
Cool, it sounds like we basically agree!

But: if that is the case, it’s because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they’re not just naively using the unfiltered (approximation of the) UP.
I’m not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you’re effectively choosing that someone else will manipulate you.
Perhaps I’m misunderstanding you. I’m imagining something like choosing one’s own decision procedure in TDT, where one ends up choosing a procedure that involves “the unfiltered UP” somewhere, and which doesn’t do manipulation. (If your procedure involved manipulation, so would your copy’s procedure, and you would get manipulated; you don’t want this, so you don’t manipulate, nor does your copy.) But you write
the real difference is that the “dupe” is using causal decision theory, not functional decision theory
whereas it seems to me that TDT/FDT-style reasoning is precisely what allows us to “naively” trust the UP, here, without having to do the hard work of “filtering.” That is: this kind of reasoning tells us to behave so that the UP won’t be malign; hence, the UP isn’t malign; hence, we can “naively” trust it, as though it weren’t malign (because it isn’t).
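Here’s a deliberately crude toy of that symmetric-policy point (the payoff numbers and names are invented, just to make the shape of the reasoning concrete – not a model either of us is committed to):

```python
# Toy numbers for the symmetric-policy argument above (all values invented).
GAIN_FROM_DUPING     = 10.0   # what a con man gets if the other party naively trusts
COST_OF_MANIPULATING = 1.0    # resources burned writing malign "outputs"
COST_OF_BEING_DUPED  = 100.0

def payoff(me: str, them: str) -> float:
    p = 0.0
    if me == "manipulate":
        p -= COST_OF_MANIPULATING
        if them == "trust":
            p += GAIN_FROM_DUPING
    if them == "manipulate":
        p -= COST_OF_BEING_DUPED
    return p

policies = ["manipulate", "trust"]

# CDT-style: hold everyone else fixed at "trust" and optimize my own action.
cdt_choice = max(policies, key=lambda mine: payoff(mine, "trust"))

# TDT/FDT-style: my policy and the policies of agents relevantly like me are
# logically linked, so choosing a policy effectively chooses it for all of us.
fdt_choice = max(policies, key=lambda mine: payoff(mine, mine))

print(cdt_choice, fdt_choice)  # -> manipulate trust
```

The “con man” strategy only looks attractive if you hold everyone else’s policy fixed while varying your own – which is exactly the assumption the symmetry undercuts.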
More broadly, though—we are now talking about something that I feel like I basically understand and basically agree with, and just arguing over the details, which is very much not the case with standard presentations of the malignity argument. So, thanks for that.
I’m not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you’re effectively choosing that someone else will manipulate you.

Fair point! I agree.
Probably also worth stating that I don’t think the MUP is in any way relevant to real life.
I think it’s relevant because it illustrates an extreme variant of a very common problem, where “incorrectly specified” priors can cause unexpected behavior. It also illustrates the daemon problem, which I expect to be very relevant to real life.
A more realistic and straightforward example of the “incorrectly specified prior” problem: If the prior on an MCTS value head isn’t strong enough, it can overfit to value local instrumental goals too highly. Now your overall search process will only consider strategies that involve lots of this instrumental goal. So you end up with an agent that looks like it terminally values e.g. money, even though the goal in the “goal slot” is exactly correct and doesn’t include money.
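A minimal sketch of that downstream effect (the weights and plan names are stipulated purely for illustration, not a claim about any particular training setup):

```python
import numpy as np

# Feature order: [goal_progress, money]. The "goal slot" (reward) only cares
# about goal progress, not money. Two hypothetical value heads: one whose
# prior kept it close to the true value, and one that overfit and learned to
# prize money, an instrumental correlate of reward in its training data.
true_reward_weights      = np.array([1.0, 0.0])   # money is worth nothing terminally
well_regularized_weights = np.array([0.9, 0.1])   # close to the truth
overfit_value_weights    = np.array([0.3, 1.5])   # money valued far too highly

# Candidate plans, described by the features of the states they reach.
plans = {
    "pursue_goal_directly": np.array([1.0, 0.0]),
    "grab_money_first":     np.array([0.2, 2.0]),
}

def best_plan(weights):
    # A one-step stand-in for the selection pressure a value head exerts
    # inside a search process like MCTS: plans are ranked by estimated value.
    return max(plans, key=lambda name: plans[name] @ weights)

print(best_plan(true_reward_weights))       # pursue_goal_directly
print(best_plan(well_regularized_weights))  # pursue_goal_directly
print(best_plan(overfit_value_weights))     # grab_money_first
```

The reward function here is “exactly correct”, but the search only ever sees the world through the value head, so a mis-calibrated head is what actually steers behavior.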