The most important graph from the “faking alignment” paper is this one:
Also, you should care about worlds proportional to the square of their amplitude.
It’s actually interesting to consider why this must be the case. Without it, I concede that maybe some sort of Quantum Anthropic Shadow could be true. I’m thinking it would lead to lots of wacky consequences.
I suppose the main point you should draw from “Anthropic Blindness” as it applies to QI is that:
Quantum Immortality is not a philosophical consequence of MWI, it is an empirical hypothesis with a very low prior (due to complexity).
Death is not special. Assuming you have never gotten a Fedora up to this point, it is consistent to assume that “Quantum Fedoralessness” is true. That is, if you keep flipping a quantum coin that has a 50% chance of giving you a Fedora, the universe will only have you experience the path that doesn’t give you the Fedora. Since you have never gotten a Fedora yet, you can’t rule this hypothesis out. The silliness of this example demonstrates why we should likewise be skeptical of Quantum Immortality.
A universe with classical mechanics, except that when you die the universe gets resampled, would be anthropic angelic.
Beings who save you are also anthropic angelic. For example, the fact that you don’t die while driving is because the engineers explicitly tried to minimize your chance of death. You can make inferences based on this. For example, even if you have never crashed, you can reason that during a crash you will endure less damage than other parts of the car, because the engineers wanted to save you more than they wanted to save the parts of the car.
No, the argument is that the traditional (weak) evidence for anthropic shadow is instead evidence of anthropic angel. QI is an example of anthropic angel, not anthropic shadow.
So for example, a statistically implausible number of LHC failures would be evidence for some sort of QI and also other related anthropic angel hypotheses, and they don’t need to be exclusive.
The more serious problem is that quantum immortality and angel immortality eventually merge.
An interesting observation, but I don’t see how that is a problem with Anthropically Blind? I do not assert anywhere that QI and anthropic angel are contradictory. Rather, I give QI as an example of an anthropic angel.
“I am more likely to be born in the world where life extensions technologies are developing and alignment is easy”. Simple Bayesian update does not support this.
I mean, why not?
Writing H for “life extension is developing and alignment is easy”: P(H | I will be immortal) = P(H) * P(I will be immortal | H) / P(I will be immortal)
Believing QI is the same as a Bayesian update on the event “I will become immortal”.
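For concreteness, here is that update with numbers plugged in; the prior and likelihoods below are arbitrary placeholders, not my actual estimates:

```python
# Toy Bayesian update for P(H | I will be immortal), where
# H = "life extension is developing and alignment is easy".
# All numbers are hypothetical placeholders chosen only to illustrate the update.

prior_h = 0.10             # P(H)
p_imm_given_h = 0.50       # P(I will be immortal | H)
p_imm_given_not_h = 0.01   # P(I will be immortal | not H)

# Law of total probability for the evidence "I will be immortal".
p_imm = prior_h * p_imm_given_h + (1 - prior_h) * p_imm_given_not_h

# Bayes' rule.
posterior_h = prior_h * p_imm_given_h / p_imm
print(f"P(H | I will be immortal) = {posterior_h:.3f}")  # ~0.847 with these placeholders
```

Conditioning on immortality raises the probability of H exactly insofar as H makes immortality more likely.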
Imagine you are a prediction market trader, and a genie appears. You ask the genie “will I become immortal” and the genie answers “yes” and then disappears.
Would you buy shares on a Taiwan war happening?
If the answer is yes, the same thing should apply if a genie told you QI is true (unless the prediction market already priced QI in). No weird anthropics math necessary!
LDT decision theories are probably the best decision theories for problems in the fair problem class.
The post demonstrates why this statement is misleading.
If “play the ultimatum game against a LDT agent” is not in the fair problem class, I’d say that LDT shouldn’t be in the “fair agent class”. It is like saying that in a tortoise-only race, the best racer is a hare because a hare can beat all the tortoises.
So based on the definitions you gave I’d classify “LDT is the best decision theory for problems in the fair problem class” as not even wrong.
In particular, consider a class of allowable problems S, but then also say that an agent X is allowable only if “play a given game with X” is in S. Then the argument in the “No agent is rational in every problem” section of my post goes through for allowable agents. (Note that the argument in that section is general enough to apply to agents that don’t give in to $9 rock.)
Practically speaking: if you’re trying to follow decision theory X, then playing against other X agents is a reasonable problem.
Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?
There is a no free lunch theorem for this. LDT (and everything else) can be irrational
I would suggest formulating this like a literal attention economy.
1. You set a price for your attention (probably around $1): the price at which, even if a post is a waste of your time, the money makes it worth it.
2. “Recommenders” can recommend content to you by paying that price.
3. If the content was worth your time, you pay the recommender the $1 back plus a couple of cents.
The idea is that the recommenders would get good at predicting what posts you’d pay them for. And since you aren’t a causal decision theorist they know you won’t scam them. In particular, on average you should be losing money (but in exchange you get good content).
This doesn’t necessarily require new software. Just tell people to send PayPals with a link to the content.
With custom software, theoretically there could exist a secondary market for “shares” in the payout from step 3 to make things more efficient. That way the best recommenders could sell their shares and then use that money to recommend more content before you pay out.
If the system is bad at recommending content, at least you get paid!
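If someone did build custom software for this, a minimal sketch of the accounting might look like the following; the $1 price, the 5-cent bonus, and all the class and method names are made up for illustration:

```python
# Minimal sketch of the attention-market accounting described above.
# The price, the bonus, and the API are illustrative assumptions, not a spec.

ATTENTION_PRICE = 1.00   # step 1: the price of your attention
GOOD_POST_BONUS = 0.05   # the "couple of cents" you pay on top for a worthwhile post

class AttentionMarket:
    def __init__(self):
        self.recommender_profit = {}  # recommender -> their net profit from you so far

    def recommend(self, recommender: str, post_url: str) -> str:
        """Step 2: the recommender pays the price of your attention up front."""
        self.recommender_profit[recommender] = (
            self.recommender_profit.get(recommender, 0.0) - ATTENTION_PRICE
        )
        return post_url  # you now read the post

    def rate(self, recommender: str, worth_it: bool) -> None:
        """Step 3: if the post was worth your time, pay back the price plus the bonus."""
        if worth_it:
            self.recommender_profit[recommender] += ATTENTION_PRICE + GOOD_POST_BONUS

market = AttentionMarket()
market.recommend("alice", "https://example.com/good-post")
market.rate("alice", worth_it=True)    # alice nets about +$0.05
market.recommend("bob", "https://example.com/spam")
market.rate("bob", worth_it=False)     # bob is out $1.00, and you were paid for the wasted time
```

The secondary market for shares could then be modeled as trading claims on the future step-3 payouts before you rate the posts.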
Yes this would be a no free lunch theorem for decision theory.
It is different from the “No free lunch in search and optimization” theorem though. I think people had an intuition that LDT will never regret its decision theory, because if there is a better decision theory then LDT will just copy it. You can think of this as LDT acting as though it could self-modify. So the belief (which I am debunking) is that the environment can never punish the LDT agent; it just pretends to be the environment’s favorite agent.
The issue with this argument is that in the problem I published above, the problem itself contains an LDT agent, and that LDT agent can “punish” the first agent for acting like $9 rock, or even for pre-committing or literally self-modifying to become $9 rock. It knows that the first agent didn’t have to do that.
So the first LDT agent will literally regret not being hardcoded to “output $9”.
This is very robust to what we “allow” agents to do (can they predict each other, how accurately can they predict each other, what counterfactuals are legit or not, etc...), because no matter what the rules are you can’t get more than $5 in expectation in a mirror match.
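To put numbers on the regret: suppose the problem is “play the ultimatum game over $10 against an LDT agent”, and suppose (my assumption, for illustration) that this opponent gives in to a genuine $9 rock but splits evenly against a copy of itself. Then:

```python
# Toy payoff table for "play the ultimatum game over $10 against an LDT agent".
# The opponent's behavior (give in to a genuine $9 rock, split evenly in a mirror
# match) is an assumption made for illustration, not a theorem about LDT.

def expected_payoff(player: str) -> float:
    """Expected dollars the player walks away with against the LDT opponent."""
    if player == "nine_dollar_rock":
        # The rock can't be bargained with, so the LDT opponent prefers $1 to $0.
        return 9.0
    if player == "ldt":
        # Mirror match: the game is symmetric and only $10 is on the table, so no
        # symmetric outcome gives either copy more than $5 in expectation.
        return 5.0
    raise ValueError(f"unknown player: {player}")

for player in ("nine_dollar_rock", "ldt"):
    print(player, expected_payoff(player))
# nine_dollar_rock 9.0
# ldt 5.0
```

Whatever the exact numbers, the $5 ceiling in the mirror match is what generates the regret.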
I disagree with the characterization of me as thinking that problems can be solved on paper
Would you say the point of MIRI was/is to create theory that would later lead to safe experiments (but that it hasn’t happened yet)? Sort of like how the Manhattan project discovered enough physics to not nuke themselves, and then started experimenting? 🤔
If you aren’t maximizing expected utility, you must choose one of the four axioms to abandon.
Maximizing expected utility in Chinese Roulette requires Bayesian updating.
Let’s say on priors that P(n=1) = p and that P(n=5) = 1-p. Call this instance of the game G_p.
Let’s say that you shoot instead of quitting in the first round. For G_1/2 (i.e. p = 1/2), there are four possibilities:
1. n = 1, vase destroyed: The probability of this scenario is 1/12. No further choices are needed.
2. n = 5, vase destroyed: The probability of this scenario is 5/12. No further choices are needed.
3. n = 1, vase survived: The probability of this scenario is 5/12. The player needs a strategy to continue playing.
4. n = 5, vase survived: The probability of this scenario is 1/12. The player needs a strategy to continue playing.
Notice that the strategy must be the same for 3 and 4 since the observations are the same. Call this strategy S.
The expected utility, which we seek to maximize, is:
E[U(shoot and then S)] = 0 + 5/12 * (R + E[U(S) | n = 1]) + 1/12 * (R + E[U(S) | n = 5])
Most of our utility is determined by the n = 1 worlds.
Manipulating the equation we get:
E[U(shoot and then S)] = R/2 + 1/2 * (5/6 * E[U(S) | n = 1] + 1/6 * E[U(S) | n = 5])
But the expression 5/6 * E[U(S) | n = 1] + 1/6 * E[U(S) | n = 5] is the expected utility if we were playing G_5/6. So the optimal S is the optimal strategy for G_5/6. This is the same as doing a Bayesian update (1:1 * 5:1 = 5:1 = 5/6).
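Here is a quick numerical check of that identity; R and the two continuation values are arbitrary placeholders:

```python
from fractions import Fraction as F

# Arbitrary placeholder values, just to check the algebra.
R = F(3)          # reward for surviving a round
EU_S_n1 = F(7)    # E[U(S) | n = 1]
EU_S_n5 = F(2)    # E[U(S) | n = 5]

# Expected utility written out over the four possibilities of G_1/2.
lhs = (F(1, 12) * 0                  # 1. n = 1, vase destroyed
       + F(5, 12) * 0                # 2. n = 5, vase destroyed
       + F(5, 12) * (R + EU_S_n1)    # 3. n = 1, vase survived
       + F(1, 12) * (R + EU_S_n5))   # 4. n = 5, vase survived

# The rearranged form: half a round's reward plus half the value of playing G_5/6.
rhs = R / 2 + F(1, 2) * (F(5, 6) * EU_S_n1 + F(1, 6) * EU_S_n5)
assert lhs == rhs

# The Bayesian update itself: prior odds 1:1, likelihood ratio (5/6):(1/6) = 5:1,
# so the posterior probability of n = 1 after surviving is 5/6.
posterior_n1 = (F(1, 2) * F(5, 6)) / (F(1, 2) * F(5, 6) + F(1, 2) * F(1, 6))
assert posterior_n1 == F(5, 6)
```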
The way anthropics twists things is that if this were Russian roulette, I might not be able to update after 20 Es that the gun is empty, since in all the worlds where I died there’s no one to observe what happened; so of course I find myself in the one world where, by pure chance, I survived.
This is incorrect due to the anthropic undeath argument. The vast majority of surviving worlds will be ones where the gun is empty, unless it is impossible to be so. This is exactly the same as a Bayesian update.
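For concreteness: with a 50/50 prior on the gun being empty and a fresh 1-in-6 risk per round if it is loaded (both numbers chosen just for illustration), the update after surviving 20 rounds looks like this:

```python
# Posterior that the gun is empty after surviving 20 rounds, for a surviving observer.
# The 50/50 prior and the 1-in-6 per-round risk if loaded are illustrative assumptions.

prior_empty = 0.5
p_survive_given_empty = 1.0 ** 20        # you always survive if the gun is empty
p_survive_given_loaded = (5 / 6) ** 20   # ~0.026 if one chamber is loaded and re-spun each round

posterior_empty = (prior_empty * p_survive_given_empty) / (
    prior_empty * p_survive_given_empty
    + (1 - prior_empty) * p_survive_given_loaded
)
print(f"P(empty | survived 20 rounds) = {posterior_empty:.4f}")  # ~0.9746
```

Almost all of the surviving observers are in the empty-gun worlds, which is just the ordinary Bayesian answer.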
Human labor becomes worthless but you can still get returns from investments. For example, if you have land, you should rent the land to the AGI instead of selling it.
Now see if you can catch sandbagging in the scratchpad!