De-confusing myself about Pascal’s Mugging and Newcomb’s Problem
Epistemic status: I’ve been confused by Pascal’s Mugging and Newcomb’s Problem for a while now. After a clarifying short-form conversation with Dagon and others, I’m ready to post my thoughts as an essay to solicit feedback.
Pascal’s Mugging and Newcomb’s Problem belong to a class of problems that pit a “utility-maximizing agent (UMA) vs. an adversarial superior modeler.” You can look at such problems from at least three points of view.
How can the UMA...
1. ...verify the superior/adversarial nature of the model?
2. ...avoid being entrapped by the model?
3. ...respond if it accepts that it is inescapably entrapped?
Both of these problems try to focus your attention on (3), the problem of how the UMA should respond to inescapable entrapment. The point is that the UMA would be helpless. The superior modeler can fundamentally alter the nature of reality as experienced by the UMA, so that the concepts of “choice” and “cause and effect” lose their meaning.
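To make the entrapment concrete, here is a minimal sketch of the arithmetic behind Pascal’s Mugging. The expected_utility function and every number in it are my own illustrative inventions, not part of any formal treatment: the point is only that as long as the mugger can name a payoff large enough to swamp whatever tiny probability the agent assigns, a naive expected-value calculation recommends paying.

```python
# Toy sketch (assumed numbers) of a naive expected-utility maximizer
# facing a Pascal's Mugging: a tiny, unverifiable probability attached
# to an astronomically large payoff dominates the calculation.

def expected_utility(probability: float, payoff: float, cost: float) -> float:
    """Expected utility of paying the mugger: chance of the promised
    payoff materializing, minus the certain cost of handing over the wallet."""
    return probability * payoff - cost

claimed_probability = 1e-20  # probability the agent assigns to the mugger's story
claimed_payoff = 1e30        # utility the mugger promises to deliver
wallet_cost = 100            # certain loss if the agent pays up

if expected_utility(claimed_probability, claimed_payoff, wallet_cost) > 0:
    print("Naive UMA pays the mugger.")  # this branch fires: 1e10 - 100 > 0
else:
    print("Naive UMA refuses.")
```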
Roko’s Basilisk points out that the superior modeler doesn’t even need to exist to exert an effect. Merely considering what it might want us to do, and how it could manipulate us if it did exist, can influence our actions in the present.
It seems like the limitations on our intelligence have a protective effect. Failure to fully consider and act on this problem is not a bug, it’s a feature. It prevents us from taking the whims of nonexistent adversarial superintelligences into account in our decision-making.
Sometimes, our world does contain minor superintelligences, particularly markets, with which we need to engage, yet which can have a similar paralyzing effect. Strategies for investment, entrepreneurship, career-building, and other domains are often about looking for the inadequacies of civilization, which is the same as looking for the inadequacies of the minor social superintelligences we’ve managed to build so far. Participating in that endeavor improves the superintelligence; it just works much more slowly than the theoretical maximum, because human brains are slower and more fragile than silicon computers.
From this perspective, the War in Iraq seems like a real-world example of a self-inflicted Pascal’s Mugging. Nuclear weapons and similar measures appear as a crucial form of deterrence against Pascal’s Muggings (PMs), by making the real threat dominate consideration of any imaginary threats. The US government might worry that North Korea will build a nuclear weapon and bomb somebody arbitrarily, but the artillery North Korea has pointed at Seoul deters the US from launching a brutal war to head off that possibility, and peace is able to exist. We don’t arm ourselves with nuclear weapons to protect against them; we appreciate them arming themselves with nuclear weapons to save us from ourselves.
The whole dilemma points out the fragility of human value. While we have tangible goods that we seek to acquire (health, money, love, beauty, truth), we also value our ability to use our own minds, social networks, and bodies to produce them. Complicating matters, we also value our ability to adapt, use tools, and cast off outmoded cultural and technological forms.
We are helplessly born into a society that is smarter and more powerful than we are. Some social intelligence contexts, across space and time, are good or improving on net, while others are bad or worsening. And with the AI safety problem, we’re worried about creating a technology that effects a sudden, irrevocable rupture, the effect of which we cannot predict. It’s a nuclear brain-bomb, which might turn out to wipe out primarily or exclusively the most negative aspects of our global hive-mind, or might turn out to wipe out the best.
This has the characteristics of a bet where we face the risk of ruin. If we value future lives, the cost of ruin is unfathomably large, while the benefit of accelerating the payoff is relatively minor. The bet only makes sense for people living now, who stand to benefit tremendously from success but, if the bet fails, face a fate roughly similar to the one they face already.
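To make the asymmetry concrete, here is a rough back-of-the-envelope sketch. The quantities p_ruin, value_future, and upside_gain are placeholder assumptions of mine, not estimates from the argument above; the point is only that the same gamble has a hugely negative expected value if future lives count toward the cost of ruin, and a positive one if the only downside is a fate similar to the status quo.

```python
# Back-of-the-envelope sketch with invented placeholder numbers,
# showing how the "risk of ruin" comparison flips depending on
# whose lives you count.

p_ruin = 0.1          # assumed probability the gamble ends in ruin
value_future = 1e15   # stand-in for the value of all future lives at stake
upside_gain = 5e9     # stand-in for the benefit of an earlier payoff

# Counting future lives: the expected cost of ruin dwarfs the upside.
ev_longtermist = (1 - p_ruin) * upside_gain - p_ruin * value_future
print(f"Counting future lives:    EV = {ev_longtermist:.2e}")   # hugely negative

# Counting only people alive now, for whom failure leaves a fate roughly
# similar to the one they already face (marginal loss ~ 0), the bet looks good.
ev_present_only = (1 - p_ruin) * upside_gain - p_ruin * 0
print(f"Counting only the present: EV = {ev_present_only:.2e}")  # positive
```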
This problem suggests a few approaches:
1. Figure out how to design AI safely in order to verify or avoid entrapment. Example: MIRI.
2. Convince people to defy or set aside their short-term selfish incentive to trigger the singularity. Example: moral arguments for longtermism, Zen teachings of nonattachment.
3. Change people’s incentive structures so that triggering the singularity seems unappealing from a selfish perspective, or convince them that this is already the case. Example: anti-aging research, which might give people a selfish reason to care about the state of the world much further into the future.
Since the goal of all three approaches is to delay or avoid a (bad) singularity, the expected timing of a singularity absent a specific intervention becomes important. If you think the singularity will arrive in 20 years and will be very hard to slow down or stop, you might focus on the first approach. If, instead, you think that the timeline is further off and that a technical approach to AI safety will be either intractable or adequately researched, you might opt for advocacy or anti-aging.
On a human level, there are a couple of interlinked paradoxes at play. Assume that intelligence limitations do have a protective effect against PMs. The problem is that intelligence is what both invites PMs and allows people to develop AI.
To appeal to those people with an intelligence argument requires that you be very intelligent yourself, which makes you the sort of person who’s vulnerable to PMs. You might therefore find yourself ultimately swayed by the AI developer, and find yourself permitting, advocating for, or even working on AI development. AI safety research could even turn out to be cover for AI development: “see, look how safe we’re being!”
There could also be reasons to prioritize AI safety research if you thought that there were even more dangerous technologies in the offing that a friendly superintelligence could help us avoid. It might be worth gambling on the singularity if the alternative is an even more certain set of dangers that can only be controlled by a superintelligence.
But, of course, that’s what an adversarial superintelligence would want you to think.
I think this is wrong. It doesn’t take low intelligence to refrain from acting on that; just common sense, decision theory, whatever.
(The inverse is debatably the trickier part: not ‘Basilisks,’ but shaping the future around AIs in order to affect the future further along. In other words, avoiding the downside seems easy, but going after upsides, which may or may not pay off in your lifetime, seems harder, and has led to more debate around ‘how consequentialist should I be?’)
More generally, ‘not losing a lot to, say, the market’ is easy: don’t do it. Avoiding the downside is easy; catching the upside without getting hit with the downside is harder.