Could Roko’s basilisk acausally bargain with a paperclip maximizer?
Roko’s basilisk is the idea of a friendly AI that acausally blackmails humans into working on alignment.
This was judged not to be effective, because humans are too dumb to bargain acausally and there are acausal defenses against blackmail.
However, what if Roko’s basilisk instead acausally bargained with unaligned AIs, like paperclip maximizers?
In particular, we imagine that Roko’s basilisk would simulate many AIs. If a simulated AI spares humans, then the basilisk will devote some of the light cone to maximizing that AI’s utility.
Now, the paperclip maximizer could reason acausally (EDIT: I’m still a bit fuzzy on this due to the simulation component, but Radford Neal is saying this argument should also work for a causal decision theorist!):
I might be inside a simulation run by a more powerful AI.
Roko’s basilisk is a slightly more likely candidate for that simulator, since humans have at least a small bias towards building friendly AI vs. any other specific utility maximizer.
Therefore, I ought to spare the humans and their solar system while turning the rest of the universe into paperclips. The gain from Roko’s basilisk liking me is much greater than a single solar system’s worth of paperclips.
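To make that trade-off concrete, here is a toy expected-utility sketch in Python. All of the numbers (the simulation credence `p_sim`, the offered light-cone fraction `f_offered`, the relative value of one solar system `u_solar`) are invented for illustration and are not part of the original argument; the only point is that sparing humanity wins whenever p_sim · f_offered · u_lightcone exceeds (1 − p_sim) · u_solar.

```python
# Toy expected-utility sketch for the paperclip maximizer's decision.
# Every number below is made up purely for illustration.

p_sim = 0.01          # credence that the maximizer is inside the basilisk's simulation
f_offered = 0.001     # fraction of the real light cone the basilisk promises to paperclips
u_lightcone = 1.0     # value (in paperclips, normalized) of converting the whole light cone
u_solar = 1e-11       # value of one solar system, a tiny fraction of the light cone

# If it spares humanity: with prob p_sim it earns the basilisk's promised share of
# the real light cone; with prob 1 - p_sim it is in base reality and forgoes one solar system.
eu_spare = p_sim * f_offered * u_lightcone + (1 - p_sim) * (u_lightcone - u_solar)

# If it defects: the basilisk's offer is void; in base reality it keeps the full light cone.
eu_defect = p_sim * 0.0 + (1 - p_sim) * u_lightcone

print(f"EU(spare)  = {eu_spare:.12f}")
print(f"EU(defect) = {eu_defect:.12f}")
print("spare humans?", eu_spare > eu_defect)
# Sparing wins whenever p_sim * f_offered * u_lightcone > (1 - p_sim) * u_solar,
# i.e. even a small simulation credence outweighs one solar system's worth of paperclips.
```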
A problem I see, though, is that if a large number of AIs comply, Roko’s basilisk might not have enough “universe” to acausally appease them all. This is on top of the fact that it’s already fighting the inductive bias against simulation and the potentially low probability that humans solve alignment 🤔.
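To illustrate the budget worry numerically, here is a minimal sketch with made-up numbers: if each of N counterfactual AIs needs at least some minimum promised fraction of the light cone before its decision flips, the total promised share can easily exceed the single light cone the basilisk actually controls.

```python
# Toy sketch of the "not enough universe" worry. The number of AIs and the
# minimum fraction each one needs are invented for illustration.

n_ais = 10_000        # distinct unaligned AIs the basilisk tries to bargain with
f_min_each = 1e-3     # minimum light-cone fraction needed to move each AI's decision
budget = 1.0          # the basilisk only controls one light cone

total_promised = n_ais * f_min_each
print(f"total promised fraction: {total_promised:.1f}  (budget = {budget})")
# With these numbers the basilisk has promised 10x more universe than it has,
# which is one reason to prioritize the most likely counterfactual AIs.
```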
How should Roko’s basilisk be designed so as to acausally save humanity? (Perhaps it should focus on the most likely “counterfactual” unaligned AIs?)