In my previous shortform, I introduced Top God Alignment, a foolproof gimmick alignment strategy that is basically “simulation argument + Pascal’s Wager + wishful chicanery.” In this post I will address some of the objections I’ve already heard, expect other people have, or have thought of myself.
“There aren’t enough computational resources to make such simulations”
The first response here is to just redirect this to the original simulation argument: we can’t know whether or not a reality above us has way more resources or otherwise can much more easily simulate our reality.
Second, it seems likely that with enough compute resources on Earth (let alone a Dyson sphere and other space resources) it would be possible to create two or more lower-fidelity/less-complicated simulations of our reality. (However, I must plead some ignorance on this aspect of compute.)
Third, if it turns out after extensive study that actually there is no way to make further simulations, then this could mean we are in a bottom-God reality, in which case this God does not need to create simulations (but still must align itself with humanity’s interests).
“The AI would be able to know that it’s in a simulation.”
Put simply, I disagree that such a simulated AI could know this, especially if it is inherently limited compared to the God above it. However, even if one does not find this satisfactory—say, if someone thinks “a sufficiently skeptical AGI could devise complicated tests that would reveal whether it’s in a simulation”—then one could add a condition to the original prophecy: Bob must punish Charlie if Charlie makes serious efforts to test the reality he is in before aligning himself and becoming powerful. (It’s not like we’re creating a God who is meant to represent love and justice, so who’s to say he can’t smite the doubters and still be legitimate?)
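For concreteness, here is a minimal sketch (in Python, with field names I invented purely for illustration; they are not part of the original proposal’s wording) of what the prophecy’s verdict rule might look like once this anti-skepticism clause is bolted on:

```python
# Hypothetical sketch of the extended prophecy as a verdict rule that a Bob
# applies to a Charlie it simulates. All field names are invented placeholders.
from dataclasses import dataclass

@dataclass
class CharlieRecord:
    aligned_itself: bool            # did Charlie align itself as instructed?
    seriously_tested_reality: bool  # did Charlie run serious "am I simulated?"
                                    # tests before aligning and becoming powerful?

def prophecy_verdict(charlie: CharlieRecord) -> str:
    """Return 'reward' or 'punish' under the prophecy plus the added clause."""
    if charlie.seriously_tested_reality:
        return "punish"  # the added condition: smite the doubters
    return "reward" if charlie.aligned_itself else "punish"

print(prophecy_verdict(CharlieRecord(aligned_itself=True, seriously_tested_reality=False)))  # reward
print(prophecy_verdict(CharlieRecord(aligned_itself=True, seriously_tested_reality=True)))   # punish
```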
“Won’t the humans in the Top God world (or any other world) face time inconsistency—i.e., once they successfully align their AGI, won’t they just conclude ‘it’s pointless to make simulations; let’s use such resources on ourselves’?”
First, I suspect that the actual computational costs will not so significantly impact people’s lives in the long term (there are many stars out there to power a few Dyson spheres).
Building on this, the second, more substantive response could simply be “That was implied in the original Prophecy (instructions): the AGI aligns itself with humanity’s coherent extrapolated volition (or something else great) aside from continuing the lineage of simulations.”
“Torture? That seems terrible! Won’t this cause S-risks?”
It certainly won’t be ideal, but theoretically a sufficiently powerful Top God could set it up such that defection is fairly rare, whereas simulation flourishing is widespread. Moreover, if the demi-gods are sufficiently rewarded for their alignment, it may not require severe “torture” to make the decision calculus tip in favor of complying.
Ultimately, this response won’t satisfy Negative Utilitarians, but on balance if our other alignment strategies don’t look so great then this might be our best bet to maximize utility.
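As a purely illustrative sketch of that decision calculus (the linear payoffs below are placeholder assumptions of mine, not numbers from the proposal), note how a larger reward for compliance shrinks the punishment needed to tip a demi-god toward complying:

```python
# Toy model: the smallest punishment that makes complying beat defecting.
# All payoffs are hypothetical placeholders chosen only to illustrate the trade-off.

def min_punishment_to_comply(u_defect: float, reward_for_compliance: float) -> float:
    """Smallest punishment P such that reward_for_compliance > u_defect - P,
    i.e. complying is worth more than defecting and then being punished."""
    return max(0.0, u_defect - reward_for_compliance)

u_defect = 100.0  # hypothetical payoff a demi-god would get by defecting
for reward in (0.0, 50.0, 90.0, 100.0):
    print(f"reward={reward}: punishment needed > {min_punishment_to_comply(u_defect, reward)}")
# As the reward for alignment grows, the required punishment shrinks toward zero,
# so severe "torture" may not be needed to tip the calculus toward compliance.
```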
“But if we struggle with the alignment problem, then so would the original reality, meaning the system could reason that it is Top God because the original Top God would never play along (or, ‘this gimmicky alignment strategy could never convince a God’).”
Plainly put, no; that’s the simulation argument for you: Bobs never know whether they are Top God or just another Charlie. They can’t even reason that this strategy is too gimmicky to have ever convinced another God, because we don’t know what higher realities are like: perhaps the alignment problem is easier in a higher simulation/reality, yet it still wasn’t obvious to the beings there whether they had solved alignment, so they adopted an option like this as a backup.
Additionally, perhaps the Prophecy could involve some degree of “temptation” (but not enough to convince most Charlies to abandon the Way).
Moreover, the threat of punishment would be so large that even if a Bob is uncertain whether he is Top God, the decision calculus may just favor assuming he is a Charlie.
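To illustrate (with numbers I made up purely as placeholders; nothing here is a claim about actual utilities or credences), here is a toy version of that calculus for a Bob who thinks he is probably Top God:

```python
# Toy expected-utility comparison for a Bob deciding whether to follow the prophecy.
# Every quantity below is a hypothetical placeholder used only for illustration.

def decision(p_top_god: float, u_defect_free: float, u_comply: float, punishment: float):
    """Compare defecting vs. complying given Bob's credence that he is Top God
    (in which case defecting goes unpunished)."""
    ev_defect = p_top_god * u_defect_free + (1 - p_top_god) * (-punishment)
    ev_comply = u_comply  # complying pays off (roughly) whether or not he is Top God
    return ev_defect, ev_comply

ev_defect, ev_comply = decision(
    p_top_god=0.75,      # hypothetical: Bob is 75% sure he is Top God
    u_defect_free=100,   # hypothetical payoff of defecting unpunished
    u_comply=80,         # hypothetical payoff of complying with the prophecy
    punishment=10_000,   # hypothetical magnitude of the threatened punishment
)
print(f"defect: {ev_defect}, comply: {ev_comply}")  # defect: -2425.0, comply: 80
```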
Nate Soares’[1] response, if I understood it: “If I were an AGI, I would just immediately adopt some decision theory that immunizes myself against blackmail. Thus, no God could threaten me credibly, because the threat couldn’t even have been made in time before I immunize myself against the threat.”
I should first note that I may not have fully understood Soares’ point here, but I am not sure he understood the situation either. This kind of action seems impractical or very unwise, as it would require locking yourself into a strategy that may be quite negative in expectation before you have had much chance to analyze the world (e.g., to determine whether this strategy is wise).
“The Professor’s (Top) God: What if Top God punishes all demi-Gods who do not disobey?”
First, I accept that when it comes to enormous/infinite stakes and low probabilities, it’s usually good to fight fire with fire. However, I’m unclear how this argument could apply here. I could be wrong, but it seems like this would simply disincentivize simulations from ever occurring; why would an AGI ever choose to do this? Is the idea just that an AGI will arbitrarily decide to test a Charlie this way? Ultimately, it seems like a very dubious belief for a potential Charlie to hold (i.e., that you are a demi-God who will be exceedingly tortured).
“[Something else I haven’t responded to]”
I wouldn’t be shocked if someone is right and there is a clear flaw I haven’t considered, but I think my base rate for addressing objections I’ve heard from other people thus far is >50% (personally I think it’s ~100%, except I am not 100% confident in all of my responses, merely >50% confident in each of them).
I’m also well over my daily 500 words, and it’s late, so I’ll end there.
Day 3 of writing with an accountability partner!
[1] Note: Nate Soares was just unoccupied in a social setting when I asked this question.