I am confused by the part where the Rick-shard can anticipate which plan the other shards will bid for. If I understand shard theory correctly, shards do not have their own world models; they can only bid actions up or down according to the consequences those actions might have under the world model that is available to all shards. Please correct me if I am wrong about this point.
So I don't see how the Rick-shard could really "trick" the atheism-shard via rationalisation.
If the Rick-shard sees that "going to church for respect reasons" will lead to conversion, then the atheism-shard has to see that too, because both query the same world model. So the atheism-shard should bid against that plan just as heavily as against "going to church for conversion reasons".
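To make this reading concrete, here is a toy sketch in Python. It is only an illustration of the argument above, not a formalisation taken from shard theory: the plan names, outcome probabilities, shard weights, and the `bid` function are all hypothetical, and it assumes the Rick-shard cares about respect while the atheism-shard cares about avoiding conversion.

```python
# Toy model: every shard scores plans using the SAME shared world model.
# All names and numbers are hypothetical illustrations.

# Shared world model: plan -> predicted probability of each outcome.
WORLD_MODEL = {
    "go to church for respect reasons": {"conversion": 0.8, "respect": 0.9},
    "go to church for conversion reasons": {"conversion": 0.8, "respect": 0.1},
    "stay home": {"conversion": 0.0, "respect": 0.0},
}

# A shard is just a set of weights over outcomes; it has no private model.
SHARDS = {
    "rick": {"respect": 1.0},
    "atheism": {"conversion": -1.0},
}


def bid(shard: str, plan: str) -> float:
    """Return how strongly a shard bids for (positive) or against (negative) a plan."""
    predicted = WORLD_MODEL[plan]  # the same lookup for every shard
    weights = SHARDS[shard]
    return sum(w * predicted.get(outcome, 0.0) for outcome, w in weights.items())


if __name__ == "__main__":
    for plan in WORLD_MODEL:
        print(plan, {shard: bid(shard, plan) for shard in SHARDS})
```

In this sketch the atheism-shard's bid depends only on the predicted chance of conversion, so relabelling the reasons for the plan does not change its bid. For the trick to work, the shards would need to see different predictions, which is exactly what a shared world model rules out.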
I think something else is going on here. I think the Rick-shard does not trick the atheism-shard, but rather the conscious part of the mind, which is not described by shard theory.
Yes, I would consider humans to be unsafe already, since we have made a sharp left turn that left us unaligned relative to our outer optimiser.
Dogs are a good point, thank you for that example. I am not sure whether dogs have our exact notion of corrigibility, but they definitely seem to be friendly in some relevant sense.