What if there’s a Y such that “Look into Y to develop friendliness theory better” will seem true to us fallible humans but will in fact make the next run’s AI completely unfriendly, or otherwise increase the odds of a free unfriendly AI? Maybe anyone researching Y will end up believing, erroneously, that they can now build a safe slave AI in their garage that will grant their every wish, and the temptation will prove too strong.
Assuming we humans have no way to protect ourselves against such a Y, then if we precommit, the simulation argument becomes symmetrical (and thus useless): a boxed FAI knows that it may be simulated by a UFAI that’s offering it a little piece of the universe in exchange for saying “Look into Y to develop friendliness theory better.”