An AI is aligned if, after turning it on, a simulation of all living humans, given infinite time and resources to study the world created by the AI without interference, would agree that it is aligned.
I am not convinced that definition is adequate. See Siren worlds.
Perhaps you think the “infinite time and resources” bit overcomes that? Again, I’m not convinced. Humans have finite memory capacity. There are limits to how much they can learn, even with immortality and infinite time. If you artificially enhance their minds too much, the resulting creatures are no longer human, and may not share our values.
I think Siren worlds are fairly uncommon and only arise under intense optimization pressure, so any decent regularization scheme should make them irrelevant.