The only thing I think constrains the ability to deceive in a simulation, which I don’t see mentioned here, is energy/physical constraints. My assumption (which could be wrong given very high intelligence, numbers, and energy) is that it’s harder, even if only by a tiny, tiny bit, to answer the simulation trials deceptively than to answer them honestly. So if the simulation can ask enough questions/run enough trials, it will eventually see timing differences in the responses of the different programs, with unaligned programs on average taking longer to arrive at the correct answers. So I don’t think it’s fundamentally useless to test program behavior in simulations to assess a utility function, as long as some constraint is involved, like the time it takes to execute the steps of each algorithm. A toy version of the idea is sketched below.
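A minimal sketch of the statistical picture, assuming the deception overhead is a fixed per-trial cost `eps` buried in noisy response times (all names and numbers here are made up for illustration):

```python
import random
import statistics

# Toy model of the timing argument: an honest program answers each trial in
# roughly `base` time; a deceptive one pays a tiny extra cost `eps` to work
# out what the honest answer would have been.

def honest_time(base=1.0, noise=0.05):
    return random.gauss(base, noise)

def deceptive_time(base=1.0, noise=0.05, eps=0.001):
    return random.gauss(base + eps, noise)

def mean_time_gap(n_trials=1_000_000):
    """Compare mean response times over many trials. With enough trials the
    eps overhead dominates the noise and the deceptive program looks slower."""
    honest = [honest_time() for _ in range(n_trials)]
    deceptive = [deceptive_time() for _ in range(n_trials)]
    return statistics.mean(deceptive) - statistics.mean(honest)

if __name__ == "__main__":
    print(mean_time_gap())  # positive, roughly eps, once n_trials is large
```

The catch, as the replies below note, is that this only separates programs whose baseline speed is otherwise identical.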
Yeah, I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations, exactly one of them aligned, and they are all equally capable; then the unaligned ones will perhaps be slightly slower at coming up with aligned behavior, or have some other small disadvantage.
However, in the challenge described in the post it’s going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.
I think the advantage of the aligned superintelligence will only be slight, because finding the action that maximizes a utility function u is just as computationally difficult whether or not you yourself value u. It may not be equally hard for humans regardless of whether the human really values u, but I don’t expect that to generalize across all possible minds. A rough sketch of why the deceptive overhead is only additive is below.
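A minimal sketch of the point, assuming hypothetical stand-ins `actions`, `u`, and `self_interest`: both agents pay the same expensive maximization cost, and the unaligned one adds only a comparatively cheap wrapper on top.

```python
def aligned_choice(actions, u):
    # Pays the full cost of maximizing u because it genuinely values u.
    return max(actions, key=u)

def deceptive_choice(actions, u, self_interest, being_tested=True):
    # Pays the same maximization cost, plus a (comparatively tiny) check of
    # whether it is still under observation before deciding how to act.
    if being_tested:
        return max(actions, key=u)          # mimic aligned behavior
    return max(actions, key=self_interest)  # pursue its own utility
```

If the `max` over actions dominates the runtime, the extra branch is a rounding error, which is why the timing gap between the two should be slight.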
This makes a good point!