Yes nothing is a guarantee in probabilities but can’t we just make it very easy for it to perfectly achieve its objective if it doesn’t go exactly the way we want it to, we just make an easier solution exist than disempowering us or wiping us out.
I guess in the long run we still select for models that ultimately don’t wirehead. But this might eliminate a lot of obviously wrong alignment failures we miss.
Yes nothing is a guarantee in probabilities but can’t we just make it very easy for it to perfectly achieve its objective if it doesn’t go exactly the way we want it to, we just make an easier solution exist than disempowering us or wiping us out.
I guess in the long run we still select for models that ultimately don’t wirehead. But this might eliminate a lot of obviously wrong alignment failures we miss.