A toy model I find helpful is correlated vs. uncorrelated safety measures. Suppose we have three safety measures, each with a 60% success rate in the event of an accident, and the AI remains safe if even one of them succeeds. If the safety measures are accurately described by independent random variables, our odds of safety in an accident are 1 − 0.4^3 ≈ 94%. If the successes of the safety measures are perfectly correlated, failure of one implies certain failure of the others, and our odds of safety are only 1 − 0.4 = 60%.
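For concreteness, here is a minimal sketch of the arithmetic in the two extreme cases, using only the toy numbers above (three measures, 60% success each); the constants and the quick Monte Carlo check are just illustrations, not part of any real safety setup:

```python
import random

P_SUCCESS = 0.6   # per-measure success probability in an accident (toy number)
N_MEASURES = 3    # number of safety measures (toy number)
TRIALS = 100_000  # Monte Carlo samples for a sanity check

# Independent measures: unsafe only if all three fail at once.
p_safe_independent = 1 - (1 - P_SUCCESS) ** N_MEASURES  # 1 - 0.4^3 ≈ 0.936

# Perfectly correlated measures: all succeed or all fail together.
p_safe_correlated = P_SUCCESS  # 0.6

def at_least_one_succeeds() -> bool:
    """Simulate one accident with independent measures; True if any measure holds."""
    return any(random.random() < P_SUCCESS for _ in range(N_MEASURES))

simulated = sum(at_least_one_succeeds() for _ in range(TRIALS)) / TRIALS

print(f"independent (exact):     {p_safe_independent:.3f}")
print(f"independent (simulated): {simulated:.3f}")
print(f"perfectly correlated:    {p_safe_correlated:.3f}")
```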
In my mind, this is a good argument for working on ideas like safely interruptible agents, impact measures, and boxing. The chance of these ideas failing seems fairly independent of the chance of your value learning system failing.
But I think you could get a similar effect by having your AGI search for models whose failure probabilities are uncorrelated with one another. The better your AGI, the better this approach is likely to work.