If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that's not a recoverable error.
I alluded to this above in many examples, but let's also do a theoretical calculation so it's not just anecdotes.
Suppose we have some AI “Foo” that has some probability of failure, or P(Failure of Foo).
Then the overall probability of the system containing Foo failing is P(Failure of System) == P(Failure of Foo).
On the other hand, suppose we have some AI "Bar" whose goal is to detect when AI "Foo" has failed, e.g. when Foo erroneously creates a plan that would harm humans or attempts to deceive them.
We can now calculate the new likelihood of P(Failure of System) == P(Failure of Foo) * P(Failure of Bar), where P(Failure of Bar) is the likelihood that Bar either failed to detect the issue with Foo, or detected it but failed to notify us.
These two probabilities may be correlated, but they don't have to be; the product form above assumes Foo's and Bar's failures are independent, and if they're correlated the second factor becomes Bar's chance of failing given that Foo has already failed. Either way, the lesson holds: it is possible to drastically reduce the probability of a system failing by adding components within that system, even if those new components have chances of failure themselves.
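To make that arithmetic concrete, here's a minimal sketch in Python. The 5% and 10% failure rates are made up purely for illustration, and the two failures are assumed independent:

```python
# Hypothetical failure probabilities, chosen purely for illustration.
p_foo_fails = 0.05  # Foo produces a harmful or deceptive plan
p_bar_fails = 0.10  # Bar misses the problem, or catches it but fails to notify us

# Without Bar, the system fails whenever Foo does.
p_system_without_bar = p_foo_fails

# With Bar as an independent monitor, the system fails only when
# Foo fails AND Bar also fails to catch it.
p_system_with_bar = p_foo_fails * p_bar_fails

print(f"Without Bar: {p_system_without_bar:.3%}")  # 5.000%
print(f"With Bar:    {p_system_with_bar:.3%}")     # 0.500%
```

Even though Bar is itself assumed to fail 10% of the time, adding it cuts the system's failure rate by a factor of ten.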
In particular, so long as the requirement allocated to Bar is narrow enough, we can make Bar more reliable than Foo and thereby lower the overall chance of the system failing. One way this works is by limiting Bar's functionality so that if Bar fails on its own, without Foo also failing, the system is unaffected. In the context of fault tolerance, we'd refer to that as a one-fault tolerant system. We can tolerate Foo failing, because Bar will catch it. And we can tolerate Bar failing, because on its own that doesn't impact the system's performance. We only have a problem if Foo fails and then Bar subsequently fails as well.
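Here's a minimal simulation sketch of that one-fault tolerant structure, using the same hypothetical failure rates as above; the foo_fails/bar_fails helpers are invented for illustration, not a real monitoring implementation:

```python
import random

def foo_fails(p: float = 0.05) -> bool:
    """Foo fails: it produces a harmful or deceptive plan (hypothetical rate)."""
    return random.random() < p

def bar_fails(p: float = 0.10) -> bool:
    """Bar fails: it misses Foo's failure, or fails to notify us (hypothetical rate)."""
    return random.random() < p

def system_fails() -> bool:
    # One-fault tolerance: either component failing alone is survivable.
    # Foo failing alone gets caught by Bar; Bar failing alone has nothing to miss.
    # The system only fails when Foo fails AND Bar subsequently fails too.
    return foo_fails() and bar_fails()

# Monte Carlo estimate of the combined failure rate (expected around 0.5%).
trials = 1_000_000
failures = sum(system_fails() for _ in range(trials))
print(f"Estimated system failure rate: {failures / trials:.3%}")
```

Note that bar_fails is only ever consulted when foo_fails has already returned True, which mirrors the argument above: a failure of Bar by itself never harms the system.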