This post distills someexcellentwriting on decision theory from Abram Demski about spurious proofs and Troll Bridge problems, and it has significant parallels with this work. It recently won a $250 prize in UC Berkeley Effective Altruism’s AI safety distillation contest. The goal of this post is to make two difficult-to-understand topics, spurious proofs and troll bridge problems, as easy to understand as possible.
An Innocent Question
A highly intelligent system should have the ability to reason with logic. After all, being logical is a lot more useful than being illogical. Let’s design such a system – a robot named Rob.
The goal of a logical system is to prove things from axioms. Axioms are facts that are assumed to be true with no need for justification. They can either be facts about objects or inference rules. For example, “0 is a natural number” and “for any natural number, n, Successor(n) is also a natural number” are both axioms of a widely-studied logical system called Peano Arithmetic. Axioms are often applied to other axioms. For example, I can use the two aforementioned ones to prove that Successor(0), a.k.a. the number 1, is a natural number. We call such a combination of axioms that reveals a new fact a proof and the result a theorem.
The goal of using logic is to teach ourselves useful things by proving them. And toward this end, we sure hope that our logical system can’t prove contradictions. If so, it wouldn’t be useful at all! [1]
Now back to Rob. Suppose we want him to use some logical system by which he reasons about himself and the world. Exactly what logical axioms he uses aren’t important as long as they are sufficiently expressive.[2] As we are tinkering away in the process of designing him to be highly-intelligent, suppose that Rob asks us a simple question.
Answering this question will lead us through some strange dilemmas, but when we’re done, understanding the solution will teach an important lesson about instrumental rationality: making a hypothetical commitment to doing something that would hypothetically be insane isn’t just ok – it can be really useful. Ultimately, understanding this principle will help us to understand how even highly-advanced AI systems could subtly fail if they aren’t paradox-proof.
Obeying the Proof: What Could Go Wrong?
A lot.
The first answer one might imagine for this question is “It would seem weird (and maybe even contradictory?) to disobey a proof that you would take a certain action, so wouldn’t the most sensible option be to just obey the proof?”
This makes a lot of intuitive sense, but there is a problem: Löb’s Theorem. In English, Löb’s Theorem states that “using some sufficiently expressive logical system, if a provability predicate for some theorem X would imply X, then X.” A provability predicate for a theorem X is another logical statement that acts as a sort of proxy for X. The details, including what being such a “proxy” means are beyond the scope of this piece.[3] But you can find a very fun cartoon-style proof of it here.
Here, the technical details are important, but we can get a lot of mileage out of a simple intuitive understanding of Löb’s Theorem. A non-rigorous but intuitive version that we can use is “if you can prove that the hypothetical existence of a proof of X would imply X, then X.”
This is not trivial. It’s a statement about a hypothetical proof which is not the same thing as a proof that we just assume exists for the sake of conditional reasoning. Here, hypothetical reasoning is looser than conditional reasoning. For example, in conditional reasoning, we shouldn’t condition on an absurdity. But if we are doing hypothetical reasoning, reasoning about an absurd counterfactual is fair game. This is actually a roundabout way of describing the technical distinction between a provability predicate and a proof. In cases like Löb’s Theorem, the point about reasoning with provability predicates is that we don’t need to concern ourselves with whether the predicate corresponds to an actual theorem or not when turning the crank.
This might sound confusing because it is. But don’t worry. The details aren’t the point. Understanding the nuanced distinction between these things has a time and place in the formal study of logic. But here, it will suffice to get in the van and proceed with the intuitive understanding: “if you can prove that the hypothetical existence of a proof of X would imply X, then X.” Things should become clearer when we see an example next.
Suppose that we design Rob to commit to always obeying proofs that he will take certain actions. Then the problem is that Löb’s Theorem can be used to show that Rob will take any arbitrary action. And as a robot who obeys these proofs, he would then have to take such an action (assuming that he has a formal understanding of his proof-obeying behavior).
Here, since Rob is the type of agent that would obey proofs about what actions he would take, the hypothetical existence of a proof that Rob would take action A would imply that he would take A. So then by Löb’s Theorem, it’s immediate that it can be proven to Rob that he would take action A. And if this can be proven, then he would take that action.
The resulting behavior is concerning. Imagine that Rob is in a situation where he has two options: take $5, or take $10. This strategy could be used to prove that he would take $5 which would then lead him to take the $5 even though it is clearly the wrong choice.
Rob is left vulnerable to taking arbitrary actions as long as he is either presented with this type of proof or he thinks of one himself.[4] This would essentially be a backdoor that would allow for Rob to be manipulated into doing anything simply be presenting him with proofs! This is known as a vulnerability to “spurious proofs.”
So where did we go wrong? In order to not be vulnerable to spurious proofs, Rob needs to be willing to hypothetically disobey proofs sometimes about what actions will take. This can be viewed as a method of diagonalizing against them.
But isn’t this a recipe for contradictions? If Rob is willing to disobey proofs about what actions he will take, then wouldn’t he be demonstrating his own inconsistency if he ever does so? Yes, with an emphasis on the “if.” By hypothetically being willing to sometimes disobey such a proof, he precludes the possibility for such spurious proofs (if his logical system is consistent). So this would never be a problem in reality. There are no real contradictions here – only hypothetical ones. For a more in-depth discussion, see this paper.
So interestingly, it is possible that making a hypothetical commitment to doing something that would hypothetically be insane not only isn’t problematic, but it can be of great practical benefit to Rob! We will see another example of this principle next.
Disobeying the Proof: Here be Trolls
Now that we know the hazards of spurious proofs, suppose we design Rob to be willing to disobey a proof that he will take a certain action. Then we need to consider a new type of dilemma.
Suppose that Rob needs to cross a river and has two options. First, he could swim which would work, but he would get wet, so he would obtain a utility of 0. Alternatively, he could attempt to cross a bridge with a troll underneath. The catch is that the troll might blow up the bridge. If Rob crosses successfully, he gets a utility of 1. But if the bridge blows up, he gets a negative utility of -B for some value B > 0.
The problem is that this isn’t just any other troll – it has access to Rob’s source code and can perfectly read his mind. And this troll wants to play a game. The troll commits to blowing up the bridge if Rob crosses without believing that the expected value of crossing is positive. Another way to phrase this is that the troll will blow up the bridge if Rob crosses as an exploratory action rather than an exploitatory one. And yet another way to phrase this is simply that the troll will blow up the bridge if Rob crosses “for a dumb reason.”
What should Rob do? There is one very natural-seeming line of reasoning that would lead Rob to think that he should probably not cross. And spoiler alert – it has a subtle problem.
Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.
Here’s another intuitive explanation. Remember our intuitive understanding of Löb’s Theorem: “if you can prove that the hypothetical existence of a proof of X would imply X, then X.” So if Rob can prove that a proof that it would be dumb to cross the bridge would make it dumb to cross the bridge, which it would in this case, then it would indeed be dumb to cross the bridge. So he wouldn’t do it.
The proof sketch here seems valid, so what’s wrong? Upon deeper inspection, we can notice that this is a very, very peculiar conclusion to arrive at. Rob reaching this conclusion means that he believes the condition in which he would cross the bridge would prove his own logical inconsistency. It’s awfully weird to say that Rob should just go about hypothetical reasoning like this by conditioning on a contradiction that indicates the inconsistency of the very logical system that he is using to do the conditional reasoning.
So if the conditional reasoning proves itself absurd, then…um…
What if?! There’s an argument that Rob could be justified in reasoning that he could cross and things at least might be fine because the proof that they wouldn’t be fine is circular – it requires that one already accept that very proof as valid.
It seems much too rigid to say that Rob should never cross regardless of his prior beliefs. What if -B = −0.000001 and his prior that the bridge would blow up was 0.000001? In this case, the prior clearly says to cross, but Rob still would conclude from the Löbian proof that he shouldn’t. This is frustrating. Here, our troll will only punish Rob for crossing when his state of belief says it’s a bad idea. But even if his prior says crossing is a good idea, thanks to the Löbian proof, his posteriors won’t.
The funny thing is that when it comes to crossing, there’s nothing really stopping him from it. And if he actually did cross, the reasoning that says it would be bad to cross would be nullified as far as Rob is concerned.So does this mean that he should cross and apparently prove himself insane?
Not quite. That actually would be insane.
The solution isn’t to embrace irrationality. It’s to use the correct approach to hypothetical reasoning that allows us to avoid the Löbian proof in the first place. The key is to remember the distinction between hypothetical reasoning and conditional reasoning. In other words, we should remember that counterfactuals are not computed from conditionals.
Why does this fix the problem? In the reasoning above, we take the hypothetical too seriously. If we don’t require that Rob is consistent within the hypothetical, he could cross anyway inside of it. This prevents the Löbian proof from going through. Why? Because within the hypothetical, if it were proven that it would be dumb to cross the bridge, this would make the conditional expectations of doing so poor, but it doesn’t have to change the counterfactual expectations. Instead, these counterfactual expectations should be the same as Rob’s priors about crossing because the conditional line of reasoning gets Rob nowhere he wants to go since it stems from a contradiction. So within the hypothetical, Rob can still justifiably cross if his priors allow. And this would create a contradiction within the hypothetical, but who cares? We don’t need to care about hypothetical contradictions – only real ones.
The beauty of this all is that this doesn’t affect Rob in the real world at all. Since the Löbian proof doesn’t work, there is never a difference between the counterfactual and the conditional in practice. Rob is free to cross the bridge according to his prior beliefs.
Where are We Now?
The moral of the story is that making a hypothetical commitment to doing something that would hypothetically be insane can be very useful. We have now seen this principle twice. Once to avoid spurious proofs, and once to avoid manipulation by trolls.
To quote this post, “Counterfactual reasoning should match hypothetical reasoning in the real world, but shouldn’t necessarily match it hypothetically.”
This is a strange but wonderful conclusion. And the best thing about it is the lack of any practical downside to embracing hypothetical absurdity. In that sense it’s a bit like a superpower. By virtue of just thinking about problems like these with the right decision-theoretic framework, one can avoid certain types of manipulation with no real-world side effects – just hypothetical ones.[5]
When designing highly-intelligent AI systems like Rob, it will be crucial to make sure that they are predictable, controllable, and robust to failures. A system being liable to make certain mistakes may sometimes be a great thing. For example, in some circumstances, perhaps we want a system that we can exercise control over with spurious proofs. But it could also be a bad thing. We also don’t want a system that falls for proofs it shouldn’t.
Given these challenges posed by proofs plus other decision theoretic dilemmas, it’s important to be very careful about making systems think and behave how we want them to – not just most of the time – all of the time. Even if a system seems excellent at achieving its goals in the real world with normal tasks, that doesn’t mean that it won’t make mistakes if ever presented with strange or counterintuitive dilemmas like the ones covered here!
In fact, in systems like Peano Arithmetic, a single contradiction can be used to prove any arbitrary statement, making the whole logical system nonsense.
It’s worth noting that no contradictions can arrive from a Löbian proof that Rob would take $5 followed by a similar Löbian proof that he would take $10 because the second proof would not go through. As long as Rob can prove that different actions are mutually exclusive, he can no longer commit to obeying additional proofs beyond the first.
Pitfalls with Proofs
This post distills some excellent writing on decision theory from Abram Demski about spurious proofs and Troll Bridge problems, and it has significant parallels with this work. It recently won a $250 prize in UC Berkeley Effective Altruism’s AI safety distillation contest. The goal of this post is to make two difficult-to-understand topics, spurious proofs and troll bridge problems, as easy to understand as possible.
An Innocent Question
A highly intelligent system should have the ability to reason with logic. After all, being logical is a lot more useful than being illogical. Let’s design such a system – a robot named Rob.
The goal of a logical system is to prove things from axioms. Axioms are facts that are assumed to be true with no need for justification. They can either be facts about objects or inference rules. For example, “0 is a natural number” and “for any natural number, n, Successor(n) is also a natural number” are both axioms of a widely-studied logical system called Peano Arithmetic. Axioms are often applied to other axioms. For example, I can use the two aforementioned ones to prove that Successor(0), a.k.a. the number 1, is a natural number. We call such a combination of axioms that reveals a new fact a proof and the result a theorem.
The goal of using logic is to teach ourselves useful things by proving them. And toward this end, we sure hope that our logical system can’t prove contradictions. If so, it wouldn’t be useful at all! [1]
Now back to Rob. Suppose we want him to use some logical system by which he reasons about himself and the world. Exactly what logical axioms he uses aren’t important as long as they are sufficiently expressive.[2] As we are tinkering away in the process of designing him to be highly-intelligent, suppose that Rob asks us a simple question.
Answering this question will lead us through some strange dilemmas, but when we’re done, understanding the solution will teach an important lesson about instrumental rationality: making a hypothetical commitment to doing something that would hypothetically be insane isn’t just ok – it can be really useful. Ultimately, understanding this principle will help us to understand how even highly-advanced AI systems could subtly fail if they aren’t paradox-proof.
Obeying the Proof: What Could Go Wrong?
A lot.
The first answer one might imagine for this question is “It would seem weird (and maybe even contradictory?) to disobey a proof that you would take a certain action, so wouldn’t the most sensible option be to just obey the proof?”
This makes a lot of intuitive sense, but there is a problem: Löb’s Theorem. In English, Löb’s Theorem states that “using some sufficiently expressive logical system, if a provability predicate for some theorem X would imply X, then X.” A provability predicate for a theorem X is another logical statement that acts as a sort of proxy for X. The details, including what being such a “proxy” means are beyond the scope of this piece.[3] But you can find a very fun cartoon-style proof of it here.
Here, the technical details are important, but we can get a lot of mileage out of a simple intuitive understanding of Löb’s Theorem. A non-rigorous but intuitive version that we can use is “if you can prove that the hypothetical existence of a proof of X would imply X, then X.”
This is not trivial. It’s a statement about a hypothetical proof which is not the same thing as a proof that we just assume exists for the sake of conditional reasoning. Here, hypothetical reasoning is looser than conditional reasoning. For example, in conditional reasoning, we shouldn’t condition on an absurdity. But if we are doing hypothetical reasoning, reasoning about an absurd counterfactual is fair game. This is actually a roundabout way of describing the technical distinction between a provability predicate and a proof. In cases like Löb’s Theorem, the point about reasoning with provability predicates is that we don’t need to concern ourselves with whether the predicate corresponds to an actual theorem or not when turning the crank.
This might sound confusing because it is. But don’t worry. The details aren’t the point. Understanding the nuanced distinction between these things has a time and place in the formal study of logic. But here, it will suffice to get in the van and proceed with the intuitive understanding: “if you can prove that the hypothetical existence of a proof of X would imply X, then X.” Things should become clearer when we see an example next.
Suppose that we design Rob to commit to always obeying proofs that he will take certain actions. Then the problem is that Löb’s Theorem can be used to show that Rob will take any arbitrary action. And as a robot who obeys these proofs, he would then have to take such an action (assuming that he has a formal understanding of his proof-obeying behavior).
Here, since Rob is the type of agent that would obey proofs about what actions he would take, the hypothetical existence of a proof that Rob would take action A would imply that he would take A. So then by Löb’s Theorem, it’s immediate that it can be proven to Rob that he would take action A. And if this can be proven, then he would take that action.
The resulting behavior is concerning. Imagine that Rob is in a situation where he has two options: take $5, or take $10. This strategy could be used to prove that he would take $5 which would then lead him to take the $5 even though it is clearly the wrong choice.
Rob is left vulnerable to taking arbitrary actions as long as he is either presented with this type of proof or he thinks of one himself.[4] This would essentially be a backdoor that would allow for Rob to be manipulated into doing anything simply be presenting him with proofs! This is known as a vulnerability to “spurious proofs.”
So where did we go wrong? In order to not be vulnerable to spurious proofs, Rob needs to be willing to hypothetically disobey proofs sometimes about what actions will take. This can be viewed as a method of diagonalizing against them.
But isn’t this a recipe for contradictions? If Rob is willing to disobey proofs about what actions he will take, then wouldn’t he be demonstrating his own inconsistency if he ever does so? Yes, with an emphasis on the “if.” By hypothetically being willing to sometimes disobey such a proof, he precludes the possibility for such spurious proofs (if his logical system is consistent). So this would never be a problem in reality. There are no real contradictions here – only hypothetical ones. For a more in-depth discussion, see this paper.
So interestingly, it is possible that making a hypothetical commitment to doing something that would hypothetically be insane not only isn’t problematic, but it can be of great practical benefit to Rob! We will see another example of this principle next.
Disobeying the Proof: Here be Trolls
Now that we know the hazards of spurious proofs, suppose we design Rob to be willing to disobey a proof that he will take a certain action. Then we need to consider a new type of dilemma.
Suppose that Rob needs to cross a river and has two options. First, he could swim which would work, but he would get wet, so he would obtain a utility of 0. Alternatively, he could attempt to cross a bridge with a troll underneath. The catch is that the troll might blow up the bridge. If Rob crosses successfully, he gets a utility of 1. But if the bridge blows up, he gets a negative utility of -B for some value B > 0.
The problem is that this isn’t just any other troll – it has access to Rob’s source code and can perfectly read his mind. And this troll wants to play a game. The troll commits to blowing up the bridge if Rob crosses without believing that the expected value of crossing is positive. Another way to phrase this is that the troll will blow up the bridge if Rob crosses as an exploratory action rather than an exploitatory one. And yet another way to phrase this is simply that the troll will blow up the bridge if Rob crosses “for a dumb reason.”
What should Rob do? There is one very natural-seeming line of reasoning that would lead Rob to think that he should probably not cross. And spoiler alert – it has a subtle problem.
Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.
Here’s another intuitive explanation. Remember our intuitive understanding of Löb’s Theorem: “if you can prove that the hypothetical existence of a proof of X would imply X, then X.” So if Rob can prove that a proof that it would be dumb to cross the bridge would make it dumb to cross the bridge, which it would in this case, then it would indeed be dumb to cross the bridge. So he wouldn’t do it.
The proof sketch here seems valid, so what’s wrong? Upon deeper inspection, we can notice that this is a very, very peculiar conclusion to arrive at. Rob reaching this conclusion means that he believes the condition in which he would cross the bridge would prove his own logical inconsistency. It’s awfully weird to say that Rob should just go about hypothetical reasoning like this by conditioning on a contradiction that indicates the inconsistency of the very logical system that he is using to do the conditional reasoning.
So if the conditional reasoning proves itself absurd, then…um…
What if?! There’s an argument that Rob could be justified in reasoning that he could cross and things at least might be fine because the proof that they wouldn’t be fine is circular – it requires that one already accept that very proof as valid.
It seems much too rigid to say that Rob should never cross regardless of his prior beliefs. What if -B = −0.000001 and his prior that the bridge would blow up was 0.000001? In this case, the prior clearly says to cross, but Rob still would conclude from the Löbian proof that he shouldn’t. This is frustrating. Here, our troll will only punish Rob for crossing when his state of belief says it’s a bad idea. But even if his prior says crossing is a good idea, thanks to the Löbian proof, his posteriors won’t.
The funny thing is that when it comes to crossing, there’s nothing really stopping him from it. And if he actually did cross, the reasoning that says it would be bad to cross would be nullified as far as Rob is concerned. So does this mean that he should cross and apparently prove himself insane?
Not quite. That actually would be insane.
The solution isn’t to embrace irrationality. It’s to use the correct approach to hypothetical reasoning that allows us to avoid the Löbian proof in the first place. The key is to remember the distinction between hypothetical reasoning and conditional reasoning. In other words, we should remember that counterfactuals are not computed from conditionals.
Why does this fix the problem? In the reasoning above, we take the hypothetical too seriously. If we don’t require that Rob is consistent within the hypothetical, he could cross anyway inside of it. This prevents the Löbian proof from going through. Why? Because within the hypothetical, if it were proven that it would be dumb to cross the bridge, this would make the conditional expectations of doing so poor, but it doesn’t have to change the counterfactual expectations. Instead, these counterfactual expectations should be the same as Rob’s priors about crossing because the conditional line of reasoning gets Rob nowhere he wants to go since it stems from a contradiction. So within the hypothetical, Rob can still justifiably cross if his priors allow. And this would create a contradiction within the hypothetical, but who cares? We don’t need to care about hypothetical contradictions – only real ones.
The beauty of this all is that this doesn’t affect Rob in the real world at all. Since the Löbian proof doesn’t work, there is never a difference between the counterfactual and the conditional in practice. Rob is free to cross the bridge according to his prior beliefs.
Where are We Now?
The moral of the story is that making a hypothetical commitment to doing something that would hypothetically be insane can be very useful. We have now seen this principle twice. Once to avoid spurious proofs, and once to avoid manipulation by trolls.
To quote this post, “Counterfactual reasoning should match hypothetical reasoning in the real world, but shouldn’t necessarily match it hypothetically.”
This is a strange but wonderful conclusion. And the best thing about it is the lack of any practical downside to embracing hypothetical absurdity. In that sense it’s a bit like a superpower. By virtue of just thinking about problems like these with the right decision-theoretic framework, one can avoid certain types of manipulation with no real-world side effects – just hypothetical ones.[5]
When designing highly-intelligent AI systems like Rob, it will be crucial to make sure that they are predictable, controllable, and robust to failures. A system being liable to make certain mistakes may sometimes be a great thing. For example, in some circumstances, perhaps we want a system that we can exercise control over with spurious proofs. But it could also be a bad thing. We also don’t want a system that falls for proofs it shouldn’t.
Given these challenges posed by proofs plus other decision theoretic dilemmas, it’s important to be very careful about making systems think and behave how we want them to – not just most of the time – all of the time. Even if a system seems excellent at achieving its goals in the real world with normal tasks, that doesn’t mean that it won’t make mistakes if ever presented with strange or counterintuitive dilemmas like the ones covered here!
For further reading, I’d recommend Embedded Agency, and Achilles Heels for AGI/ASI via Decision Theoretic Adversaries.
In fact, in systems like Peano Arithmetic, a single contradiction can be used to prove any arbitrary statement, making the whole logical system nonsense.
The technical definition of “sufficiently expressive” is “at least as expressive as Peano Arithmetic”
Here, satisfying properties known as “adequacy”, “formal adequacy”, and “formal modus ponens” are required for being a proxy.
It’s worth noting that no contradictions can arrive from a Löbian proof that Rob would take $5 followed by a similar Löbian proof that he would take $10 because the second proof would not go through. As long as Rob can prove that different actions are mutually exclusive, he can no longer commit to obeying additional proofs beyond the first.
There is another “superpower” similar to this. See https://arxiv.org/abs/1710.05060.