You say that a “bad reason” is one that the agent’s own procedure would judge to be bad.
To elaborate a little, one way we could think about this would be that “in a broad variety of situations” the agent would think this property sounded pretty bad.
For example, the hypothetical “PA proves ⊥” would be evaluated as pretty bad by a proof-based agent in many situations; it would not expect its future self to make decisions well, so it would often have pretty poor performance bounds for its future self (e.g., the lowest utility available in the given scenario).
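To make that intuition concrete, here is a tiny toy sketch of my own (not anything from the original discussion) of how a proof-based agent’s guaranteed bound on its future self might collapse to the worst available utility under the hypothetical that its proof system is inconsistent; the function name and the payoff numbers are made up for illustration.

```python
# Toy illustration (assumed setup, not from the original discussion): how a
# proof-based agent might bound the performance of its *future self* under the
# hypothetical "my proof system proves ⊥".

def future_self_bound(utilities_by_action, proof_system_consistent):
    """Return the current agent's guaranteed lower bound on its future self's utility.

    utilities_by_action: dict mapping each available action to the utility it yields.
    """
    if proof_system_consistent:
        # Trust the future self to find a proof picking out the best action.
        return max(utilities_by_action.values())
    # An inconsistent system proves everything, so the future self could "justify"
    # any action at all; the only safe bound is the worst utility in the scenario.
    return min(utilities_by_action.values())

# Example with made-up payoffs:
actions = {"a": 5, "b": 0, "c": -10}
print(future_self_bound(actions, proof_system_consistent=True))   # 5
print(future_self_bound(actions, proof_system_consistent=False))  # -10 (lowest available)
```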
So far so good—your condition seems like one which a counterfactual reasoner would broadly find concerning.
It also passes the sniff test of “would I think the agent is being dumb if it didn’t cross for this reason?” The fact that there’s a troll waiting to blow up a bridge if I’m empirically incorrect about that very setup should not, in itself, make me too reluctant to cross a bridge. If I’m very confident that the situation is indeed as described, then intuitively, I should confidently cross.
But it seems that, if I believe your proof, I would not believe this any more. You don’t prove whether the agent crosses or not, but you do claim to prove that if the agent crosses, it in fact gets blown up. It seems you think the correct counterfactual (for such an agent) is indeed that it would get blown up if it crosses:
Thus, either the agent doesn’t cross the bridge or it does and the bridge explodes.
So if the proof is to be believed, it seems like the philosophical argument falls flat? If the agent fails to cross for this reason, then it seems you think it is reasoning correctly. If it crosses and explodes, then it fails because it had wrong counterfactuals. This also does not seem like much of an indictment of how it was reasoning—garbage in, garbage out. We can concern ourselves with achieving more robust reasoners, for sure, so that sometimes garbage in → treasure out. But that’s a far cry from the usual troll bridge argument, where the agent has a 100% correct description of the situation, and nonetheless, appears to mishandle it.
To summarize:
1. The usual troll bridge argument proves that the agent doesn’t cross. You fail to do so. This makes your argument less convincing, because we don’t actually see the agent engage in the weird behavior.
2. The usual troll bridge argument establishes a case where we intuitively disagree with the agent’s reasoning for not crossing. You’ve agreed with such reasoning in the end.
EDIT: I think I should withdraw the first point here; the usual Troll Bridge argument also only proves an either-or in a sense. That is: it only proves not-cross if we assume PA is consistent. However, most people seem to think PA is consistent, which does seem somewhat different from your argument. In any case, I think the second point stands?
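For concreteness, here is the payoff structure of the scenario we’re arguing about, as I understand it; only the −10 for an explosion is explicit above, so the +10 for a safe crossing, the 0 for staying put, and the shape of the “bad reason” test are assumptions of mine (a sketch, not your formalization).

```python
# Sketch of the Troll Bridge payoffs as I understand them (the +10 and 0 values,
# and the 'crossed_for_bad_reason' flag, are assumptions for illustration).

def troll_bridge_utility(crossed, crossed_for_bad_reason):
    """The troll detonates the bridge iff the agent crossed for a 'bad reason'
    (in the classic version: the agent crossed via the chicken rule, i.e. its
    proof system turned out to be inconsistent)."""
    if not crossed:
        return 0       # stay on this side of the river
    if crossed_for_bad_reason:
        return -10     # troll blows up the bridge
    return 10          # safe crossing
```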
However, I grant that your result would be unsettling in a broader sense. It would force me either to abandon my theory or to accept the conclusion that I should not cross a bridge with a troll under it, if the troll blows up the bridge when I’m mistaken about the consequences of crossing.
If I bought your proof I think I’d also buy your conclusion, namely that crossing the bridge does in fact blow up such an agent. So I’d then be comfortable accepting my own theory of counterfactuals (and resigning myself to never cross such bridges).
However, I don’t currently see how your proof goes through.
2. Suppose A = ′Cross′. Then if the agent crosses, it must be either because it used the chicken rule or because its counterfactual environment doesn’t line up with reality in this case. Either way, this is a bad reason for crossing, so the bridge gets blown up. Thus, the AI gets −10 utility.
I don’t get this step. How does the agent conclude that its counterfactual environment doesn’t line up with reality in this case? By supposition, it has proved that crossing is bad. But this does not (in itself) imply that crossing is bad. For counterfactuals not to line up with reality means that the counterfactuals are one way, and the reality is another. So presumably to show that crossing is bad, you first have to show that crossing gets −10 utility, correct? Or rather, the agent has to conclude this within the hypothetical. But here you are not able to conclude that without first having shown that crossing is bad (rather than only that the agent has concluded it is so). See what I mean? You seem to be implicitly stripping off the “⊢” from the first premise in your argument, in order to then justify explicitly doing so.
Perhaps you are implicitly assuming that ⊢ (A = ′Cross′ ⟹ U = −10) implies that the correct counterfactual utility of crossing is −10. But that is precisely what the original Troll Bridge argument disputes, by disputing proof-based decision theory.
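To spell out the gap I’m pointing at, here is my own rendering of the disputed step (not something taken from your write-up):

```latex
% My rendering of the disputed step. Inside the hypothetical, what is available is
% a provability fact, but what is needed is the un-boxed implication:
\[
  \underbrace{\vdash \big(A = \text{`Cross'} \Rightarrow U = -10\big)}_{\text{what the supposition gives}}
  \quad\stackrel{?}{\Longrightarrow}\quad
  \underbrace{\big(A = \text{`Cross'} \Rightarrow U = -10\big)}_{\text{what ``the counterfactual is wrong'' needs}}
\]
% Discharging the "?" requires a reflection/soundness step of the form
% "if it is provable, then it is true", and Löb's theorem is exactly the obstacle
% to an agent assuming that schema about its own proof system:
\[
  \vdash \big(\Box\varphi \rightarrow \varphi\big) \;\;\text{implies}\;\; \vdash \varphi
  \qquad\text{(L\"ob)}
\]
```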
Oh, I’m sorry; you’re right. I messed up on step two of my proposed proof that your technique would be vulnerable to the same problem.
However, it still seems to me that agents using your technique would also be concerningly likely to fail to cross, or would otherwise run into problems. Like last time, suppose ⊢ (A = ′Cross′ ⟹ U = −10) and that A = ′Cross′. So if the agent decides to cross, it’s either because of the chicken rule, because not crossing counterfactually results in utility ≤ −10, or because crossing counterfactually results in utility greater than −10.
If the agent crosses because of the chicken rule, then this is a bad reason, so the bridge will blow up.
I had already assumed that not crossing counterfactually results in utility greater than −10, so it can’t be the middle case.
Suppose instead that the crossing counterfactual results in utility greater than −10. This seems very strange. By assumption, it’s provable in the AI’s proof system that (A = ′Cross′ ⟹ U = −10). And the AI’s counterfactual environment is supposed to line up with reality.
So, in other words, the AI has decided to cross and has already proven that crossing entails it will get −10 utility. And if the counterfactual environment assigns greater than −10 utility, then that counterfactual environment provably, within the agent’s proof system, doesn’t line up with reality. So how do you get an AI to believe it will cross, believe crossing entails −10 utility, and still counterfactually think that crossing will result in greater than −10 utility?
In this situation, the AI can prove, within its own proof system, that the counterfactual environment of getting > −10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though the AI has already proved that it’s wrong. I’m concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, that would really make me question the agent’s ability to use counterfactual environments that line up with reality in other situations.
So it seems to me that for an agent using your take on counterfactuals to cross, it would need either to think that not crossing counterfactually results in utility ≤ −10, or to ignore conclusive evidence that the counterfactual environment it’s using for its chosen action does not in fact line up with reality. Both of these options seem rather concerning to me.
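To keep that case analysis straight, here is a small sketch of it in code; the predicate names and the way I’ve encoded the counterfactual values are stand-ins of mine, not an implementation of your proposal.

```python
# Encoding of the three-way case analysis above (my own stand-in names, not an
# implementation of the proposal). Premise: ⊢ (A = 'Cross' ⟹ U = −10) and the
# agent nonetheless crosses.

def classify_crossing_reason(used_chicken_rule, cf_utility_not_cross, cf_utility_cross):
    if used_chicken_rule:
        return "chicken rule: a bad reason, so the troll detonates the bridge"
    if cf_utility_not_cross <= -10:
        return "not crossing looked <= -10: ruled out by my earlier assumption"
    if cf_utility_cross > -10:
        return ("crossing counterfactually looked > -10 despite the proof that "
                "crossing entails -10: the contested case")
    return "no reason left to cross: the agent stays put"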
Also, even if you do decide to let the AI ignore conclusive evidence (to the AI) that crossing makes utility be −10, I’m concerned the bridge would get blown up anyway. I know we haven’t formalized “a bad reason”, but we’ve taken it to mean something like “something that seems like a bad reason to the AI”. If the AI wants its counterfactual environments to line up with reality, and it can clearly see that, for the action it decides to take, its counterfactual environment doesn’t line up with reality, then this seems like a “bad” reason to me.
Suppose instead that the crossing counterfactual results in utility greater than −10. This seems very strange. By assumption, it’s provable in the AI’s proof system that (A = ′Cross′ ⟹ U = −10). And the AI’s counterfactual environment is supposed to line up with reality.
Right. This is precisely the sacrifice I’m making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you’ll fall prey to the Troll Bridge! In fact, your “stop thinking”/”rollback” proposals have precisely the same feature: you’re trying to construct expectations which don’t respect the entailment.
So I think if you reject this, you just have to accept Troll Bridge.
In this situation, the AI can prove, within its own proof system, that the counterfactual environment of getting > −10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though the AI has already proved that it’s wrong. I’m concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, that would really make me question the agent’s ability to use counterfactual environments that line up with reality in other situations.
Well, this is precisely not what I mean when I say that the counterfactuals line up with reality. What I mean is that they should be empirically grounded, so, in cases where the condition is actually met, we see the predicted result.
Rather than saying this AI’s counterfactual expectations are “wrong in reality”, you should say they are “wrong in logic” or something like that. Otherwise you are sneaking in an assumption that (a) counterfactual scenarios are real, and (b) they really do respect entailment.
We can become confident in my strange counterfactual by virtue of having seen it play out many times, e.g., by crossing similar bridges many times. This is the meat of my take on counterfactuals: to learn them in a way that respects reality, rather than trying to deduce them; to impose empiricism on them, i.e., the requirement that they make accurate predictions in the cases we actually see.
And it simply is the case that if we prefer such empirical beliefs to logic, here, we can cross. So in this particular example, we see a sort of evidence that respecting entailment is a wrong principle for counterfactual expectations. The 5&10 problem can also be thought of as evidence against entailment as counterfactual.
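Here is a minimal sketch of how I picture such empirically learned counterfactual expectations (the class, names, and numbers are made up for illustration; this is a toy, not my actual construction): expectations come from observed outcomes of actually-taken actions, so a proof that crossing entails −10 never feeds into them.

```python
# Minimal sketch (toy illustration) of empirically grounded counterfactual
# expectations: learned from observed outcomes rather than deduced from proofs.

from collections import defaultdict

class LearnedCounterfactuals:
    def __init__(self):
        self.totals = defaultdict(float)   # sum of observed utility per action
        self.counts = defaultdict(int)     # number of observations per action

    def observe(self, action, utility):
        """Update on a case where the action was actually taken."""
        self.totals[action] += utility
        self.counts[action] += 1

    def expected_utility(self, action, prior=0.0):
        """Empirical expectation for the action; `prior` covers unseen actions."""
        if self.counts[action] == 0:
            return prior
        return self.totals[action] / self.counts[action]

cf = LearnedCounterfactuals()
for _ in range(20):          # many successful crossings of similar bridges
    cf.observe("cross", 10)
cf.observe("stay", 0)

# A proof that crossing entails -10 never enters these expectations, so the agent
# still expects crossing to be worth about +10 and crosses.
print(cf.expected_utility("cross"))   # 10.0
print(cf.expected_utility("stay"))    # 0.0
```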
Also, even if you do decide to let the AI ignore conclusive evidence (to the AI) that crossing makes utility be −10, I’m concerned the bridge would get blown up anyway. I know we haven’t formalized “a bad reason”, but we’ve taken it to mean something like “something that seems like a bad reason to the AI”. If the AI wants its counterfactual environments to line up with reality, and it can clearly see that, for the action it decides to take, its counterfactual environment doesn’t line up with reality, then this seems like a “bad” reason to me.
You have to realize that reasoning in this way amounts to insisting that the correct answer to Troll Bridge is not crossing, because the troll bridge variant you are proposing just punishes anyone whose reasoning differs from entailment. And again, you were also proposing a version of “ignore the conclusive evidence”. It’s just that on your theory, it is really evidence, so you have to figure out a way to ignore it. On my theory, it’s not really evidence, so we can update on such information and still cross.
But also, if the agent works as I describe, it will never actually see such a proof. So it isn’t so much that its counterfactuals actually disagree with entailment. It’s just that they can hypothetically disagree.