I’m not entirely sure what you consider to be a “bad” reason for crossing the bridge. However, I’m having a hard time finding a way to define it that both causes agents using evidential counterfactuals to necessarily fail while not having other agents fail.
One way to define a “bad” reason is an irrational one (or the chicken rule). However, if this is what is meant by a “bad” reason, it seems like this is an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.
To illustrate, consider what I would do if I was in the troll bridge situation and used evidential counterfactuals. Then I would reason, “I know the troll will only blow up the bridge if I cross for a bad reason, but I’m generally pretty reasonable, so I think I’ll do fine if I cross”. And then I’d stop thinking about it. I know that certain agents, given enough time to think about it, would end up not crossing, so I’d just make sure I didn’t do that.
Another way that you might have had in mind is that a “bad” reason is one where the action the AI takes results in a provably bad outcome despite the AI thinking the action would result in a good outcome, or where the reason is the chicken rule. However, if this is the case, it seems to me that no agent would be able to cross the bridge without it being blown up, unless the agent’s counterfactual environment in which it didn’t cross scored less than −10 utility. But this doesn’t seem like a very reasonable counterfactual environment.
To see why, consider an arbitrary agent with the following decision procedure. Let counterfactual be an arbitrary specification of what would happen in some counterfactual world.
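Roughly, I have in mind something like the following sketch, where expected_utility and counterfactual are just placeholders for however the agent scores and specifies counterfactual worlds:

```python
def choose_action(expected_utility, counterfactual):
    """A rough sketch of the kind of agent I mean: cross iff the crossing
    counterfactual looks better than the staying counterfactual. Both
    `counterfactual` and `expected_utility` are arbitrary placeholders,
    supplied by whatever theory of counterfactuals you like."""
    if expected_utility(counterfactual("A = Cross")) > expected_utility(counterfactual("A = Stay")):
        return "Cross"
    return "Stay"
```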
The chicken rule can be added, too, if you wish. I’ll assume the expected utility of staying is greater than −10.
Then it seems you can adapt the proof you gave for your agent to show that an arbitrary agent satisfying the above description would also get −10 utility if it crossed. Specifically,
Suppose ⊢(A=Cross⟹U=−10).
Suppose A=Cross.
Then the agent crossed either because of the chicken rule, or because the counterfactual environment in which the agent crossed had utility greater than −10, or because the counterfactual environment in which the agent didn’t cross had less than −10 utility. We assumed the counterfactual environment in which the agent doesn’t cross has more than −10 utility. Thus, it must be either the chicken rule or because crossing had more than −10 utility in expectation.
If it’s because of the chicken rule, then this is a “bad” reason, so, the troll will destroy the bridge just like in the original proof. Thus, utility would equal −10.
Suppose instead the agent crosses because expected_utility(counterfactual(A = Cross)) > -10. However, by the assumption, ⊢A=Cross⟹U=−10. Thus, since the agent actually crosses, this in fact provably results in −10 utility and the AI is thus wrong in thinking it would get a good outcome. Thus, the AI’s action results in provably bad outcomes. Therefore, the troll destroys the bridge. Thus, utility would equal −10.
Thus, A=Cross⟹U=−10.
Thus, (⊢(A=Cross⟹U=−10))⟹(A=Cross⟹U=−10); and since this reasoning can all be carried out within the agent’s proof system, ⊢((⊢(A=Cross⟹U=−10))⟹(A=Cross⟹U=−10)).
Thus, by Löb’s theorem, ⊢(A=Cross⟹U=−10); and so, by the implication above, A=Cross⟹U=−10.
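(For reference, the Löb step here has the standard shape, taking P to be the sentence A=Cross⟹U=−10:)

```latex
% Löb's theorem, in the form used above: writing \Box for provability in the
% agent's proof system and P for the sentence (A = Cross => U = -10),
%     if  \vdash (\Box P \to P),  then  \vdash P.
\vdash\,(\Box P \to P) \;\Longrightarrow\; \vdash P
```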
As I said, you could potentially avoid getting the bridge destroyed by assigning expected utility less than −10 to the counterfactual environment in which the AI doesn’t cross. This seems like a “silly” counterfactual environment, so it doesn’t seem like something we would want an AI to think. Also, since it seems like a silly thing to think, a troll may consider the use of such a counterfactual environment to be a bad reason to cross the bridge, and thus destroy it anyways.
Ok. This threw me for a loop briefly. It seems like I hadn’t considered your proposed definition of “bad reasoning” (ie “it’s bad if the agent crosses despite it being provably bad to do so”) -- or had forgotten about that case.
I’m not sure I endorse the idea of defining “bad” first and then considering the space of agents who pass/fail according to that notion of “bad”; how this is supposed to work is, rather, that we critique a particular decision theory by proposing a notion of “bad” tailored to that particular decision theory. For example, if a specific decision theorist thinks proofs are the way to evaluate possible actions, then “PA proves ⊥” will be a convincing notion of “bad reasoning” for that specific decision theorist.
If we define “bad reasoning” as “crossing when there is a proof that crossing is bad” in general, this begs the question of how to evaluate actions. Of course the troll will punish counterfactual reasoning which doesn’t line up with this principle, in that case. The only surprising thing in the proof, then, is that the troll also punishes reasoners whose counterfactuals respect proofs (EG, EDT).
If I had to make a stab at a generic notion of “bad”, it would be “the agent’s own way of evaluating consequences says that the consequences of its actions will be bad”. But this is pretty ambiguous in some cases, such as chicken rule. I think a more appropriate way to generally characterize “bad reasoning” is just to say that proponents of the decision theory in question should agree that it looks bad.
This is an open question, even for the examples I gave! I’ve been in discussions about Troll Bridge where proponents of proof-based DT (aka MUDT) argue that it makes perfect sense for the agent to think its action can control the consistency of PA in this case, so the reasoning isn’t “bad”, so the problem is unfair. I think it’s correct to identify this as the crux of the argument—whether I think the troll bridge argument incriminates proof-based DT almost entirely depends on whether I think it is unreasonable for the agent to think it controls whether PA proves ⊥ in this case.
To illustrate, consider what I would do if I was in the troll bridge situation and used evidential counterfactuals. Then I would reason, “I know the troll will only blow up the bridge if I cross for a bad reason, but I’m generally pretty reasonable, so I think I’ll do fine if I cross”. And then I’d stop thinking about it. I know that certain agents, given enough time to think about it, would end up not crossing, so I’d just make sure I didn’t do that.
But thinking more is usually a good idea, so, the agent might also decide to think more! What reasoning are you using to decide to stop? If it’s reasoning based on the troll bridge proof, you’re already in trouble.
You seem to be assuming that the agent’s architecture has solved the problem of logical updatelessness, IE, of applying reasoning only to the (precise) extent to which it is beneficial to do so. But this is one of the problems we would like to solve! So I object to the “stop thinking about it” step w/o more details of the decision theory which allows you to do so.
Simply put, this is not evidential reasoning. This is evidential reasoning within conveniently placed limits. Without a procedure for placing those limits correctly, it isn’t a decision theory to me; it’s a sketch of what might possibly be a decision theory (but clearly not EDT).
Furthermore, because the proof is pretty short, it makes it look kind of hard to limit reasoning in precisely the way you want. At least, we can’t do it simply by supposing that we only look at a few proofs before deciding whether we should or shouldn’t look more—it is quite possible that the “bad” proof is already present before you make the decision of whether to continue.
If we define “bad reasoning” as “crossing when there is a proof that crossing is bad” in general, this begs the question of how to evaluate actions. Of course the troll will punish counterfactual reasoning which doesn’t line up with this principle, in that case. The only surprising thing in the proof, then, is that the troll also punishes reasoners whose counterfactuals respect proofs (EG, EDT).
I’m concerned that you may not realize that your own current take on counterfactuals respects logical entailment to some extent, and that, if I’m reasoning correctly, this could result in agents that use it failing the troll bridge problem.
You said, in “My current take on counterfactuals”, that counterfactuals should line up with reality. That is, the action the agent actually takes should result in the utility it was said to have in its counterfactual environment.
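Concretely, I’m reading that condition as something like the following check (my paraphrase, with placeholder names, not your formalism):

```python
def lines_up_with_reality(chosen_action, counterfactual_expected_utility, realized_utility):
    """The counterfactual for the action the agent actually takes should
    match the utility the agent actually receives. Placeholder names;
    just a paraphrase of the 'line up with reality' condition."""
    return counterfactual_expected_utility(chosen_action) == realized_utility
```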
You say that a “bad reason” is one that the agent using the procedure would itself think is bad. The counterfactuals in your approach are supposed to line up with reality, so if an AI’s counterfactuals don’t line up with reality, then this seems like a “bad” reason according to the definition you gave. Now, if you let your agent think “I’ll get < −10 utility if I don’t cross”, then it could potentially cross and not get blown up. But this seems like a very unintuitive and seemingly ridiculous counterfactual environment. Because of this, I’m pretty worried that an AI with such counterfactual environments could malfunction somehow. So I’ll assume the AI doesn’t have such a counterfactual environment.
Suppose acting using a counterfactual environment that doesn’t line up with reality counts as a “bad” reason for agents using your counterfactuals. Also suppose that in the counterfactual environment in which the agent doesn’t cross, the agent counterfactually gets more than −10 utility. Then:
1. Suppose ⊢(A=′Cross′⟹U=−10).
2. Suppose A=′Cross′. Then if the agent crosses it must be because either it used the chicken rule or because its counterfactual environment doesn’t line up with reality in this case. Either way, this is a bad reason for crossing, so the bridge gets blown up. Thus, the AI gets −10 utility.
3. Thus, A=′Cross′⟹U=−10 (discharging the supposition in step 2), and so (⊢(A=′Cross′⟹U=−10))⟹(A=′Cross′⟹U=−10) (discharging the supposition in step 1). Since this reasoning can be carried out within the agent’s proof system, ⊢((⊢(A=′Cross′⟹U=−10))⟹(A=′Cross′⟹U=−10)).
4. Thus, by Löb’s theorem, ⊢(A=′Cross′⟹U=−10); and so, by the implication in step 3, A=′Cross′⟹U=−10.
Thus, either the agent doesn’t cross the bridge or it does and the bridge explodes. You might just decide to get around this by saying it’s okay for the agent to think it would get less than −10 utility if it didn’t cross. But I’m rather worried that this would cause other problems.
You seem to be assuming that the agent’s architecture has solved the problem of logical updatelessness, IE, of applying reasoning only to the (precise) extent to which it is beneficial to do so. But this is one of the problems we would like to solve! So I object to the “stop thinking about it” step w/o more details of the decision theory which allows you to do so.
I’ll talk about some ways I thought of potentially formalizing, “stop thinking if it’s bad”.
One simple way to try to do so is to have an agent that uses regular evidential decision theory but also has a special “stop thinking about this thing” action it can take. Every so often, the agent considers taking this action, evaluating it with regular evidential decision theory. So, in the troll bridge case, it could potentially see that the path of reasoning it’s following is dangerous, and thus decide to stop. Also, the agent needs to avoid thinking too many thoughts before considering the “stop thinking about this thing” action. Otherwise, it could think all sorts of problematic thoughts before being able to stop itself.
This simple technique might actually be enough to solve the problem, especially if the AI has the ability to choose its own inference algorithm to find one that makes the AI able to realize, “thinking about this is bad” before it finds the concrete bad thing. And, for what it’s worth, it’s enough for me personally to get across the bridge.
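To make the shape of this concrete, here’s a rough sketch (the agent interface and names are made up for illustration, not a worked-out decision theory):

```python
def deliberate(agent, max_steps):
    """Sketch: an evidential agent that periodically considers a special
    "stop thinking about this thing" action, scored with its ordinary
    expected-utility estimates. The `agent` interface is hypothetical."""
    for step in range(max_steps):
        thought = agent.next_thought()   # expand one step of reasoning
        agent.update_beliefs(thought)
        # Every so often, ask (with the agent's ordinary EDT-style estimate)
        # whether continuing down this line of reasoning looks good.
        if step % agent.check_interval == 0:
            if agent.expected_utility("stop thinking about this") > \
               agent.expected_utility("keep thinking about this"):
                agent.block_line_of_reasoning(thought)
                break
    return agent.best_action()
```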
Now, this might not work reliably, because the AI might stumble upon a problematic thought before realizing it’s bad. So, instead, consider this. The AI has an ability to “rollback” its belief state and then adjust its search path. For example, suppose the AI thinks through the Troll Bridge proof and then realizes that it really messed up in doing so. Then it could have a hard-coded way of rolling back its belief state to what it was before it thought about the troll bridge thing, marking the nodes it expanded as verboten, and then expanding some other node.
When should it roll back? Well, it shouldn’t roll back every time it comes across any evidence that decreases its expected utility. Doing so would result in the AI being very poorly calibrated, because it would ignore all negative evidence.
One option is to manually come up with a rollback method that rolls back whenever doing so increases expected utility, without resulting in poor calibration.
However, this sounds sort of hard. One cheap trick to get around it is to just give the AI a special action to modify its rollback method, and then to select the new rollback method using its regular decision procedure. For example, I think it’s clear, using evidential decision theory, that choosing the rollback method “roll back any thought that decreases my expected utility estimate” would result in low utility, because that would just mean the AI would delude itself. But I bet a rollback method that says, “roll back if you find any troll bridge proofs”, would work okay.
This trick might not be perfect, since the AI could potentially think about something problematic before getting a rollback procedure good enough to roll it back. But as long as the AI is smart enough to realize that it should try to get a really good rollback function before doing much of anything else, then I bet it would work okay.
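Here’s roughly the shape I’m imagining for the rollback mechanism (again, all names are made up for illustration):

```python
def reason_with_rollback(agent, rollback_criterion, max_steps):
    """Sketch of the rollback idea: snapshot the belief state before
    expanding a node; if the (possibly agent-chosen) rollback criterion
    flags the resulting thoughts as problematic -- e.g. "this looks like
    a troll bridge proof" -- restore the snapshot and mark the node as
    verboten. The `agent` interface here is hypothetical."""
    verboten = set()
    for _ in range(max_steps):
        snapshot = agent.copy_belief_state()
        node = agent.pick_node_to_expand(excluding=verboten)
        if node is None:
            break
        thoughts = agent.expand(node)
        agent.update_beliefs(thoughts)
        if rollback_criterion(thoughts):
            agent.restore_belief_state(snapshot)  # roll back the update
            verboten.add(node)                    # don't expand it again
    return agent.best_action()
```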
Also, don’t forget that we still need to do something about the agent-simulates-predictor problem. In the agent-simulates-predictor problem, agents are penalized for thinking about things in too much detail. And whatever counterfactual environment you use, you’ll need a way to deal with the agent-simulates-predictor problem. I think the most obvious approach is to control what the AI thinks about. And if you’ve already done that, then you can pass the troll bridge problem for free.
Also, I think it’s important to note that just the fact that the AI is trying to avoid thinking of crossing-is-bad proofs makes the proofs (potentially) not go through. For example, in the proof you originally gave, you supposed there is a proof that crossing results in −10 utility, and thus said the agent must have crossed because of the chicken rule. But if the AI is trying to avoid these sorts of “proofs”, then if it does cross, it could simply have been because the AI decided to avoid following whatever train of thought would prove that it would get −10 utility. This is considered a reasonable thing to do by the AI, so it doesn’t seem like a “bad” reason.
There may be alternative proofs that apply to an AI that tries to steer its reasoning away from problematic areas. I’m not sure, though. I also suspect that any such proofs would be more complicated and thus harder to find.
I’ll talk about some ways I thought of potentially formalizing, “stop thinking if it’s bad”.
If your point is that there are a lot of things to try, I readily accept this point, and do not mean to argue with it. I only intended to point out that, for your proposal to work, you would have to solve another hard problem.
One simple way to try to do so is to have an agent that uses regular evidential decision theory but also has a special “stop thinking about this thing” action it can take. Every so often, the agent considers taking this action, evaluating it with regular evidential decision theory. So, in the troll bridge case, it could potentially see that the path of reasoning it’s following is dangerous, and thus decide to stop. Also, the agent needs to avoid thinking too many thoughts before considering the “stop thinking about this thing” action. Otherwise, it could think all sorts of problematic thoughts before being able to stop itself.
This simple technique might actually be enough to solve the problem, especially if the AI has the ability to choose its own inference algorithm to find one that makes the AI able to realize, “thinking about this is bad” before it finds the concrete bad thing. And, for what it’s worth, it’s enough for me personally to get across the bridge.
Ordinary Bayesian EDT has to finish its computation (of its probabilistic expectations) in order to proceed. What you are suggesting is to halt those calculations midway. I think you are imagining an agent who can think longer to get better results. But vanilla EDT does not describe such an agent. So, you can’t start with EDT; you have to start with something else (such as logical induction EDT) which does already have a built-in notion of thinking longer.
Then, my concern is that we won’t have many guarantees for the performance of this system. True, it can stop thinking if it knows thinking will be harmful. However, if it mistakenly thinks a specific form of thought will be harmful, it has no device for correction.
This is concerning because we expect “early” thoughts to be bad—after all, you’ve got to spend a certain amount of time thinking before things converge to anything at all reasonable.
So we’re between a rock and a hard place here: we have to stop quite early, because we know the proof of troll bridge is small. But we can’t stop early, because we know things take a bit to converge.
So I think this proposal is just “somewhat-logically-updateless-DT”, which I don’t think is a good solution.
Generally I think rollback solutions are bad. (Several people have argued in their favor over the years; I find that I’m just never intrigued by that direction...) Some specific remarks:
Note that if you literally just roll back, you would go forward the same way again. So you need to somehow modify the rolled-back state, creating a “pseudo-ignorant” belief state where you’re not really uninformed, but rather reconstruct something merely similar to an uninformed state.
It is my impression that this causes problems.
You might be able to deduce the dangerous info from the fact that you have gone down a different reasoning path.
I would thus argue that the rollback criterion has to use only info available before rolling back to decide whether to roll back; otherwise, it introduces new and potentially dangerous info to the earlier state.
But this means you’re just back to the non-rollback proposal, where you decide to stop reasoning at some point.
Or, if you solve that problem, rollbacks may leave you truly ignorant, but not solve the original problem which you rolled back to solve.
For example, suppose that Omega manipulates you as follows: reward [some target action] and punish [any other action], but only in the case that you realize Omega implements this incentive scheme. If you never realize the possibility, then Omega leaves you alone.
If you realize that Omega is doing this, and then roll back, you can easily end up in the worst possible world: you don’t realize how to get the reward, so you just go about your business, but Omega punishes you for it anyway.
For this reason, I think rollbacks belong in a category I’d call “phony updatelessness”, where you’re basically trying to fool Omega by somehow approximating an updateless state, but also sneakily taking some advantages from updatefulness. This can work, of course, against a naive Omega; but it doesn’t seem like it really gets at the heart of the problem.
Simply put, I think rolled-back states are “contaminated” in some sense; you’re trying to get them clean, but the future reasoning has left a permanent stain.
You say that a “bad reason” is one that the agent using the procedure would itself think is bad.
To elaborate a little, one way we could think about this would be that “in a broad variety of situations” the agent would think this property sounded pretty bad.
For example, the hypothetical “PA proves ⊥” would be evaluated as pretty bad by a proof-based agent, in many situations; it would not expect its future self to make decisions well, so, it would often have pretty poor performance bounds for its future self (eg the lowest utility available in the given scenario).
So far so good—your condition seems like one which a counterfactual reasoner would broadly find concerning.
It also passes the sniff test of “would I think the agent is being dumb if it didn’t cross for this reason?” The fact that there’s a troll waiting to blow up a bridge if I’m empirically incorrect about that very setup should not, in itself, make me too reluctant to cross a bridge. If I’m very confident that the situation is indeed as described, then intuitively, I should confidently cross.
But it seems that, if I believe your proof, I would not believe this any more. You don’t prove whether the agent crosses or not, but you do claim to prove that if the agent crosses, it in fact gets blown up. It seems you think the correct counterfactual (for such an agent) is indeed that it would get blown up if it crosses:
Thus, either the agent doesn’t cross the bridge or it does and the bridge explodes.
So if the proof is to be believed, it seems like the philosophical argument falls flat? If the agent fails to cross for this reason, then it seems you think it is reasoning correctly. If it crosses and explodes, then it fails because it had wrong counterfactuals. This also does not seem like much of an indictment of how it was reasoning—garbage in, garbage out. We can concern ourselves with achieving more robust reasoners, for sure, so that sometimes garbage in → treasure out. But that’s a far cry from the usual troll bridge argument, where the agent has a 100% correct description of the situation, and nonetheless, appears to mishandle it.
To summarize:
The usual troll bridge argument proves that the agent doesn’t cross. You fail to do so. This makes your argument less convincing, because we don’t actually see that the agent engages in the weird behavior.
The usual troll bridge argument establishes a case where we intuitively disagree with the way the agent reasons for not crossing. You’ve agreed with such reasoning in the end.
EDIT: I think I should withdraw the first point here; the usual Troll Bridge argument also only proves an either-or in a sense. That is: it only proves not-cross if we assume PA is consistent. However, most people seem to think PA is consistent, which does seem somewhat different from your argument. In any case, I think the second point stands?
However, I grant that your result would be unsettling in a broader sense. It would force me to either abandon my theory, or, accept the conclusion that I should not cross a bridge with a troll under it if the troll blows up the bridge when I’m mistaken about the consequences of crossing.
If I bought your proof I think I’d also buy your conclusion, namely that crossing the bridge does in fact blow up such an agent. So I’d then be comfortable accepting my own theory of counterfactuals (and resigning myself to never cross such bridges).
However, I don’t currently see how your proof goes through.
2. Suppose A=′Cross′. Then if the agent crosses it must be because either it used the chicken rule or because its counterfactual environment doesn’t line up with reality in this case. Either way, this is a bad reason for crossing, so the bridge gets blown up. Thus, the AI gets −10 utility.
I don’t get this step. How does the agent conclude that its counterfactual environment doesn’t line up with reality in this case? By supposition, it has proved that crossing is bad. But this does not (in itself) imply that crossing is bad. For counterfactuals not to line up with reality means that the counterfactuals are one way, and the reality is another. So presumably to show that crossing is bad, you first have to show that crossing gets −10 utility, correct? Or rather, the agent has to conclude this within the hypothetical. But here you are not able to conclude that without first having shown that crossing is bad (rather than only that the agent has concluded it is so). See what I mean? You seem to be implicitly stripping off the “⊢” from the first premise in your argument, in order to then justify explicitly doing so.
Perhaps you are implicitly assuming that ⊢A=′Cross′⟹U=−10 implies that the correct counterfactual on crossing is −10. But that is precisely what the original Troll Bridge argument disputes, by disputing proof-based decision theory.
Oh, I’m sorry; you’re right. I messed up on step two of my proposed proof that your technique would be vulnerable to the same problem.
However, it still seems to me that agents using your technique would also be concerningly likely to fail to cross, or would otherwise suffer from other problems. Like last time, suppose ⊢(A=′Cross′⟹U=−10) and that A=′Cross′. So if the agent decides to cross, it’s either because of the chicken rule, because not crossing counterfactually results in utility ≤ −10, or because crossing counterfactually results in utility greater than −10.
If the agent crosses because of the chicken rule, then this is a bad reason, so the bridge will blow up.
I had already assumed that not crossing counterfactually results in utility greater than −10, so it can’t be the middle case.
Suppose instead that the crossing counterfactual results in utility greater than −10. This seems very strange. By assumption, it’s provable using the AI’s proof system that (A=′Cross′⟹U=−10). And the AI’s counterfactual environment is supposed to line up with reality.
So, in other words, the AI has decided to cross and has already proven that crossing entails it will get −10 utility. And if the counterfactual environment assigns greater than −10 utility, then that counterfactual environment provably, within the agent’s proof system, doesn’t line up with reality. So how do you get an AI to both believe it will cross, believe crossing entails −10 utility, and still counterfactually think that crossing will result in greater than −10 utility?
In this situation, the AI can prove, within its own proof system, that the counterfactual environment of getting > −10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though the AI already proved that it’s wrong. I’m concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, then that would really make me question that agent’s ability to use counterfactual environments that line up with reality in other situations.
So it seems to me that for an agent using your take on counterfactuals to cross, it would need to either think that not crossing counterfactually results in utility ≤−10, or to ignore conclusive evidence that the counterfactual environment it’s using for its chosen action would in fact not line up with reality. Both of these options seem rather concerning to me.
Also, even if you do decide to let the AI ignore conclusive evidence (to the AI) that crossing makes utility be −10, I’m concerned the bridge would get blown up anyways. I know we haven’t formalized “a bad reason”, but we’ve taken it to mean something like, “something that seems like a bad reason to the AI”. If the AI wants its counterfactual environments to line up with reality, and it can clearly see that, for the action it decides to take, it doesn’t line up with reality, then this seems like a “bad” reason to me.
Suppose instead that the crossing counterfactual results in utility greater than −10. This seems very strange. By assumption, it’s provable using the AI’s proof system that (A=′Cross′⟹U=−10). And the AI’s counterfactual environment is supposed to line up with reality.
Right. This is precisely the sacrifice I’m making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you’ll fall prey to the Troll Bridge! In fact, your “stop thinking”/”rollback” proposals have precisely the same feature: you’re trying to construct expectations which don’t respect the entailment.
So I think if you reject this, you just have to accept Troll Bridge.
In this situation, the AI can prove, within its own proof system, that the counterfactual environment of getting > −10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though the AI already proved that it’s wrong. I’m concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, then that would really make me question that agent’s ability to use counterfactual environments that line up with reality in other situations.
Well, this is precisely not what I mean when I say that the counterfactuals line up with reality. What I mean is that they should be empirically grounded, so, in cases where the condition is actually met, we see the predicted result.
Rather than saying this AI’s counterfactual expectations are “wrong in reality”, you should say they are “wrong in logic” or something like that. Otherwise you are sneaking in an assumption that (a) counterfactual scenarios are real, and (b) they really do respect entailment.
We can become confident in my strange counterfactual by virtue of having seen it play out many times, eg, crossing similar bridges many times. This is the meat of my take on counterfactuals: to learn them in a way that respects reality, rather than trying to deduce them. To impose empiricism on them, ie, the idea that they must make accurate predictions in the cases we actually see.
And it simply is the case that if we prefer such empirical beliefs to logic, here, we can cross. So in this particular example, we see a sort of evidence that respecting entailment is a wrong principle for counterfactual expectations. The 5&10 problem can also be thought of as evidence against entailment as counterfactual.
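As a toy illustration of what “empirically grounded” means here (a deliberately crude frequency-based stand-in, not the actual proposal):

```python
from collections import defaultdict

class EmpiricalCounterfactuals:
    """Toy stand-in: counterfactual expectations for each action are
    learned from the utilities actually observed when that action was
    actually taken, rather than deduced from entailment."""

    def __init__(self, prior_estimate=0.0):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)
        self.prior_estimate = prior_estimate

    def observe(self, action, utility):
        # Ground the expectation in what actually happened.
        self.totals[action] += utility
        self.counts[action] += 1

    def expected_utility(self, action):
        if self.counts[action] == 0:
            return self.prior_estimate
        return self.totals[action] / self.counts[action]
```

On this picture, an agent that has crossed many similar bridges and come out fine keeps a positive expectation for crossing, even when a proof-shaped argument claims otherwise.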
Also, even if you do decide to let the AI ignore conclusive evidence (to the AI) that crossing makes utility be −10, I’m concerned the bridge would get blown up anyways. I know we haven’t formalized “a bad reason”, but we’ve taken it to mean something like, “something that seems like a bad reason to the AI”. If the AI wants its counterfactual environments to line up with reality, and it can clearly see that, for the action it decides to take, it doesn’t line up with reality, then this seems like a “bad” reason to me.
You have to realize that reasoning in this way amounts to insisting that the correct answer to Troll Bridge is not crossing, because the troll bridge variant you are proposing just punishes anyone whose reasoning differs from entailment. And again, you were also proposing a version of “ignore the conclusive evidence”. It’s just that on your theory, it is really evidence, so you have to figure out a way to ignore it. On my theory, it’s not really evidence, so we can update on such information and still cross.
But also, if the agent works as I describe, it will never actually see such a proof. So it isn’t so much that its counterfactuals actually disagree with entailment. It’s just that they can hypothetically disagree.