Suppose instead the crossing counterfactual results in a utility greater than −10. This seems very strange. By assumption, it's provable in the AI's proof system that (A = 'Cross' ⟹ U = −10). And the AI's counterfactual environment is supposed to line up with reality.
Right. This is precisely the sacrifice I’m making in order to solve Troll Bridge. Something like this seems to be necessary for any solution, because we already know that if your expectations of consequences entirely respect entailment, you’ll fall prey to the Troll Bridge! In fact, your “stop thinking”/“rollback” proposals have precisely the same feature: you’re trying to construct expectations which don’t respect the entailment.
So I think if you reject this, you just have to accept Troll Bridge.
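For concreteness, the Löbian argument I have in mind runs roughly as follows. This is only a sketch of the standard Troll Bridge reasoning, using the −10 utility from above; it assumes a proof-based agent, and a troll who blows up the bridge exactly when the agent crosses while its proof system is inconsistent.

```latex
% Sketch only: the standard Löbian Troll Bridge argument, not a quote from this dialogue.
% Assumed setup: the agent crosses only if it cannot prove that crossing yields -10,
% and the troll blows up the bridge exactly when the agent crosses while its proof
% system is inconsistent (the canonical "bad reason").
% Requires amsmath and amssymb (for align* and \Box).
\begin{align*}
&\text{Reason under the assumption } \Box\,(A{=}\text{Cross} \Rightarrow U{=}{-}10):\\
&\qquad \text{if the agent crosses anyway, it has acted against its own proof, so its}\\
&\qquad \text{proof system is inconsistent, the troll blows up the bridge, and } U = -10.\\
&\text{Discharging the assumption: } \vdash \Box\,(A{=}\text{Cross} \Rightarrow U{=}{-}10) \Rightarrow (A{=}\text{Cross} \Rightarrow U{=}{-}10).\\
&\text{By L\"ob's theorem: } \vdash A{=}\text{Cross} \Rightarrow U{=}{-}10.\\
&\text{So an agent whose counterfactual expectations respect this entailment refuses to cross.}
\end{align*}
```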
In this situation, the AI can prove, within its own proof system, that the counterfactual environment in which crossing yields more than −10 utility is wrong. So I guess we need an agent that allows itself to use a certain counterfactual environment even though it has already proved that environment wrong. I’m concerned about the functionality of such an agent. If it already ignores clear evidence that its counterfactual environment is wrong in reality, then that would really make me question the agent’s ability to use counterfactual environments that line up with reality in other situations.
Well, this is precisely not what I mean when I say that the counterfactuals line up with reality. What I mean is that they should be empirically grounded, so, in cases where the condition is actually met, we see the predicted result.
Rather than saying this AI’s counterfactual expectations are “wrong in reality”, you should say they are “wrong in logic” or something like that. Otherwise you are sneaking in an assumption that (a) counterfactual scenarios are real, and (b) they really do respect entailment.
We can become confident in my strange counterfactual by virtue of having seen it play out many times, e.g., by crossing similar bridges many times. This is the meat of my take on counterfactuals: to learn them in a way that respects reality, rather than trying to deduce them; to impose empiricism on them, i.e., the requirement that they make accurate predictions in the cases we actually see.
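To make that concrete, here is a toy sketch, purely illustrative (the class, the method names, and the payoffs are all made up for this example, not part of any real proposal): the counterfactual expectation for an action is just an average of the utilities actually observed when taking that action, so no chain of entailments can move the estimate away from what experience shows.

```python
from collections import defaultdict

class EmpiricalCounterfactuals:
    """Toy model: counterfactual expectations learned from observed outcomes,
    not deduced from a proof system. (Illustrative sketch only.)"""

    def __init__(self, prior_utility=0.0, prior_weight=1.0):
        # Start from a weak prior so unexplored actions still have an estimate.
        self.totals = defaultdict(lambda: prior_utility * prior_weight)
        self.counts = defaultdict(lambda: prior_weight)

    def observe(self, action, utility):
        # Empirical grounding: only cases we actually see update the estimate.
        self.totals[action] += utility
        self.counts[action] += 1

    def expected_utility(self, action):
        return self.totals[action] / self.counts[action]

    def choose(self, actions):
        # Pick the action with the best empirically learned expectation.
        return max(actions, key=self.expected_utility)


# Usage: after many uneventful crossings of similar bridges, the learned
# expectation for "cross" stays near the observed payoff, regardless of what
# the proof system claims crossing entails about U.
agent = EmpiricalCounterfactuals()
for _ in range(100):
    agent.observe("cross", 10)      # hypothetical observed payoff for crossing
agent.observe("don't cross", 0)
print(agent.choose(["cross", "don't cross"]))  # -> "cross"
```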
And it simply is the case that if we prefer such empirical beliefs to logic, here, we can cross. So in this particular example, we see a sort of evidence that respecting entailment is the wrong principle for counterfactual expectations. The 5&10 problem can also be thought of as evidence against using entailment as the counterfactual.
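For readers who haven’t seen it, the 5&10 problem I’m gesturing at is roughly the following (again only a sketch of the usual presentation; the exact payoffs are conventional): an agent choosing between a $5 bill and a $10 bill can talk itself out of the $10 if it treats provable material implications as its counterfactuals.

```latex
% Sketch of the 5&10 problem (usual presentation; details vary by formalization).
% The "spurious proof" step is where entailment-as-counterfactual bites. Requires amsmath.
\begin{align*}
&\text{The agent takes whichever action } a \text{ has the highest provable } (A{=}a \Rightarrow U{=}u).\\
&\text{If the agent in fact takes the \$5, then } A{=}\text{take10} \text{ is false, so the material implication}\\
&\qquad (A{=}\text{take10} \Rightarrow U{=}0) \text{ is true, and provable given a proof of } A{=}\text{take5}.\\
&\text{Combined with } \vdash (A{=}\text{take5} \Rightarrow U{=}5), \text{ this spurious proof makes the \$5 look best,}\\
&\qquad \text{and the loop is self-consistent (a L\"obian fixed point), so it can actually occur.}\\
&\text{Treating entailment as the counterfactual thus licenses passing up the \$10.}
\end{align*}
```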
Also, even if you do decide to let the AI ignore what is (to the AI) conclusive evidence that crossing makes utility −10, I’m concerned the bridge would get blown up anyway. I know we haven’t formalized “a bad reason”, but we’ve taken it to mean something like “something that seems like a bad reason to the AI”. If the AI wants its counterfactual environments to line up with reality, and it can clearly see that, for the action it decides to take, its counterfactual environment doesn’t line up with reality, then this seems like a “bad” reason to me.
You have to realize that reasoning in this way amounts to insisting that the correct answer to Troll Bridge is not crossing, because the Troll Bridge variant you are proposing just punishes anyone whose reasoning differs from entailment. And again, you were also proposing a version of “ignore the conclusive evidence”. It’s just that on your theory it really is evidence, so you have to figure out a way to ignore it. On my theory, it’s not really evidence, so we can update on such information and still cross.
But also, if the agent works as I describe, it will never actually see such a proof. So it isn’t so much that its counterfactuals actually disagree with entailment. It’s just that they can hypothetically disagree.