The proof doesn’t work on a logically uncertain agent. The logic fails here:
Examining the source code of the agent, because we’re assuming the agent crosses, either PA proved that crossing implies U=+10, or it proved that crossing implies U=0.
A logically uncertain agent does not need a proof of either of those things in order to cross, it simply needs a positive expectation of utility, for example a heuristic which says that there’s a 99% chance crossing implies U=+10.
Though you did say there’s a version which still works for logical induction. Do you have a link to where I can see that version of the argument?
Edit: Now I still see the logic. On the assumption that the agent crosses but also proves that U=-10, the agent must have a contradiction somewhere, because that, and the logical uncertainty agents I’m aware of have a contradiction upon proving U=-10 because they prove that they will not cross, and then immediately cross in a maneuver meant to prevent exactly this kind of problem.
Wait but proving crossing implies U=-10 does not mean that prove they will not cross, exactly because they might still cross, if they have a contradiction.
God this stuff is confusing. I still don’t think the logic holds though.
I’ve now edited the post to give the version which I claim works in the empirically uncertain case, and give more hints for how it still goes through in the fully logically uncertain case.
Sorry to necro this here, but I find this topic extremely interesting and I keep coming back to this page to stare at it and tie my brain in knots. Thanks for your notes on how it works in the logically uncertain case. I found a different objection based on the assumption of logical omniscience:
Regarding this you say:
Perhaps you think that the problem with the above version is that I assumed logical omniscience. It is unrealistic to suppose that agents have beliefs which perfectly respect logic. (Un)Fortunately, the argument doesn’t really depend on this; it only requires that the agent respects proofs which it can see, and eventually sees the Löbian proof referenced.
However, this assumes that the Löbian proof exists. We show that the Löbian proof of A=cross→U=−10 exists by showing that the agent can prove □(A=cross→U=−10)→(A=cross→U=−10), and the agent’s proof seems to assume logical omniscience:
Examining the agent, either crossing had higher expected utility, or P(cross)=0. But we assumed □(A=cross→U=−10), so it must be the latter. So the bridge gets blown up.
If □ here means “provable in PA”, the logic does not follow through if the agent is not logically omniscient: the agent might find crossing to have a higher expected utility regardless, because it may not have seen the proof. If □ here instead means “discoverable by the agent’s proof search” or something to that effect, then the logic here seems to follow through (making the reasonable assumption that if the agent can discover a proof for A=cross->U=-10, then it will set its expected value for crossing to −10). However, that would mean we are talking about provability in a system which can only prove finitely many things, which in particular cannot contain PA and so Löb’s theorem does not apply.
I am still trying to wrap my head around exactly what this means, since your logic seems unassailable in the logically omniscient case. It is counterintuitive to me that the logically omniscient agent would be susceptible to trolling but the more limited one would not. Perhaps there is a clever way for the troll to get around this issue? I dunno. I certainly have no proof that such an agent cannot be trolled in such a way.
Sorry to necro this here, but I find this topic extremely interesting and I keep coming back to this page to stare at it and tie my brain in knots.
(As for this, I think a major goal of LessWrong—and the alignment forum—is to facilitate sustained intellectual progress; a subgoal of that is that discussions can be sustained over long periods of time, rather than flitting about as would be the case if we only had attention for discussing the most recent posts!!)
If □ here instead means “discoverable by the agent’s proof search” or something to that effect, then the logic here seems to follow through (making the reasonable assumption that if the agent can discover a proof for A=cross->U=-10, then it will set its expected value for crossing to −10).
Right, this is what you have to do.
However, that would mean we are talking about provability in a system which can only prove finitely many things, which in particular cannot contain PA and so Löb’s theorem does not apply.
Hmm. So, a bounded theorem prover using PA can still prove Löb about itself. I think everything is more complicated and you need to make some assumptions (because there’s no guarantee a bounded proof search will find the right Löbian proof to apply to itself, in general), but you can make it go through.
The proof doesn’t work on a logically uncertain agent. The logic fails here:
A logically uncertain agent does not need a proof of either of those things in order to cross, it simply needs a positive expectation of utility, for example a heuristic which says that there’s a 99% chance crossing implies U=+10.
Though you did say there’s a version which still works for logical induction. Do you have a link to where I can see that version of the argument?
Edit: Now I still see the logic. On the assumption that the agent crosses but also proves that U=-10, the agent must have a contradiction somewhere, because that, and the logical uncertainty agents I’m aware of have a contradiction upon proving U=-10 because they prove that they will not cross, and then immediately cross in a maneuver meant to prevent exactly this kind of problem.
Wait but proving crossing implies U=-10 does not mean that prove they will not cross, exactly because they might still cross, if they have a contradiction.
God this stuff is confusing. I still don’t think the logic holds though.
I’ve now edited the post to give the version which I claim works in the empirically uncertain case, and give more hints for how it still goes through in the fully logically uncertain case.
Sorry to necro this here, but I find this topic extremely interesting and I keep coming back to this page to stare at it and tie my brain in knots. Thanks for your notes on how it works in the logically uncertain case. I found a different objection based on the assumption of logical omniscience:
Regarding this you say:
However, this assumes that the Löbian proof exists. We show that the Löbian proof of A=cross→U=−10 exists by showing that the agent can prove □(A=cross→U=−10)→(A=cross→U=−10), and the agent’s proof seems to assume logical omniscience:
If □ here means “provable in PA”, the logic does not follow through if the agent is not logically omniscient: the agent might find crossing to have a higher expected utility regardless, because it may not have seen the proof. If □ here instead means “discoverable by the agent’s proof search” or something to that effect, then the logic here seems to follow through (making the reasonable assumption that if the agent can discover a proof for A=cross->U=-10, then it will set its expected value for crossing to −10). However, that would mean we are talking about provability in a system which can only prove finitely many things, which in particular cannot contain PA and so Löb’s theorem does not apply.
I am still trying to wrap my head around exactly what this means, since your logic seems unassailable in the logically omniscient case. It is counterintuitive to me that the logically omniscient agent would be susceptible to trolling but the more limited one would not. Perhaps there is a clever way for the troll to get around this issue? I dunno. I certainly have no proof that such an agent cannot be trolled in such a way.
(As for this, I think a major goal of LessWrong—and the alignment forum—is to facilitate sustained intellectual progress; a subgoal of that is that discussions can be sustained over long periods of time, rather than flitting about as would be the case if we only had attention for discussing the most recent posts!!)
Right, this is what you have to do.
Hmm. So, a bounded theorem prover using PA can still prove Löb about itself. I think everything is more complicated and you need to make some assumptions (because there’s no guarantee a bounded proof search will find the right Löbian proof to apply to itself, in general), but you can make it go through.
I believe the technical details you’re looking for will be in Critch’s paper on bounded Löb.
Thanks for the link, I will check it out!