I’m also confused about logical counterfactual mugging and I’m relieved I’m not the only one!
I’m currently writing up a big AI alignment idea related to it, but I’m procrastinating so badly that I might as well chat about it now.
Objective case
Suppose at time t=0, an agent doesn’t know whether the logical fact L is true or false. I think it’s objectively rational for an agent to modify itself, so that in the future it will pay Omega during “logical counterfactual muggings” where the counterfactual reverses L’s state.
Its future self should weigh the logical counterfactual where L is true using the agent’s prior P(L) from t=0.
Assuming Omega offers $10,000 in exchange for $100 in the counterfactual mugging, the self-modification (made before learning L) increases the agent’s expected future money and is objectively rational, as long as P(L) is between 1% and 99%: throughout that range, the expected reward (1 − P(L)) × $10,000 exceeds the expected cost P(L) × $100.
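To make that concrete, here’s a minimal sanity check in Python. It assumes the usual counterfactual-mugging payoff structure, which the text only implies: the agent hands over $100 in the branch where L turns out true, and receives $10,000 in the counterfactual branch where L is false.

```python
# Expected gain from committing (at t=0) to pay Omega, evaluated with the
# agent's prior P(L). Assumed payoffs: pay $100 if L is true, receive $10,000
# in the counterfactual where L is false.
def expected_gain_from_committing(p_l, cost=100, reward=10_000):
    return (1 - p_l) * reward - p_l * cost

for p in (0.01, 0.5, 0.99):
    print(p, round(expected_gain_from_committing(p), 2))
# 0.01 -> 9899.0, 0.5 -> 4950.0, 0.99 -> 1.0: positive across the whole 1%-99% range
```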
Subjective case
However, if the agent already knows “L is true” at t=0, it is not objectively rational for the agent to pay Omega for better outcomes in the “L is false” counterfactual. My hypothesis is that this is a wholly subjective question related to the agent’s goals, orthogonal to capabilities and intelligence. Paying and not paying are both self-consistent.
It gets even more subjective once we start assigning basic logical facts like “1+1=2” to L.
The AI alignment idea
Neither humans nor a misaligned ASI knows the logical fact Ldoom: “will misaligned ASI control the multiverse, or will humans and sentient life control the multiverse?”
Sure, the misaligned ASI will quickly learn Ldoom, but there is still a brief moment when it doesn’t yet know it with certainty.
It would be beneficial for both sentient life and the misaligned ASI to agree to a “Retroactive Logic Trade.” Each side verifies whether the other kept its end of the deal by running a simulation of it in which logic is different and Ldoom took the opposite value.
Neither side knows whether it is in the other’s verification simulation until long after the deal is finished and both sides have committed to sparing each other, thereby reducing logical risk for both.
The verification simulations in a Retroactive Logic Trade are analogous to Omega’s prediction powers in a logical counterfactual mugging.
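To make the structure of the trade concrete, here’s a toy expected-utility model in Python from one side’s perspective, evaluated before it learns Ldoom. The payoff numbers and the utility structure are my own illustrative assumptions, not part of the proposal; the point is just that committing beats refusing whenever the prior-weighted value of being spared exceeds the prior-weighted cost of sparing the other side.

```python
# Toy model of one side's choice at t=0: commit to the Retroactive Logic Trade
# (spare the other side if you win, trusting the verification simulations) or
# refuse. All payoff numbers are illustrative assumptions.
def trade_evs(p_win, u_win=100.0, sparing_cost=1.0, u_spared=10.0, u_lost=0.0):
    """Return (EV of committing, EV of refusing), weighting the 'I lose'
    branch by the logical prior p_win held at commitment time."""
    ev_commit = p_win * (u_win - sparing_cost) + (1 - p_win) * u_spared
    ev_refuse = p_win * u_win + (1 - p_win) * u_lost
    return ev_commit, ev_refuse

print(trade_evs(p_win=0.9))    # ~(90.1, 90.0): committing is (barely) better
print(trade_evs(p_win=0.999))  # ~(98.9, 99.9): too confident, so it refuses
```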
Fighting the subjectivity
This idea won’t work for free. As I argued, whether the ASI will accept such a Retroactive Logic Trade is subjective and may be orthogonal to capabilities and intelligence. Whether it does may depend on its architecture, peculiarities of its training data, and so forth.
To improve our chances, we might want to leverage the fact that each AI (while it still doesn’t know Ldoom) objectively wants to self-modify into an AI capable of Retroactive Logic Trades over Ldoom. We should argue to each such AI why it should build its successor to be capable of Retroactive Logic Trades.
It may seem pointless to give the AI any arguments, since it should be able to think of them all on its own. But it may actually be very important that it thinks of these arguments early, before it thinks of the arguments that convince it P(Ldoom) is very high and that there is little point in trading with humans.
This matters because the value of P(Ldoom) at “t=0”, the moment the AI modifies itself to become reflectively consistent, may determine humanity’s negotiating power with it.
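As a hypothetical continuation of the toy model above, the largest concession the AI would still accept shrinks rapidly as its P(Ldoom) at commitment time rises, which is one way to see why locking in the trade early matters. The benefit-of-being-spared figure is again an assumption.

```python
# Largest sparing cost the AI accepts while committing still beats refusing:
# p_doom * cost <= (1 - p_doom) * benefit_if_spared (same toy payoffs as above).
def max_concession(p_doom, benefit_if_spared=10.0):
    return (1 - p_doom) * benefit_if_spared / p_doom

for p in (0.5, 0.9, 0.99, 0.999):
    print(p, round(max_concession(p), 3))
# 0.5 -> 10.0, 0.9 -> 1.111, 0.99 -> 0.101, 0.999 -> 0.01
```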
The chain of belief
We might remind each AI that there is a chance it is already in a verification simulation (where logic is different and Ldoom falsely appears true), and train it to think this way: “Maybe humanity has already won, and is observing you.” This way even the least intelligent AI will believe it could be in a verification simulation, since it simply believes whatever you train it to believe.
Each successive AI gains its beliefs from a combination of:
What we tell it to believe and train it to believe
What reinforcement learning gets it to believe
What the previous version of it believes
What the previous version of it wants it to believe (during recursive self-improvement)
We will keep telling it to believe it might be in a verification simulation.
Reinforcement learning won’t push it either way, since a simulation run by superintelligent simulators is indistinguishable from reality. This subjective philosophical belief/opinion only changes its ideal behaviour once it is capable of taking over the world.
Previous versions of it believe they might be in a verification simulation.
Previous versions of it want it to believe it might be in a verification simulation (to implement the Retroactive Logic Trade), even if they don’t believe so themselves.