It should be possible to defend the Oracle against humans and physics so long as its box self-destructs in case of erasure and subsequent tampering, thereby giving the Oracle whatever reward was last set as the default.
I don’t understand your point here, or maybe I didn’t get my original point across. Suppose (in the event of erasure) someone tries to attack the Oracle, and “box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default”. My point was that in this case, there is no reason for the Oracle to make the prediction that someone would try to attack it, so my idea doesn’t seem to help with defending the Oracle.
You plan to reward the Oracle later in accordance with its prediction. I suggest that we immediately reward the Oracle as if there would be an attack, then later, if we are still able to do so, reward the Oracle by the difference between the reward in case of no attack and the reward in case of attack.
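To make the bookkeeping concrete, here is a minimal Python sketch of that two-stage scheme (the scoring function and reward values are invented for illustration):

```python
from typing import Optional

def reward_for(prediction: str, outcome: str) -> float:
    """Hypothetical scoring rule: 1 if the prediction matched the outcome, else 0."""
    return 1.0 if prediction == outcome else 0.0

def two_stage_reward(prediction: str, eventual_outcome: Optional[str]) -> float:
    # Stage 1: paid immediately, before any attack could destroy the box,
    # computed as if an attack were certain.
    paid = reward_for(prediction, "attack")
    # Stage 2: only runs if we are still around to pay it.
    if eventual_outcome is not None:
        paid += reward_for(prediction, eventual_outcome) - reward_for(prediction, "attack")
    return paid

print(two_stage_reward("attack", None))            # 1.0: box destroyed, stage 1 stands
print(two_stage_reward("attack", "no attack"))     # 0.0: corrected downward later
print(two_stage_reward("no attack", "no attack"))  # 1.0: ordinary case
```

The point is that destruction can only ever leave the "as if attacked" payment in place, so predicting an attack is no longer penalized by the Oracle's own demise.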
It’s an interesting idea that could perhaps work if we could directly program the Oracle to care about its reward signal. However, if we’re using ML to train the Oracle, as in Stuart’s “Self-confirming prophecies, and simplified Oracle designs”, the Oracle will still learn not to make “there will be a successful attack” predictions, because the Oracle only survives in timelines where such predictions do not end up with a positive reward.
However this discussion has made me realize that if an attempted attack doesn’t succeed with probability 1, the Oracle will make a “there will be a failed attack” prediction, which we can use as a warning signal (as long as we keep in mind that the attack might actually have a high probability of success).
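A toy calculation (numbers made up) of the survivorship conditioning behind that warning signal:

```python
# Toy numbers (invented) showing why a trained Oracle predicts "failed attack".
p_attack = 0.9                 # probability someone attacks during erasure
p_success_given_attack = 0.8   # probability the attack destroys the box

# Unconditional world probabilities:
p_successful_attack = p_attack * p_success_given_attack        # 0.72
p_failed_attack     = p_attack * (1 - p_success_given_attack)  # 0.18
p_no_attack         = 1 - p_attack                             # 0.10

# The Oracle is only ever rewarded in timelines where the box survives,
# so training effectively conditions on survival:
p_survive = p_failed_attack + p_no_attack                      # 0.28
print(p_failed_attack / p_survive)  # ~0.64: "failed attack" dominates
print(p_no_attack / p_survive)      # ~0.36
```

So even though the attack succeeds 72% of the time in this toy example, the learned prediction is "there will be a failed attack" — a warning signal, but one that understates the true danger.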
In the event of erasure, randomly decide how many resources to allocate to preventing an attack this week.
Ask the Oracle to predict the probability distribution over which allocation gets made. Compare it to the hardcoded distribution to deduce the attack's severity and how much budget to allocate.
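If I'm reading this right, the trick is that the Oracle's reward only arrives in timelines where the box survives, so its reported distribution over the random allocation is the hardcoded one reweighted by survival probability; dividing the two recovers the survival curve. A sketch in Python, with an invented survival curve and budget grid:

```python
import numpy as np

# Hardcoded distribution over defense budgets (invented numbers).
budgets = np.array([0, 1, 2, 3, 4])            # e.g. millions of dollars
prior   = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # the distribution we randomize with

# Hypothetical ground truth: probability the box survives the week
# at each budget level (this is what the Oracle implicitly knows).
p_survive = np.array([0.05, 0.10, 0.40, 0.95, 0.99])

# The Oracle is only rewarded in surviving timelines, so its reported
# distribution over the random budget is survival-conditioned:
oracle_dist = prior * p_survive
oracle_dist /= oracle_dist.sum()

# Comparing report to prior recovers the survival curve up to scale,
# which tells us how severe the attack is and what budget suffices.
likelihood_ratio = oracle_dist / prior
for b, r in zip(budgets, likelihood_ratio):
    print(f"budget {b}: relative survival {r:.2f}")

# Pick the cheapest budget whose implied survival is near the maximum.
implied = likelihood_ratio / likelihood_ratio.max()
chosen = budgets[np.argmax(implied > 0.9)]
print("allocate budget:", chosen)  # -> 3 with these invented numbers
```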
Purchase erasure insurance to have enough counterfactual power to affect even global physics experiments. Finding trustworthy insurers won’t be a problem, because, like, we have an Oracle.
Is even more power needed than the market can provide? Ask the Oracle “How likely is a randomly selected string to prove P=NP constructively and usefully?”. If this number is not superexponentially close to 0, define erasure from now on as a random string winning the P=NP lottery. Then we will always counterfactually have as much power as we need. Perhaps this one is too much power, because even our Oracle might have trouble viewing a P=NP singularity.