It’s an interesting idea that could perhaps work if we could directly program the Oracle to care about its reward signal. However, if we’re using ML to train the Oracle, as in Stuart’s “Self-confirming prophecies, and simplified Oracle designs”, the Oracle will still learn not to make “there will be a successful attack” predictions, because it only survives in timelines where such a prediction turns out wrong and therefore earns no positive reward.
However, this discussion has made me realize that if an attempted attack doesn’t succeed with probability 1, the Oracle will make a “there will be a failed attack” prediction instead, which we can use as a warning signal (as long as we keep in mind that the attack might actually have a high probability of success).
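To make this survivorship effect concrete, here is a toy Monte Carlo sketch (the setup and all numbers are illustrative assumptions, not any particular Oracle design): episodes where the attack succeeds never deliver a reward, so the Oracle’s effective training data contains only failed attacks.

```python
import random

# Toy model: an attack is attempted every episode and succeeds with
# probability TRUE_P_SUCCESS. A successful attack destroys the reward
# channel, so that episode never contributes to training.
TRUE_P_SUCCESS = 0.9
N_EPISODES = 100_000

observed = []
for _ in range(N_EPISODES):
    success = random.random() < TRUE_P_SUCCESS
    if success:
        continue  # no surviving reward signal; episode is invisible to training
    observed.append(success)  # only failed attacks are ever recorded

# The empirical success rate in the training data is exactly 0, so the
# Oracle learns to predict "there will be a failed attack" even though
# the true success probability is 0.9.
learned = sum(observed) / max(1, len(observed))
print(f"true P(success) = {TRUE_P_SUCCESS:.2f}, learned P(success) = {learned:.2f}")
```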
In the event of erasure, randomly draw this week’s attack-prevention budget from a hardcoded distribution.
Then ask the Oracle to predict the probability distribution over which allocation was made. Since the Oracle’s prediction is implicitly conditioned on its reward channel surviving, comparing it to the hardcoded distribution lets us deduce the attack’s severity and how much budget to actually allocate (see the sketch below).
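One way to read the comparison step, as a sketch under the assumption that the Oracle’s reported distribution is the hardcoded prior reweighted by the probability that the reward channel survives each budget level (the budget levels and both distributions are made-up numbers):

```python
import numpy as np

# Hardcoded distribution we randomize the erasure-week budget over.
budgets = np.array([0, 1, 2, 3])
prior = np.array([0.25, 0.25, 0.25, 0.25])

# Hypothetical Oracle output: its predicted distribution over which budget
# was drawn. Survivorship implicitly conditions this on the attack failing.
oracle_dist = np.array([0.05, 0.15, 0.30, 0.50])

# If oracle_dist(b) is proportional to prior(b) * P(attack fails | b), then
# the ratio recovers the per-budget survival probability up to a shared
# normalizer.
ratio = oracle_dist / prior
p_fail_given_budget = ratio / ratio.max()  # assume the best budget nearly always works

for b, p in zip(budgets, p_fail_given_budget):
    print(f"budget {b}: estimated P(attack fails) ∝ {p:.2f}")
```

A large gap between the two distributions then signals a severe attack, and the point where the recovered survival curve flattens out is a natural budget to actually allocate.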
Purchase erasure insurance to have enough counterfactual power to affect even global physics experiments. Finding trustworthy insurers won’t be a problem, because, after all, we have an Oracle.
Is even more power needed than the market can supply? Ask the Oracle “How likely is a randomly selected string to prove P=NP constructively and usefully?”. If this number is not superexponentially close to 0, define erasure from now on as a random string winning the P=NP lottery. Then we will always counterfactually have as much power as we need. Perhaps this one is too much power, though, since even our Oracle might have trouble predicting a P=NP singularity.
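If one wanted to pin down that redefinition, it might look like the sketch below; note that the proof verifier is a pure placeholder, since nothing here specifies how a candidate proof would actually be checked, and the string length is an arbitrary choice:

```python
import secrets

STRING_LENGTH_BYTES = 10**6  # arbitrary; any fixed sampling scheme works

def is_constructive_p_eq_np_proof(candidate: bytes) -> bool:
    """Hypothetical verifier: True iff `candidate` encodes a machine-checkable,
    constructive, practically usable proof that P = NP. Purely a placeholder."""
    raise NotImplementedError

def erasure_event() -> bool:
    # Erasure fires exactly when the freshly sampled string wins the
    # P=NP lottery, so the counterfactual worlds we condition on are
    # ones that contain a constructive P=NP proof.
    candidate = secrets.token_bytes(STRING_LENGTH_BYTES)
    return is_constructive_p_eq_np_proof(candidate)
```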