I don’t completely understand the difference between your proposal and Stuart’s counterfactual oracles; can you explain?
The practical difference is that the counterfactual oracle design doesn’t address side-channel attacks, only unsafe answers.
Internally, the counterfactual oracle is implemented via the utility function: it wants to give an answer that would be accurate if it were unread. This puts no constraints on how it gets that answer, and I don’t see any way to extend the technique to cover the reasoning process.
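To make that concrete, here is roughly the reward structure I have in mind when I say “accurate if it were unread”: a minimal toy sketch, not Stuart’s exact formalism, with an erasure probability and scoring function I’ve made up.

```python
import random

# Toy sketch of a counterfactual-oracle style reward. The names
# (ERASURE_PROB, score_fn, ground_truth_fn) are illustrative, not Stuart's.
ERASURE_PROB = 0.01  # small chance the answer is never shown to anyone

def episode_reward(oracle_answer, ground_truth_fn, score_fn):
    """Reward the oracle receives for one question-answering episode."""
    if random.random() < ERASURE_PROB:
        # Erasure event: the answer is discarded unread. Once the true
        # outcome is observed, the answer is scored against it. This is
        # the only branch that ever pays out, so the oracle is optimized
        # for accuracy in the counterfactual world where nobody reads it.
        truth = ground_truth_fn()
        return score_fn(oracle_answer, truth)
    # Normal episode: humans read the answer, but the oracle gets zero
    # reward, so the answer's content can't be used to manipulate them.
    return 0.0
```

Note that nothing in that reward says anything about the computation that produced oracle_answer, which is exactly the side-channel gap I’m pointing at.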
My proposal is implemented via a constraint on the AI’s model of the world. Whether this is actually possible depends on the details of the AI; anything of a “try random stuff, repeat whatever gets results” nature would make it impossible, but an explicitly Bayesian thing like the AIXI family would be amenable. I think this is why Stuart has been working with the utility function lately, but I don’t think you can get a safe oracle this way without either creating an agent-grade safe utility function or constructing a superintelligence-proof traditional box.
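To gesture at what “a constraint on the AI’s model of the world” could look like in the explicitly Bayesian case, here is a toy sketch: a Bayesian mixture over environment models whose hypothesis class is restricted so that no model conditions on the oracle’s own answers at all. Every name here is illustrative, not a real library or anyone’s actual formalism.

```python
from dataclasses import dataclass
from typing import Callable, List

Observation = str

@dataclass
class EnvModel:
    prior: float
    # Predicts the probability of the next observation from past
    # observations ONLY. The oracle's own answers are not an input, so
    # no hypothesis in the class can even represent "my answer was read
    # and changed the world".
    predict: Callable[[List[Observation], Observation], float]

def posterior(models: List[EnvModel], history: List[Observation]) -> List[float]:
    """Bayesian posterior over the restricted hypothesis class."""
    weights = []
    for m in models:
        w = m.prior
        for i, obs in enumerate(history):
            w *= m.predict(history[:i], obs)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else [0.0] * len(models)
```

The restriction lives in what hypotheses the agent can represent, not in what it wants, which is why it needs something with an explicit hypothesis class to attach to; a “try random stuff, repeat whatever gets results” learner has no such handle.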