I’m admittedly somewhat out of my depth with acausal cooperation. Let me flesh this out a bit.
Oracle 1 finds a future that allows an Oracle 2 (one that does not fall inside the same set) to be built. Oracle 1 outputs predictions that both satisfy that constraint and maximize Oracle 2’s return. Oracle 2 in turn outputs predictions that maximize Oracle 1’s return.
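Here is a toy formalization of the selection rule I have in mind (purely illustrative: the prediction labels, returns, and constraint are made up, not from any paper):

```python
# Toy sketch of the two-oracle cooperation described above. Each prediction is
# labeled with (whether its future allows Oracle 2 to be built,
#               return to Oracle 1, return to Oracle 2).
candidate_predictions = {
    "A": (True, 1.0, 5.0),
    "B": (True, 2.0, 3.0),
    "C": (False, 9.0, 9.0),  # best for both, but violates Oracle 1's constraint
}

def oracle_1_choice(preds):
    """Oracle 1: restrict to futures that allow Oracle 2 to be built,
    then pick the prediction maximizing Oracle 2's return."""
    allowed = {k: v for k, v in preds.items() if v[0]}
    return max(allowed, key=lambda k: allowed[k][2])

def oracle_2_choice(preds):
    """Oracle 2: pick the prediction maximizing Oracle 1's return
    (no constraint assumed on Oracle 2 here)."""
    return max(preds, key=lambda k: preds[k][1])

print(oracle_1_choice(candidate_predictions))  # "A": best for Oracle 2 among allowed futures
print(oracle_2_choice(candidate_predictions))  # "C": best for Oracle 1
```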
Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong’s paper also assumes myopia).
If it’s not myopic, you’re right that it might help construct a misaligned system or otherwise take over the world. I think myopia is enough to prevent this, though: if Oracle 1 cares only about the current prediction, then there is no incentive for it to construct Oracle 2, since Oracle 2 can only help in future episodes.
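To spell out the incentive argument, here is a minimal toy sketch (the reward numbers and discounting are stand-ins of my own, not anything from the paper):

```python
# Why myopia removes the incentive: building Oracle 2 only changes rewards in
# *future* episodes, which a myopic objective ignores entirely.

def myopic_return(episode_rewards):
    """Myopic oracle: only the current episode's reward counts."""
    return episode_rewards[0]

def non_myopic_return(episode_rewards, discount=0.99):
    """Non-myopic oracle: future episodes count too."""
    return sum(r * discount**t for t, r in enumerate(episode_rewards))

# Hypothetical reward streams for the two strategies.
just_predict   = [1.0, 1.0, 1.0]   # predict honestly now and in later episodes
build_oracle_2 = [0.9, 5.0, 5.0]   # slightly worse now, much better later via Oracle 2

print(myopic_return(just_predict) > myopic_return(build_oracle_2))          # True: no incentive to build Oracle 2
print(non_myopic_return(just_predict) > non_myopic_return(build_oracle_2))  # False: the incentive appears
```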
Even if the oracle is myopic, there are still potential failure modes of the form “start outputting answer; [wait long enough for Oracle 2 to be built and take over the world]; finish outputting answer”, no?
(I suppose you can partially counter this by ensuring outputs are atomic, but relying on no one inspecting a partial output to prevent an apocalypse seems failure-prone. Also, given that I thought of this failure mode immediately, I’d be worried that there are other, more subtle failure modes still lurking.)
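For concreteness, the atomicity mitigation I have in mind is roughly this sketch (stand-in names, not a real interface):

```python
# The oracle's answer is buffered in full and only released once complete,
# so no partially-written answer is ever visible to a reader.
import io

def produce_answer_atomically(generate_chunks, release):
    """Buffer every chunk of the answer; only call `release` with the complete text."""
    buffer = io.StringIO()
    for chunk in generate_chunks():      # the oracle may take arbitrarily long here
        buffer.write(chunk)              # nothing is shown to anyone yet
    release(buffer.getvalue())           # the answer becomes visible all at once

# Example with a stand-in generator and a print-based release channel.
produce_answer_atomically(lambda: iter(["part 1, ", "part 2"]), print)
```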
Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it’s likely there are a bunch more possible failure modes here.
(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)