One way to fix this might be to ensure there is only one set of oracles, built such that every oracle in it assumes the null prediction (and no manipulation elsewhere, and so on) from all the oracles in the set.
Can this work? Consider the case where Oracle2 hasn’t been built yet.
Hm, I still think it works? All oracles assume null outputs from all oracles, including themselves. Once a new oracle is built, it is considered part of this set of null-output oracles. (There are other hitches, e.g. Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs? But this doesn’t help the oracles coordinate, as far as I can see.)
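To make the assumption concrete, here is a minimal sketch of what I mean (the names, e.g. `NullSetOracle` and `world_model`, are hypothetical, just to pin down what “assume null outputs from all oracles, including not-yet-built ones” would look like):

```python
NULL_OUTPUT = None  # the "erasure" / counterfactual output


class NullSetOracle:
    def __init__(self, world_model):
        # world_model(oracle_policy) -> a simulated future, where oracle_policy
        # specifies what every oracle (already built or not) is assumed to output
        self.world_model = world_model

    def predict(self, question):
        # Condition on *all* oracles in the set outputting null, including
        # oracles that have not been built yet.
        assumed_future = self.world_model(
            oracle_policy=lambda oracle_id, q: NULL_OUTPUT
        )
        return self._most_accurate_answer(question, assumed_future)

    def _most_accurate_answer(self, question, future):
        ...  # choose the answer that best describes the assumed future
```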
I’m admittedly somewhat out of my depth with acausal cooperation. Let me flesh this out a bit.
Oracle1 finds a future in which an Oracle2 (one that does not belong to the same set) gets built. Oracle1 outputs predictions that both satisfy that constraint and maximize return for Oracle2. Oracle2, in turn, outputs predictions that maximize return for Oracle1.
Ah yes, I missed that the oracle needs to be myopic, i.e. to care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong’s paper also assumes myopia).
If it’s not myopic, you’re right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this, though: if Oracle1 cares only about the current prediction, then there is no incentive for it to help construct Oracle2, since Oracle2 can only help in future episodes.
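To spell out the incentive argument in toy form: under a non-myopic objective, sacrificing a little accuracy now can pay off if it causes Oracle2 to exist and boost later episodes, whereas a myopic objective never sees those later terms. A minimal sketch (the numbers are purely illustrative):

```python
def nonmyopic_return(episode_rewards, discount=0.99):
    # values all future episodes, so helping build Oracle2 can raise later terms
    return sum(discount ** t * r for t, r in enumerate(episode_rewards))

def myopic_return(episode_rewards):
    # values only the current episode; later terms are ignored entirely
    return episode_rewards[0]

# A "scheming" answer trades a bit of accuracy now (0.8 vs 1.0) for Oracle2
# boosting the rewards of later episodes (2.0 each).
honest   = [1.0, 1.0, 1.0]
scheming = [0.8, 2.0, 2.0]

assert nonmyopic_return(scheming) > nonmyopic_return(honest)  # incentive to scheme
assert myopic_return(scheming) < myopic_return(honest)        # no such incentive
```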
Even if the oracle is myopic, there are still potential failure modes of the form “start outputting answer; [wait long enough for Oracle2 to be built and take over the world]; finish outputting answer”, no?
(I suppose you can partially counter this by ensuring outputs are atomic, but relying on no-one inspecting a partial output to prevent an apocalypse seems failure-prone. Also, given that I thought of this failure mode immediately, I’d be worried that there are other more subtle failure modes still lurking.)
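(For concreteness, “atomic outputs” could look something like the sketch below, assuming a file-based output channel: stage the full answer, then publish it with an atomic rename, so a reader sees either nothing or the complete answer. Note this only prevents anyone from reading a partially written file; it does nothing about someone peeking wherever the answer is being composed.)

```python
import os
import tempfile

def publish_answer_atomically(answer: str, path: str) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    # stage the complete answer in a temp file on the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(answer)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on both POSIX and Windows: the destination path
    # only ever shows its old contents or the complete new answer
    os.replace(tmp_path, path)
```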
Yeah, this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it’s likely there are a bunch more possible failure modes here.
(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)