The idea with the 50-year delay is to be sure that no trace remains of the contents of the box. If we have a highly destructive ending to the privacy box, we can move s closer, but I wanted to move away from my previous “clouds of anti-matter plus bomb” scenario.
Instead of conditioning on the start of the AI, perhaps it's possible to measure the effect of the AI directly by restricting the contents of the privacy box. Let U be the region containing everything but the privacy box, let PB be the region containing the privacy box, and let O contain the output state. Ensure that U ∩ PB is O. Let U(t), PB(t), and O(t) be functions that yield the information content of these regions after time t (including the changes to the regions as things move around), and let U(0), PB(0), and O(0) be defined from the initial background information B. Let T be the time at which the AI should be done and should have altered O to indicate its output. Let Õ(t) represent a particular random state of the output. Let U_B(t) be the information content of the universe U(t) + PB(0), i.e. the time-evolution of the universe while PB is held constant at its initial state.

So long as P(S | U(T), PB(T), O(T)) / P(S | U_B(T), Õ(T)) = 1 for any event S, I think the AI's impact is confined to the region of the privacy box for the duration T, plus the output O. The equation implies that U(T) + PB(T) = U_B(T), or in other words that the privacy box in its final state is indistinguishable from its state in the background information. So long as the AI is restartable with a new background B and a new time 0, it doesn't matter if the AI is accidentally started or run more than once, as long as it can correctly identify PB. It becomes more difficult (hopefully impossible) to directly observe whether the AI has started or not, but a portion of O could be set aside for signalling the status of the AI.
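To make the ratio test concrete, here is a minimal toy sketch in Python. Everything in it (the discrete regions, evolve, run_ai, the event S, and the probability model) is hypothetical and only illustrates the check, not the actual construction: an "AI" that works only inside PB, erases its traces so that PB(T) = PB(0), and writes only into O should leave P(S | U(T), PB(T), O(T)) equal to P(S | U_B(T), Õ(T)) for every observation.

```python
from collections import Counter
from fractions import Fraction
import itertools

# Toy model: a "world" is a triple (u, pb, o) of small integers standing in
# for the information content of the outside universe U, the privacy box PB,
# and the output region O at a given time.

def evolve(u, pb, o):
    """Background time-evolution from t = 0 to t = T with no AI running:
    the outside world changes on its own, PB and O are left alone."""
    return (u + 1) % 5, pb, o

def run_ai(u, pb, o, seed):
    """An AI run that, by construction, only rewrites PB and O and then
    restores PB to its t = 0 contents before T.  `seed` stands in for the
    AI's internal randomness / computation."""
    u_T, _, _ = evolve(u, pb, o)      # the outside world evolves untouched
    scratch = (pb + seed) % 7         # work done inside the box ...
    pb_T = pb                         # ... then erased, so PB(T) == PB(0)
    o_T = scratch % 3                 # the answer written into O
    return u_T, pb_T, o_T

def event_S(u0, pb0, o0):
    """An arbitrary 'significant event' S, defined over the initial world."""
    return (u0 + o0) % 2 == 0

initial_worlds = list(itertools.product(range(5), range(7), range(3)))
seeds = range(7)

def conditional_S(samples):
    """P(S | observation) computed exactly from (observation, S) pairs."""
    hits, total = Counter(), Counter()
    for obs, s in samples:
        total[obs] += 1
        hits[obs] += int(s)
    return {obs: Fraction(hits[obs], total[obs]) for obs in total}

# Real branch: the AI runs and an observer sees (U(T), PB(T), O(T)).
real = []
for (u0, pb0, o0) in initial_worlds:
    for k in seeds:
        real.append((run_ai(u0, pb0, o0, k), event_S(u0, pb0, o0)))

# Counterfactual branch: no AI, PB frozen at PB(0), and the output replaced
# by a random value; the observer sees (U_B(T), PB(0), Õ(T)).
counterfactual = []
for (u0, pb0, o0) in initial_worlds:
    u_T, _, _ = evolve(u0, pb0, o0)
    for o_tilde in range(3):
        counterfactual.append(((u_T, pb0, o_tilde), event_S(u0, pb0, o0)))

p_real = conditional_S(real)
p_cf = conditional_S(counterfactual)

# The condition holds iff the two conditionals agree on every observation
# that both branches can produce, i.e. the ratio is exactly 1.
for obs in set(p_real) & set(p_cf):
    assert p_real[obs] == p_cf[obs], (obs, p_real[obs], p_cf[obs])
print("P(S | U(T), PB(T), O(T)) / P(S | U_B(T), Õ(T)) = 1 for all observations")
```

If run_ai is changed so that its scratch work leaks into U, or so that PB(T) ≠ PB(0), the two conditionals should come apart and the assertion should fail; that is the sense in which a ratio of 1 confines the AI's influence to the privacy box and O.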
I am not completely confident that the above approach works, or that it covers the loopholes that your approach did, so it might make sense to add the privacy-box condition to the original one so that the AI has to satisfy both conditions.
I’ll think about this. For the moment, it doesn’t seem to add much to X vs ¬X, but I may be wrong...