A sufficiently advanced reduced impact (RI) AI could run amok for 49 years and spend the last year resetting the local region to its calculated S|~X.
The RIAI can simulate humans without restriction.
Since P is probabilistic, S | X will differ from S | ~X. If the RIAI using P’ is sufficiently powerful, it can bias the state at s toward its own utility function in subtle ways that are not necessarily visible to P, perhaps by altering the probability of future AI development. To prevent such a scenario s could be split into s_1 at 50 years and s_2 at 60 years with similar restrictions on reduced impact at both surfaces. s_1 will filter out the RIAI itself, and s_2 will filter out any biases the RIAI might introduce at s_1. It seems highly unlikely that the RIAI could introduce biases at s_1 that survive until s_2 and then result in drastic changes afterward.
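A minimal sketch of what this two-surface requirement might look like, reusing the ratio condition quoted later in the thread (here S_1^a and S_2^a are assumed to denote the data on the hypersurfaces s_1 and s_2, O^a the actual output, Õ^a the random output, and B the background knowledge):

\[
\frac{P(S_1^a \mid X, O^a, B)}{P(S_1^a \mid \neg X, \tilde{O}^a, B)} \approx 1
\quad\text{and}\quad
\frac{P(S_2^a \mid X, O^a, B)}{P(S_2^a \mid \neg X, \tilde{O}^a, B)} \approx 1,
\]

so any influence that slips past the 50-year surface would still have to be invisible again at the 60-year surface.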
The RIAI may misidentify itself and X in the world. The RIAI must self-identify as “the RIAI triggered by the event X” in its final equation, and it probably shouldn’t care at all about X happening versus the fact that it is actually running.
What happens if a second AI (reduced impact or not) is required within 50 years? That may fall under the black hole/singularity caveat you mentioned, as the RIAI would have to predict the actions of a (probably more advanced) second AI.
Defining the range of O for a specific problem may be an AI-hard problem. If the range is too small the RIAI will output a useless action that trivially results in P(S | X)/P(S | ~X) = 1, perhaps because the RIAI and everything else are obliterated by the comet. Of course, if the range is too large, the RIAI can maximize its utility function through O beyond s. For some problems there is no safe range (“Build me an FAI” is an obvious example, but perhaps even something as simple as “How do we build a laser on the moon and aim it in 12 months before the comet hits?” qualifies). Even if problems are broken up into sub-problems, the RIAIs can collude to output individual Os just as if they were one complex O asked of a single RIAI.
It would certainly be handy if we could drop AIs into black holes that could only evaporate via a limited set of possibilities corresponding to the range of O. Maybe dropping AIs into the sun with limited time to transmit information would be sufficient.
A sufficiently advanced reduced impact (RI) AI could run amok for 49 years and spend the last year resetting the local region to its calculated S|~X.
Sufficiently to hide its 49 years of running amok—from an intelligence greater than itself? It would have to move materials around to the places they would have been otherwise, reconstruct all the art and engineering that would have happened in those 49 years, create de novo humans and human societies with false memories of long term existence. And catch up with those light beams emitted 1.5 years ago, replacing them with some vision of happy humans who never existed, doing stuff they never did...
And remember, it doesn’t need to reset only the local region; it’ll have to reset most of the light-cone as well. And the further out in space it goes, the more effort it takes to remove traces of the reset. You could build mirrors all around the solar system to catch the rays of light that escape with the true information, but how would you then hide the mirrors (and the material they were made from)?
To make this even less probable, we could allow S to be less noisy the further we move away from the Earth (in the current frame of reference). So the data S would be very fine far away from us (in space), and noisier closer in. This makes it virtually impossible to run amok undetectably.
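One purely illustrative way to write such a distance-dependent noise profile (the functional form and the parameters σ_0 and r_0 are assumptions, not part of the proposal): let the coarse-graining scale of S at distance r from Earth be

\[
\sigma(r) = \frac{\sigma_0}{1 + r/r_0},
\]

so S is recorded coarsely near Earth (r ≈ 0) and ever more finely far away, where escaping traces of a reset would be hardest to clean up.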
To prevent such a scenario s could be split into s_1 at 50 years and s_2 at 60 years
Your s_1 and s_2 ideas are good, though we could also make s thicker: make it the whole space-time interval between s_1 and s_2.
The RIAI may misidentify itself and X in the world.
That’s a risk, yes. We do need these symbols to be properly grounded, or else the whole method fails.
What happens if a second AI (reduced impact or not) is required within 50 years? That may fall under the black hole/singularity caveat you mentioned, as the RIAI would have to predict the actions of a (probably more advanced) second AI.
If the second AI is independent from the first, this is a non-issue: the effect of X versus ¬X should remain undetectable, whatever the second AI does or doesn’t do.
If the fact of running the second AI depends on the knowledge that the first AI worked, we need to make sure that that knowledge comes exclusively from the contents of the output channel O; if that’s the case, then the first AI will still act to reduce its non-O impact.
Defining the range of O for a specific problem may be an AI-hard problem.
Defining O more generally may indeed be the key to the problem. For this example, though, it seems sufficient to define O to be the output of a particular wire, and make sure the rest of the setup is such that we don’t get any useful information except via this wire’s output (so no peeking!).
Sufficiently to hide its 49 years of running amok—from an intelligence greater than itself? It would have to move materials around to the places they would have been otherwise, reconstruct all the art and engineering that would have happened in those 49 years, create de novo humans and human societies with false memories of long term existence.
I agree that it would be very difficult to obscure 49 years of very different history, but there is a perverse incentive for the RIAI to effectively increase the duration and scope of its privacy box with respect to S by its choice of O; perhaps allowing the comet to fragment into a dense dust cloud around the Earth that effectively obscures Earth from the rest of the Universe for 49 years. If it only has to worry about Earth, it is conceivable that it could simulate S | ~X while S | X actually happens, and then revert the state to S | ~X by s_1.
To make this even less probable, we could allow S to be less noisy the further we move away from the Earth (in the current frame of reference). So the data S would be very fine far away from us (in space), and noisier closer in. This makes it virtually impossible to run amok undetectably.
That might actually be detrimental if the factor was large enough to allow noticeable local deviations between S|X and S|~X. If it’s okay for a star on the other side of the galaxy to have a 1 meter error, what does that imply for the local error in the position of a person?
Your s_1 and s_2 ideas are good, though we could also make s thicker: make it the whole space-time interval between s_1 and s_2.
Good idea. A thick s would prevent a “blip” of activity that the RIAI could try to sneak in after s_1 that would dissipate by s_2. My guess is that an RIAI would work by proving its non-effect on S, not by actually computing S, so I don’t think it would be intractable to thicken s to some degree. It would be nice if s could extend to t at infinity, but the uncertainty in S would make that virtually impossible. Perhaps the required certainty of P could decay with passing time such that “P(S(t)^a|X,O^a,B)/P(S(t)^a|¬X,Õ^a,B) | a” converged toward 1 as t went to infinity, while still requiring the ratio to be ~1 over all t in s.
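One way of writing that decaying requirement (a sketch only; the tolerance ε(t) and its shape are assumptions, reading “the required certainty of P decays” as a bound that is tight on s and loosens afterwards):

\[
\left| \frac{P(S(t)^a \mid X, O^a, B)}{P(S(t)^a \mid \neg X, \tilde{O}^a, B)} - 1 \right| \le \varepsilon(t),
\qquad
\varepsilon(t) \approx 0 \ \text{for } t \in s,
\quad
\varepsilon(t) \ \text{growing slowly for } t \ \text{beyond } s_2,
\]

so the ratio must be ~1 everywhere on s, while the constraint is only gradually relaxed at later times.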
Setting the start of s to an hour after the RIAI starts (when it’s supposed to turn its privacy box into lots of entropy) would also limit the time it had to run amok.
Defining O more generally may indeed be the key to the problem. For this example, though, it seems sufficient to define O to be the output of a particular wire, and make sure the rest of the setup is such that we don’t get any useful information except via this wire’s output (so no peeking!).
Just don’t let it PWM-encode anything else into the control signal sent over the wire. That was my reasoning behind limiting O to the minimum necessary bandwidth; let all the output bits be useful for solving the problem with no room for additional signals.
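A toy sketch of that idea (hypothetical numbers and names throughout, not part of the actual setup): if the signal on the wire is forced through a fixed alphabet at a fixed tick rate, the channel carries no more than its budgeted bits, leaving no slack in amplitude or timing for a PWM-style side channel. Whether every one of those bits is genuinely needed for the control problem is then a separate design question.

# Toy illustration (hypothetical parameters): force the output channel O
# through a fixed-rate, fixed-alphabet quantizer so the wire carries no more
# than its budgeted bits; timing and fine amplitude carry no extra information.

ALLOWED_LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]  # the only values O may take
TICK_SECONDS = 1.0                            # one symbol per fixed tick

def quantize(control_value):
    """Snap a requested control value to the nearest allowed level."""
    return min(ALLOWED_LEVELS, key=lambda level: abs(level - control_value))

def emit(control_values):
    """Emit one quantized symbol per tick; the schedule is fixed in advance."""
    return [(i * TICK_SECONDS, quantize(v)) for i, v in enumerate(control_values)]

# Whatever fine-grained signal is requested, at most
# len(control_values) * log2(len(ALLOWED_LEVELS)) bits ever leave the wire.
print(emit([0.13, 0.52, 0.97]))  # [(0.0, 0.25), (1.0, 0.5), (2.0, 1.0)]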
What if the RIAI can observe O? Suppose it makes a decision to output O^impossible if and only if it does not observe O^impossible. X starts the RIAI and it outputs O^impossible, observes O^impossible, assumes Õ^impossible from its earlier decision, concludes ~X (an impossible event is less likely than the RIAI being mistaken about its identity), and is now outside the scope of reduced impact. There may be other ways for the RIAI to exploit self-interaction with its output, perhaps by triggering a restart from the output (aim the laser at the part of the wire after X that turns it back on), which would allow it to conclude that it has started despite ~X and is thus a different AI, free from the restrictions of reduced impact.
perhaps allowing the comet to fragment into a dense dust cloud around the Earth that effectively obscures Earth from the rest of the Universe for 49 years.
That’s an interesting thought, I’ll have to consider it more.
Your ideas on thickening s or decaying the certainty could be interesting.
Setting the start of s to an hour after the RIAI starts (when it’s supposed to turn its privacy box into lots of entropy) would also limit the time it had to run amok.
The idea with the 50-year delay is to be sure that no trace remains of the contents of the box. If we have a highly destructive ending to the privacy box, we can move s closer, but I wanted to move away from my previous “clouds of anti-matter plus bomb” scenario.
What if the RIAI can observe O? Suppose it...
That sounds like anthropic reasoning, which I banned the AI from using (or else counterfactuals in which it doesn’t exist make no sense—and the hypothetical super AI (which doesn’t exist) makes no sense either).
The idea with the 50-year delay is to be sure that no trace remains of the contents of the box. If we have a highly destructive ending to the privacy box, we can move s closer, but I wanted to move away from my previous “clouds of anti-matter plus bomb” scenario.
Instead of conditioning on the start of the AI, perhaps it’s possible to measure the effect of the AI directly by restricting the contents of the privacy box itself. Let U be the region containing everything but the privacy box, let PB be the region containing the privacy box, and let O contain the output state; ensure that the intersection of U and PB is O. Let U(t), PB(t), and O(t) be functions that yield the information content of those regions after time t (including the changes to the regions as things move around), and let U(0), PB(0), and O(0) be defined from the initial background information B. Let T be the time at which the AI should be done and should have altered O to indicate its output. Let Õ(t) represent a particular random state of the output. Finally, let U_B(t) be the information content of a universe U(t) + PB(0), i.e. the time-evolution of the universe while keeping PB constant.

So long as P(S | U(T), PB(T), O(T)) / P(S | U_B(T), Õ(T)) = 1, I think the AI’s impact is confined to the region of the privacy box (for the duration T) and to O. The equation implies that U(T) + PB(T) = U_B(T), or in other words that the privacy box in its final state is indistinguishable from its state in the background information. So long as the AI is restartable with a new background B and time 0, it doesn’t matter if the AI is accidentally started or run more than once, as long as it can correctly identify PB. It becomes more difficult (hopefully impossible) to directly observe whether the AI started or not, but a portion of O could be set aside for signalling the status of the AI.
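Restating the proposed condition in one display, with the same symbols as above (Õ(T) the random output state, and U_B obtained by freezing the privacy box at its initial contents):

\[
\frac{P\big(S \mid U(T),\, PB(T),\, O(T)\big)}{P\big(S \mid U_B(T),\, \tilde{O}(T)\big)} = 1,
\qquad
U_B(t) := U(t) + PB(0),
\]

which, as noted, forces U(T) + PB(T) to carry the same information as U_B(T): outside of O, the final state of the privacy box is indistinguishable from its initial state.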
I am not completely confident that the above approach works, or that it covers the loopholes that your approach did, so it might make sense to add the privacy-box condition to the original one so that the AI has to satisfy both conditions.
Instead of conditioning on the start of the AI, perhaps it’s possible to measure the effect of the AI directly...
I’ll think about this; for the moment, it doesn’t seem to add much to X vs ¬X, but I may be wrong...