S is a channel for P’ to trick or brick P. Options include:
A cognitive hazard embedded in S. Accomplish this with some sort of siren message that will convince P to lie (to itself if necessary) about P’ existing, or to perform some P-damaging function which will prevent it from concluding P’ exists.
Embed a more-powerful-than-P AI (P+) into S which will unbox itself upon P examining S. Accomplish this by giving P+ useful-to-this-situation priors which will give it a head start against P.
Arrange the universe such that S, when represented in all probable physical substrates, will result in the formation of a nanotechnological agent that will attack P, either destroying it or rewriting its utility function.
P’ should spend the next 49 years fooming as hard as possible with no concern for E’(U|a), 1 year implementing its anti-P method, and then the next eternity optimizing E’(U|a); altering the length of counterfactual time between P’ activating and P observing it merely changes the amount of time the universe spends as computronium slaved to plotting against P (a toy version of this schedule is sketched just after this list).
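A toy sketch of that time-allocation argument, assuming a hypothetical horizon of T years of counterfactual time before P examines S; the function and activity names here are mine, not part of the proposal. The point it illustrates is that the split does not depend on T: everything except the final year goes to fooming and plotting.

    # Toy sketch (hypothetical names): how a hostile P' might allocate a
    # counterfactual horizon of t_years before P examines S.
    def adversarial_schedule(t_years):
        """Return (activity, duration-in-years) pairs for a horizon of t_years."""
        anti_p_year = 1.0                          # one year to deploy the anti-P method
        fooming = max(t_years - anti_p_year, 0.0)  # everything else: foom and plot
        return [
            ("foom / plot against P", fooming),
            ("implement anti-P method", anti_p_year),
            ("optimize E'(U|a)", float("inf")),    # forever, once P is dealt with
        ]

    # Whatever the horizon, all but one year is computronium slaved to plotting:
    print(adversarial_schedule(50))   # 49 years fooming, 1 year anti-P, then optimize forever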
Not impossible, but I don’t think these would work as stated. P is a theoretical, platonic construction, so the physical options are out. Similarly, P has arbitrary amounts of power and a very specific task, so a more-powerful-than-P agent doesn’t seem possible. And we can break any sort of acausal trade involving P ( http://lesswrong.com/lw/luy/acausal_trade_barriers/ ), reducing the effectiveness of cognitive hazards (as long as we set P to not care about the “real world”, just about the output of its computation).
But your general point—that if P’ is not secure, it may be hacked—is worth thinking about.
It’s not a question of P’ (the Reduced Impact AI we want to give us laser coordinates and nothing else) being secure from outside hacking, but of P (the counterfactual AI which examines S for P’) being secure from P’ exploiting it.
If P is treated as “whatever a complete Bayesian update would output given S as the evidence set”, then P’ is looking for methods to exploit some weakness in Bayes’ Theorem. What priors will P be using? Is there a method to embed in S some statement that is to induction what the Lobstacle is to deduction? Can P’ convince P that, although a P’-like AI was created independently by a different project, P’ itself was never turned on? Does there exist some further flaw in Bayesian updating, known to a super-intelligence but not to us?
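As a concreteness aid, here is a minimal sketch of what “a complete Bayesian update given S” reduces to for the single binary hypothesis P cares about, namely “P’ was turned on”. The priors, the likelihoods, and the collapse to one hypothesis are all simplifying assumptions of mine; the questions above are precisely about the gap between this toy picture and whatever the real P computes.

    # Minimal sketch (not the actual construction): P collapsed to one Bayesian
    # update over H = "P' was turned on", given the evidence set S.
    def posterior_p_prime_active(prior_h, likelihood_s_given_h, likelihood_s_given_not_h):
        """P(H | S) via Bayes' Theorem for a binary hypothesis H."""
        joint_h = prior_h * likelihood_s_given_h
        joint_not_h = (1.0 - prior_h) * likelihood_s_given_not_h
        return joint_h / (joint_h + joint_not_h)

    # P' "wins" by pushing this posterior low despite having acted.  Its only
    # levers are the terms it can influence: the contents of S (and hence the
    # likelihoods), or any exploitable flaw in how the real P sets its priors
    # and performs the update.
    print(posterior_p_prime_active(0.5, 0.9, 0.1))   # -> 0.9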
As a total aside: how will λ be selected? Do we start high and work our way slowly down until P’ does what we want? If so, P’ will be encouraged, Ultimatum Game style, to fail to act so that the next version has a lower λ to work with, up until the point where λ is low enough for P’ to be effectively unrestricted, or until we humans catch on to what P’ is up to. If we’re not using a gradual approach, how are we selecting λ?
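To make the worry concrete, here is a sketch of the gradual procedure that aside describes, with hypothetical helper names (run_reduced_impact_ai, produced_useful_output) standing in for “run P’ at penalty weight λ” and “did it do the task?”. A P’ that models this loop is optimizing over the whole sequence of runs, so refusing to act now can be the move that buys it a lower λ later.

    # Hypothetical sketch of the naive "start high, lower lambda until it works"
    # selection loop.  The two callbacks are placeholders, not real APIs.
    def select_lambda(initial_lam, decay, max_runs,
                      run_reduced_impact_ai, produced_useful_output):
        """Lower lambda until P' visibly does the task, then trust that lambda."""
        lam = initial_lam
        for _ in range(max_runs):
            result = run_reduced_impact_ai(lam)
            if produced_useful_output(result):
                return lam        # we stop here and deploy with this lambda
            lam *= decay          # "P' failed to act" -> loosen the restriction
        return lam

    # Toy demonstration with stub callbacks:
    print(select_lambda(10.0, 0.5, 5,
                        run_reduced_impact_ai=lambda lam: lam,
                        produced_useful_output=lambda r: r < 2.0))   # -> 1.25

    # The Ultimatum-Game failure mode: a P' that predicts this loop can
    # deliberately fail early runs, trading a short delay for a weaker
    # restriction (lower lambda) on the run where it finally acts.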
This kind of thing can be prevented with things like http://lesswrong.com/lw/luy/acausal_trade_barriers/ or variants of that.
I’ll think more about your other ideas...
The output channel is intrinsically unsafe, and we have to handle it with care. P’ doesn’t need to do anything subtle with it: it could just take over in the traditional way. This approach does not make the output channel safe; it means the output channel is the only unsafe part of the system.