It’s not a question of P’ (the Reduced Impact AI we want to give us laser coordinates and nothing else) being secure from outside hacking, but of P (the counterfactual AI which examines S for P’) being secure against P’ exploiting it.
If P is treated as “whatever a complete Bayesian update would output given S as the evidence set”, then P’ is looking for methods to exploit some weakness in Bayes’ theorem. What priors will P be using? Is there a method to embed in S some statement that is to induction as the Löbstacle is to deduction? Can it convince P that, although a P’-like AI was created independently by a different project, P’ itself was never turned on? Does there exist some further flaw in Bayesian updating that would be known to a superintelligence but not to us?
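Just as a reference point (this is ordinary Bayes, nothing specific to this proposal), the update P would be performing over hypotheses H_i given the evidence set S is:

```latex
% Standard Bayesian update over hypotheses H_i given the evidence set S.
% The prior Pr(H_i) is exactly the free choice the "what priors?" question points at.
\[
  \Pr(H_i \mid S) \;=\; \frac{\Pr(S \mid H_i)\,\Pr(H_i)}{\sum_{j} \Pr(S \mid H_j)\,\Pr(H_j)}
\]
```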
As a total aside: how will λ be selected? Do we start high and work our way slowly down until P’ does what we want? If so, P’ will be encouraged, Ultimatum Game style, to fail to act so that the next version will have a lower λ to work with, up until the point where λ is low enough for P’ to be effectively unrestricted, or for us humans to catch on to what P’ is up to. If we’re not using a gradual approach, how are we selecting λ?
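To make that incentive concrete, here is a toy sketch (my own numbers and naming, assuming a simple “raw utility minus λ·impact” objective and a fixed schedule of λ reductions after each run that does nothing):

```python
# Toy model (my construction, not from the post) of the lambda-ratchet incentive:
# if each inert run makes the operators lower lambda for the next version, an agent
# that anticipates the whole schedule does best by stalling until lambda is small.

def penalized_utility(raw_utility, impact, lam):
    """Reduced-impact style objective: raw utility minus a lambda-weighted impact penalty."""
    return raw_utility - lam * impact

# Hypothetical numbers, purely for illustration.
RAW_UTILITY_OF_ACTING = 10.0                  # what P' gains by acting
IMPACT_OF_ACTING = 3.0                        # measured "impact" of that action
LAMBDA_SCHEDULE = [5.0, 3.0, 2.0, 1.0, 0.5]   # operators lower lambda after each run that does nothing

def best_round_to_act(schedule):
    """Return (score, round_index, lambda) for the round a schedule-aware P' would pick."""
    scored = [
        (penalized_utility(RAW_UTILITY_OF_ACTING, IMPACT_OF_ACTING, lam), i, lam)
        for i, lam in enumerate(schedule)
    ]
    return max(scored)

if __name__ == "__main__":
    score, round_index, lam = best_round_to_act(LAMBDA_SCHEDULE)
    print(f"Acts in round {round_index} at lambda={lam}, penalized utility {score:.1f}")
    # With these numbers, acting in round 0 scores 10 - 5*3 = -5, while waiting for
    # round 4 scores 10 - 0.5*3 = 8.5, so 'failing to act' early is the winning move.
```

Of course this only bites if P’ can predict, or coordinate with, what its successor versions will face.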
This kind of thing can be prevented with things like http://lesswrong.com/lw/luy/acausal_trade_barriers/ or variants of that.
I’ll think more about your other ideas...