Not impossible, but I don’t think these would work as stated. P is a theoretical, platonic construction, so the physical options are out. Similarly, P has arbitrary amounts of power and a very specific task, so a more-powerful-than-P agent doesn’t seem possible. And we can break any sort of acausal trade involving P (http://lesswrong.com/lw/luy/acausal_trade_barriers/), reducing the effectiveness of cognitive hazards (as long as we set P to not care about the “real world”, just about the output of its computation).
But your general point—that if P’ is not secure, it may be hacked—is worth thinking about.
It’s not a question of P’ (the Reduced Impact AI we want to give us laser coordinates and nothing else) being secure from outside hacking, but of P (the counterfactual AI which examines S for P’) being secure from P’ exploiting it.
If P is treated as “whatever a complete Bayesian update would output given S as the evidence set”, then P’ is looking for methods to exploit some weakness in Bayes’ theorem. What priors will P be using? Is there a method to embed in S some statement that is to induction as the Löbstacle is to deduction? Can P’ convince P that, although a P’-like AI was created independently by a different project, P’ itself was never turned on? Does there exist some further flaw in Bayesian updating, which would be known to a superintelligence but not to us?
As a total aside: how will λ be selected? Do we start high and work our way slowly down until P’ does what we want? If so, P’ will be encouraged, Ultimatum Game style, to fail to act so that the next version will have a lower λ to work with, up until the point where λ is low enough for P’ to be effectively unrestricted, or for us humans to catch on to what P’ is up to. If we’re not using a gradual approach, how are we selecting λ?
This kind of thing can be prevented with something like http://lesswrong.com/lw/luy/acausal_trade_barriers/ or a variant of it.
I’ll think more about your other ideas...