What came to mind for me before reading the spoilered options was a variation on #2, with the difference being that, instead of trying to extract P’s hypothesis about B, we modify T to get a T’ in which P is replaced by a P’ that is a paperclip minimizer instead of a maximizer. We then run both, and only use the output when the two agree, or, if they give probabilities, use the average (or some similar combination).
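A minimal sketch of that combination rule, purely illustrative: run_T and run_T_prime stand in for whatever actually produces the two probability estimates, and the tolerance is an assumption about what counts as “agreement”.

    def combined_prediction(run_T, run_T_prime, tolerance=1e-6):
        """Combine the predictions of T (containing P, a paperclip maximizer)
        and T' (the same machine with P replaced by the minimizer P')."""
        p = run_T()              # probability T assigns to the event about B
        p_prime = run_T_prime()  # probability T' assigns to the same event
        if abs(p - p_prime) <= tolerance:
            # The two agree (up to tolerance), so accept the shared prediction.
            return p
        # Otherwise fall back to averaging, so a push by P in one direction
        # is offset by the symmetric push from P'.
        return (p + p_prime) / 2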
Perhaps this could have an advantage over #2 if it is easier to negate what P is optimizing for than to extract P’s model of B. (edit: though, of course, if extracting the model from P is feasible, that would be better than the scheme I described)
On the other hand, maybe this could still be dangerous if P and P’ have shared instrumental goals with regard to your predictions for B?
Though, if P has a good model of you, A, then presumably both P and P’ would expect you to do this, and I don’t know what it would then make sense for them to do.
It seems like they would both expect that, while they may be able to influence you, insofar as that influence would affect the expected number of paperclips, it would be canceled out by the other’s influence (assuming that the ability to influence the number of paperclips via changing your prediction of B is symmetric, which, I guess, it might not be).
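To make the cancellation intuition concrete, here is a toy numeric example (not part of the original argument): assume, purely for illustration, that pushing your reported probability up yields more expected paperclips and pushing it down yields fewer, and that both agents can push by the same amount.

    # Toy illustration of the cancellation argument under the averaging rule.
    honest = 0.5    # the prediction an honest model of B would give
    delta = 0.125   # how far each agent can push (chosen so the float arithmetic is exact)

    p_report = honest + delta        # P (maximizer) pushes the prediction up
    p_prime_report = honest - delta  # P' (minimizer) pushes it down by the same amount

    average = (p_report + p_prime_report) / 2
    print(average == honest)         # True: symmetric pushes cancel in the average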
I suppose this would be a reason why P would want its thought processes to be inscrutable to those simulating it, so that the simulators are unable to construct P’.
__
As a variation on #4: if P is running on a computer inside a physics simulation in T, then almost certainly a direct emulation of that computer running P would run faster than T does, and therefore whatever model of B P has can be computed faster than T can compute it. What if, upon discovering this fact about T, we restrict the search among Turing machines to only those machines that run faster than T?
This would include emulations of P, and would therefore include emulations of P’s model of B (which would probably be even faster to run than emulating all of P?), but I imagine that an emulation of P stripped of the physics simulation and such would still have a longer description than just P’s model of B. But maybe it wouldn’t.
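A minimal sketch of that runtime-restricted search, assuming two hypothetical helpers: enumerate_machines() yields candidate programs in order of description length, and run_with_budget() runs one with a step limit, returning None if the limit is exceeded.

    import itertools

    def restricted_hypotheses(observation, t_runtime_steps, max_candidates=10**6):
        """Keep only candidate machines that reproduce the observed data while
        running strictly faster than T did on the same data."""
        survivors = []
        for machine in itertools.islice(enumerate_machines(), max_candidates):
            output = run_with_budget(machine, observation, budget=t_runtime_steps - 1)
            if output == observation:  # reproduced the data within the stricter budget
                survivors.append(machine)
        return survivors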
Thanks for a thoughtful comment.
Assuming that P and P’ are perfectly antialigned, they won’t cooperate. However, they need to be really antialigned for this to work: if there is some obscure borderline object that P thinks is a paperclip and P’ thinks isn’t, they can work together to tile the universe with it.
I don’t think it would be that easy to change evolution into a reproductive-fitness minimiser, or to negate a human’s values.
If P and P’ are antialigned, then in the scenario where you only listen to them if they agree, for any particular prediction at least one of them will consider disagreeing better than that. The game theory is a little complicated, but they aren’t being incentivised to report their predictions honestly.
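A toy check of that claim (the payoff numbers are made up purely for illustration): treat the fallback, where you ignore them because they disagreed, as worth some fixed number of expected paperclips, and compare each possible agreed prediction against it.

    # Under the "only listen if they agree" rule, at least one agent prefers to
    # break the agreement unless the agreed prediction is worth exactly as many
    # paperclips as being ignored.
    fallback_clips = 10.0                 # expected paperclips if you ignore them

    candidate_predictions = {             # agreed prediction -> expected paperclips
        "B looks safe": 14.0,
        "B looks dangerous": 6.0,
        "B is neutral": 10.0,
    }

    for prediction, clips in candidate_predictions.items():
        p_defects = clips < fallback_clips        # the maximizer prefers the fallback
        p_prime_defects = clips > fallback_clips  # the minimizer prefers the fallback
        stable = not (p_defects or p_prime_defects)
        print(prediction, "stable" if stable else "someone defects")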
Actually, A has to be able to manage not only correct and competent adversaries, but deluded and half-mad ones too.
I think P would find it hard to be inscrutable. It is impossible to obfuscate arbitrary code.
I agree with your final point. Though for any particular string X, the fastest Turing machine that produces it is basically print(X). This is why we use short TMs, not just fast ones.
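A quick illustration of that length/speed trade-off, using the length of Python source as a stand-in for Turing machine description length (the specific string is arbitrary):

    # The string X has a short *description* (a small looping program), but its
    # fastest producer is essentially the literal print(X), which memorizes X
    # and so has a description about as long as X itself.
    X = "ab" * 1_000_000

    literal_program = f'print("{X}")'            # the print(X)-style program
    looping_program = 'print("ab" * 1_000_000)'  # short program that does the work

    print(len(literal_program))  # about 2,000,009 characters, roughly |X|
    print(len(looping_program))  # 23 characters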