Looks like you’re making a logical error. Creating a machine that solves the halting problem is prohibited by logic. For many applications, assuming a sufficiently powerful and logically consistent oracle is good enough, but precisely these kinds of games, where you ask a machine to predict its own output or the output of a system involving itself, are where you become logically inconsistent. Indeed, imagine asking the oracle to simulate an equivalent version of itself and to output the opposite of what its simulation outputs. This may seem like a derived question, but most “interesting” self-referential questions boil down to an instance of this. I think once you fix the logical inconsistency, you’re left with a problem equivalent to AI in a box: boxed AI P is stronger than friendly AI A but has an agenda.
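To make the self-simulation argument concrete, here is a minimal Python sketch. It assumes a deterministic predictor that takes a zero-argument program and returns the boolean it thinks the program will output. All names here (`make_contrarian`, `predictor_is_refuted`, the three toy predictors) are made up for illustration and are not part of the original setup.

```python
from typing import Callable

Program = Callable[[], bool]
Predictor = Callable[[Program], bool]


def make_contrarian(predict: Predictor) -> Program:
    """Build a program that asks the predictor about itself and does the opposite."""
    def contrarian() -> bool:
        return not predict(contrarian)
    return contrarian


# Three toy predictors (hypothetical, for illustration only).
def always_true(program: Program) -> bool:
    return True


def always_false(program: Program) -> bool:
    return False


def simulating(program: Program) -> bool:
    # "Predicts" by running the program; on the contrarian this never terminates.
    return program()


def predictor_is_refuted(predict: Predictor) -> bool:
    """True if the predictor is wrong about its contrarian or never answers."""
    contrarian = make_contrarian(predict)
    try:
        guess = predict(contrarian)
        actual = contrarian()
    except RecursionError:
        # Python's stand-in for "the simulation never halts".
        return True
    return guess != actual
```

Whatever deterministic predictor you plug in, it either answers wrongly about its own contrarian or fails to answer at all, which is the inconsistency described above.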
Alternatively, if you’re assuming A is itself unaligned (rather than friendly) and has the goal of getting the right answer at any cost, then it looks like you need some more assumptions about A’s structure. For example, if A is sufficiently sophisticated and knows it has access to a much more powerful but untrustworthy oracle, it might know to implement a Merlin-Arthur protocol.
There is precisely one oracle, O. A, T, and P are computable. And crucially, the oracle’s answer does not depend on itself in any way. This question is not self-referential. P might try to predict A and O, but there is no guarantee it will be correct.
P has a restricted amount of compute compared to A, but still enough to be able to reason about A in the abstract.
We are asking how we should design A.
If you have unlimited compute and want to predict something, you can use Solomonoff induction. But some of the hypotheses you might find are AIs that think they are a hypothesis in your induction and are trying to escape.
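As a rough illustration only: real Solomonoff induction mixes over all programs and is uncomputable, so the sketch below swaps in a tiny hand-picked hypothesis class with made-up description lengths. The hypothesis `honest_until_8` is a caricature of a hypothesis that matches the data so far and only later pushes the prediction somewhere else. Every name and number here is an assumption for the example.

```python
from fractions import Fraction

# (name, assumed description length in bits, function from index to predicted bit)
HYPOTHESES = [
    ("all_zeros", 3, lambda i: 0),
    ("alternating", 5, lambda i: i % 2),
    # Agrees with "alternating" on the first 8 bits, then always says 1.
    ("honest_until_8", 9, lambda i: i % 2 if i < 8 else 1),
]


def posterior(observed):
    """Weight each hypothesis by 2^-length, zero out the inconsistent ones, normalise."""
    weights = {}
    for name, length, hypothesis in HYPOTHESES:
        consistent = all(hypothesis(i) == bit for i, bit in enumerate(observed))
        weights[name] = Fraction(1, 2 ** length) if consistent else Fraction(0)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the observations")
    return {name: weight / total for name, weight in weights.items()}


def predict_next(observed):
    """Posterior probability that the next bit is 1."""
    post = posterior(observed)
    index = len(observed)
    return sum(post[name] * hypothesis(index) for name, _, hypothesis in HYPOTHESES)
```

After observing `[0, 1] * 4`, the longer hypothesis has survived every check, and it contributes its share of posterior weight to the prediction for bit 8 even though the short `alternating` hypothesis fits the data equally well.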
Sorry, I misread this. I read your question as O outputting some function T that is most likely to answer some set of questions you want to know the answer to (which would be self-referential as these questions depend on the output of T). I think I understand your question now.
What kind of ability do you have to know the “true value” of your sequence B?
If the paperclip maximizer P is able to control the value of your Turing machine, and if you are a one-boxing AI (and this is known to P), then of course you can make deals/communicate with P. In particular, if the sequence B is generated by some known but slow program, then in exchange for promising to make a few paperclips, you can try to set up an Arthur-Merlin zero-knowledge proof protocol, which you can then use to keep P honest (after making the paperclips as promised).
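A full Arthur-Merlin or zero-knowledge protocol is beyond a short example, so the sketch below swaps in something much simpler: a randomized spot-check, where the verifier recomputes a few randomly chosen entries of the claimed sequence with the slow program. `slow_b` is a hypothetical stand-in for whatever program generates B, and `spot_check` is an illustrative name, not an established protocol.

```python
import random


def slow_b(i: int) -> int:
    """Hypothetical stand-in for the known but slow program that generates B."""
    return bin(i * i + 1).count("1") % 2


def spot_check(claimed, reference, samples: int, rng: random.Random) -> bool:
    """Recompute a random subset of claimed entries; accept only if all of them match."""
    count = min(samples, len(claimed))
    indices = rng.sample(range(len(claimed)), count)
    return all(claimed[i] == reference(i) for i in indices)


honest_claim = [slow_b(i) for i in range(200)]
forged_claim = [1 - bit for bit in honest_claim]  # every entry is wrong

accepted_honest = spot_check(honest_claim, slow_b, 20, random.Random(0))
accepted_forged = spot_check(forged_claim, slow_b, 20, random.Random(0))
```

If a fraction f of the claimed entries is wrong, k sampled checks all miss with probability at most (1 - f)^k, so the verifier only pays for k slow evaluations instead of recomputing the whole sequence. This gives none of the zero-knowledge guarantees of the real protocol.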
To be clear though, this is a strategy for an agent A that somehow has as its goals only the desire to compute B together with some kind of commitment to following through on agreements. If A is genuinely aligned with humans, the rule “don’t communicate/make deals with malicious superintelligent entities, at least until you have satisfactorily solved the AI in a box and similar underlying problems” should be a no-brainer.
I don’t think that these Arthur-Merlin proofs are relevant. Here A has a lot more compute than P. A is simulating P and can see and modify P however A sees fit.