I’m not sure if I understand this correctly. Suppose there’s an unaligned AI that has a high prior that P=NP is true, or that a halting oracle exists in the universe, so it pours all its resources into searching for a polynomial time algorithm for NP-complete problems, or trying to find the halting oracle. (Assuming the user doesn’t understand or have an opinion) your AI would match the unaligned AI’s prior and do the same thing?
Also, do you have a suggestion of how the idea would work when there are multiple unaligned AIs with different priors, and power shifts between them as various uncertainties are resolved in favor of some and against others?
In the case of P=NP, if both the aligned AI and the unaligned AI it’s competing with pour the same percentage of their resources into searching for a polynomial time algorithm for NP-complete problems, this seems fine. The aligned AI could have gained more power in expectation by not searching for these algorithms, but at least it didn’t lose any relative power. At some future point (as Paul points out) humans are going to change the AI’s design, which might include changing its prior on P=NP.
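To make the "no relative loss" point concrete with made-up numbers (a toy model, nothing more, assuming both AIs compound whatever they don't divert):

```python
# Toy model: an aligned AI and an unaligned AI each start with 100 units of
# resources, divert the same fraction f into the (fruitless) P=NP search each
# period, and grow whatever is left at rate r.

def power_after(initial, f, r, periods):
    x = initial
    for _ in range(periods):
        x = x * (1 - f) * (1 + r)
    return x

aligned = power_after(100, f=0.3, r=0.1, periods=10)
unaligned = power_after(100, f=0.3, r=0.1, periods=10)

# Both end up with fewer absolute resources than if f were 0, but the aligned
# AI's share of total power is unchanged:
print(aligned / (aligned + unaligned))  # 0.5
```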
Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.
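To see why this hurts B, here is the same exploit with illustrative numbers (the specific probabilities are mine, only the direction of the shift matters):

```python
# Illustrative numbers only: how inflating the prior on S shifts B's estimate
# that A would win a war, via the law of total probability.
p_win_given_s = 0.9      # P(A wins a war | S), chosen by A to be high
p_win_given_not_s = 0.4  # P(A wins a war | not S)

def p_a_wins(p_s):
    # Total probability over S and not-S
    return p_s * p_win_given_s + (1 - p_s) * p_win_given_not_s

print(round(p_a_wins(0.1), 2))  # 0.45: B's estimate before matching A's prior on S
print(round(p_a_wins(0.9), 2))  # 0.85: B's estimate after matching A's adopted prior
# If the deal B will accept tracks its estimate that it would lose a war,
# A extracts much better terms just by choosing S adversarially.
```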
It does seem like there’s something right about your idea, but I’m confused about what that is.
Here’s my story for this. A gets to design an agent A’ (which has a weird prior), then we design an agent B’ to be competitive with A’. But actually this isn’t possible using the strategy in the post, since A’ believes it will win, so B’ will also believe that A’ will win.
What we should have done was to design an agent B that is competitive with A directly. That is, when doing this reasoning about priors and such, A has its own beliefs about things (and doesn’t necessarily expect to win against B); B should inherit those priors rather than those of A’.
My point was that originally B did inherit its prior from A, but when A hands its resources to A’, B would detect a power shift and change its prior to match A’ (as suggested by Paul, which I assumed you agreed with). Do you have a different idea about what to do when power shifts between unaligned AIs? Or do you propose to have the aligned AI distinguish between different kinds of power shifts somehow?
Also, it seems easy for A to hide its original prior and pretend that it was A’ all along. How would you deal with that?
I see what you’re arguing. I’m actually not sure what to do about ongoing power shifts; I’m currently thinking of the problem as “we have a benchmark system A and want to design a system B that is competitive with A”. The benchmark system should be the original AI system (before it does things like hiding its prior). Of course we can’t do that if A already has weird beliefs, so in this case we have to do something like tracing back to the process that produced A. Hopefully this results in a system that has “symmetric” weird beliefs (e.g. if A falsely thinks it owns a lot of resources, then we design B to also falsely think it owns a lot of resources).
At some meta level, we could see any process that produces an AI (e.g. some AI research/development/deployment strategy implemented by humans) as an agent A and then attempt to design an aligned competitive version B of this process. This high-level picture looks pretty sketchy at the moment.
This is intended as an interim solution, i.e. you would expect to transition to using a “correct” prior before accessing most of the universe’s resources (say within 1000 years). The point of this approach is to avoid losing influence during the interim period.
If there are multiple unaligned AIs with different beliefs, you would take a weighted average of their beliefs using their current influence. As their influence changed, you would update the weighting.
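A minimal sketch of what that weighting could look like (the per-AI beliefs and influence shares below are placeholders, not anything from the post):

```python
# Minimal sketch of influence-weighted belief mixing over some statement S.
beliefs = {"AI_1": 0.9, "AI_2": 0.2, "AI_3": 0.5}    # each unaligned AI's P(S)
influence = {"AI_1": 0.5, "AI_2": 0.3, "AI_3": 0.2}  # current influence shares

def mixed_belief(beliefs, influence):
    total = sum(influence.values())
    return sum((influence[ai] / total) * p for ai, p in beliefs.items())

print(round(mixed_belief(beliefs, influence), 2))  # 0.61

# When influence shifts, re-run the same mixture with the updated weights:
influence = {"AI_1": 0.2, "AI_2": 0.6, "AI_3": 0.2}
print(round(mixed_belief(beliefs, influence), 2))  # 0.40
```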
(This might result in an incoherent / Dutch-bookable set of beliefs, in which case you are free to run the Dutch book and do even better.)
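One way to read that parenthetical, again with made-up numbers: if re-weighting predictably moves the mixed P(S) from 0.61 to 0.40 without any new evidence, that shift itself defines a sure-profit pair of trades, and the aligned AI might as well be the party collecting it:

```python
# Illustrative only: prices a holder of the mixed beliefs would call fair for a
# claim paying $1 if S, before and after the influence re-weighting above.
price_t1 = 0.61  # fair price at time 1
price_t2 = 0.40  # fair price at time 2, after weights shift with no new evidence

# Sell the claim at t1, buy it back at t2; the position is closed before S
# resolves, so the profit is the same whether S turns out true or false.
profit = price_t1 - price_t2
print(round(profit, 2))  # 0.21, guaranteed either way
```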