This reminds me of an example I described in this SL4 post:
After suggesting in a previous post [1] that AIs who want to cooperate with
each other may find it more efficient to merge than to trade, I realized
that voluntary mergers do not necessarily preserve Bayesian rationality,
that is, rationality as defined by standard decision theory. In other words,
two “rational” AIs may find themselves in a situation where they won’t
voluntarily merge into a “rational” AI, but can agree to merge into an
“irrational” one. This seems to suggest that we shouldn’t expect AIs to be
constrained by Bayesian rationality, and that we need an expanded definition
of what rationality is.
Let me give a couple of examples to illustrate my point. First consider an
AI with the only goal of turning the universe into paperclips, and another
one with the goal of turning the universe into staples. Each AI is
programmed to get 1 util if at least 60% of the accessible universe is
converted into its target item, and 0 utils otherwise. Clearly they can’t
both reach their goals (assuming their definitions of “accessible universe”
overlap sufficiently), but they are not playing a zero-sum game, since it is
possible for them to both lose, if for example they start a destructive war
that devastates both of them, or if they just each convert 50% of the
universe.
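A minimal sketch of this payoff structure (the 60% threshold and the 0/1 utils are from the example above; the list of outcomes is just for illustration):

```python
# Binary utilities from the paperclip/staple example: each AI gets 1 util
# iff at least 60% of the accessible universe becomes its target item.

def paperclip_utility(paperclip_fraction: float) -> int:
    return 1 if paperclip_fraction >= 0.60 else 0

def staple_utility(staple_fraction: float) -> int:
    return 1 if staple_fraction >= 0.60 else 0

# A few possible divisions of the universe (paperclip share, staple share).
# Both AIs winning is impossible, but both losing is not, so the game is
# not zero-sum.
for clips, staples in [(0.6, 0.4), (0.4, 0.6), (0.5, 0.5), (0.1, 0.1)]:
    print((clips, staples), paperclip_utility(clips), staple_utility(staples))
```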
So what should they do? In [1] I suggested that two AIs can create a third
AI whose utility function is a linear combination of the utilities of the
original AIs, and then hand off their assets to the new AI. But that doesn’t
work in this case. If they tried this, the new AI would get 1 util if at
least 60% of the universe is converted to paperclips, and 1 util if at least
60% of the universe is converted to staples. In order to maximize its
expected utility, it will pursue the one goal with the highest chance of
success (even if it’s just slightly higher than the other goal). But if
these success probabilities were known before the merger, the AI whose goal
has a smaller chance of success would have refused to agree to the merger.
That AI should only agree if the merger allows it to have a close to 50%
probability of success according to its original utility function.
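A sketch of why this fails, under the (illustrative) assumption that the merged AI simply compares the two success probabilities and commits entirely to the higher one:

```python
# The merged AI's utility is a sum of the two original 0/1 utilities, so
# its expected utility from pursuing one goal is simply that goal's
# success probability. It therefore commits fully to whichever goal is
# (even slightly) more likely to succeed.

def merged_ai_goal(p_paperclips: float, p_staples: float) -> str:
    return "paperclips" if p_paperclips > p_staples else "staples"

# With 49% vs 51%, the merged AI pursues staples exclusively, and the
# paperclip AI's expected utility from agreeing is roughly zero.
print(merged_ai_goal(p_paperclips=0.49, p_staples=0.51))  # -> staples
```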
The problem here is that standard decision theory does not allow a
probabilistic mixture of outcomes to have a higher utility than the
mixture’s expected utility, so a 50⁄50 chance of reaching either of two
goals A and B cannot have a higher utility than 100% chance of reaching A
and a higher utility than 100% chance of reaching B, but that is what is
needed in this case in order for both AIs to agree to the merger.
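In symbols, this is just the linearity of expected utility over lotteries (writing U for the merged agent's utility function):

```latex
U\!\left(\tfrac{1}{2}A + \tfrac{1}{2}B\right)
  = \tfrac{1}{2}\,U(A) + \tfrac{1}{2}\,U(B)
  \le \max\{U(A),\, U(B)\},
```

so no 50/50 lottery over A and B can be strictly preferred to both pure outcomes, which is what a mutually acceptable merger would require here.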
I remember my reaction when first reading this was “both AIs delegate their power, then a jointly trusted coinflip is made, then a new AI is constructed which maximizes one of the utility functions”. That seems to solve the problem in general.
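A minimal sketch of that proposal (the function names and the randomness source are illustrative, not from the original comment):

```python
import random

# The proposed merger: both AIs hand over their resources, a jointly
# trusted fair coin is flipped, and the successor AI is built to maximize
# exactly one of the original utility functions, whichever the coin picks.
# Ex ante each original AI gets a ~50% chance that its goal is pursued,
# while the successor itself remains an ordinary expected-utility maximizer.

def coinflip_merge(utility_a, utility_b, rng=None):
    rng = rng or random.Random()  # stands in for a jointly trusted coin
    return utility_a if rng.random() < 0.5 else utility_b

def paperclip_utility(frac): return 1 if frac >= 0.6 else 0
def staple_utility(frac): return 1 if frac >= 0.6 else 0

successor_goal = coinflip_merge(paperclip_utility, staple_utility)
print("successor maximizes:", successor_goal.__name__)
```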
But if these success probabilities were known before the merger, the AI whose goal has a smaller chance of success would have refused to agree to the merger. That AI should only agree if the merger allows it to have a close to 50% probability of success according to its original utility function.
Why does the probability need to be close to 50% for the AI to agree to the merger? Shouldn’t its threshold for agreeing to the merger depend on how likely one or the other AI is to beat the other in a war for the accessible universe?
Is there an assumption that the two AIs are roughly equally powerful, and that a both-lose scenario is relatively unlikely?
It is first past the post: minorities get nothing. There might be an implicit assumption that the newly created agent agrees with the old agents about the probabilities. With 49% plausible paperclips and 51% plausible staples, the new agent will act 100% for staples and not serve paperclips at all.
Ah, maybe the way to think about it is that if I think I have a 30% chance of success before the merger, then I need to have a 30%+epsilon chance of my goal being chosen after the merger. And my goal will only be chosen if it is estimated to have the higher chance of success.
And so, if we assume that the chosen goal is definitely going to succeed post-merger (since there’s no destructive war), that means I need to have a 30%+epsilon chance that my goal has a >50% chance of success post-merger. Or in other words “a close to 50% probability of success”, just as Wei said.
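Spelling out that arithmetic, under the same assumption that the chosen goal succeeds with probability close to 1:

```latex
\text{EU}_{\text{pre-merger}} = 0.3,
\qquad
\text{EU}_{\text{post-merger}} \approx \Pr(\text{my goal is chosen}) \times 1,
```

so the merger is acceptable only if Pr(my goal is chosen) ≥ 0.3 + ε, and my goal is chosen exactly when it is estimated to be the more likely of the two to succeed.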