It’s definitely not clear to me that updatelessness + Yudkowsky’s solution prevent threats. The core issue is that a target and a threatener face a prima facie symmetric decision problem of whether to use strategies that depend on their counterpart’s strategy or strategies that do not depend on their counterpart’s strategy.[1]
In other words, the incentive targets have to use non-dependent strategies that incentivise favourable (no-threat) responses from threateners is the same as the incentive threateners have to use non-dependent strategies that incentivise favourable (give-in-to-threat) responses from targets. This problem is discussed in more detail in parts of Responses to apparent rationalist confusions about game / decision theory and in Updatelessness doesn’t solve most problems.
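To make the symmetry concrete, here’s a minimal toy sketch of the kind of game I have in mind (the payoff numbers and the way commitment is modelled are my own illustrative assumptions, not anything from the linked posts): whichever side fixes its non-dependent strategy first, the other side’s best response hands that side the favourable outcome.

```python
# Toy model of the symmetric commitment problem (all payoff numbers are my own
# illustrative assumptions, not taken from the comment or the linked posts).
# The threatener chooses whether to threaten and, if refused, whether to execute;
# the target chooses whether to give in. A non-dependent strategy is modelled as
# fixing one side's choices before the other side best-responds.

GAIN = 5       # what the threatener extracts if the target gives in
MAKE_COST = 1  # small cost to the threatener of issuing a threat at all
EXEC_COST = 2  # additional cost of actually carrying out the threat
HARM = 10      # harm to the target if the threat is carried out

def payoffs(threaten: bool, give_in: bool, execute: bool) -> tuple[int, int]:
    """Return (threatener payoff, target payoff)."""
    if not threaten:
        return (0, 0)
    if give_in:
        return (GAIN - MAKE_COST, -GAIN)
    if execute:
        return (-MAKE_COST - EXEC_COST, -HARM)
    return (-MAKE_COST, 0)  # threat made, refused, then backed down

# Case 1: the target fixes "refuse all threats". Best-responding to that, the
# threatener prefers not to threaten (any threat now only loses them payoff).
threatener_options = [(t, e) for t in (False, True) for e in (False, True)]
best1 = max(threatener_options,
            key=lambda o: payoffs(o[0], give_in=False, execute=o[1])[0])
print("target commits to refusing -> threatener threatens:", best1[0])  # False

# Case 2: the threatener fixes "threaten and always execute". Best-responding to
# that, the target prefers to give in (losing GAIN beats suffering HARM).
best2 = max((True, False),
            key=lambda g: payoffs(True, give_in=g, execute=True)[1])
print("threatener commits to threaten-and-execute -> target gives in:", best2)  # True
```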
There are potential symmetry breakers that privilege a no-threat equilibrium, such as the potential for cooperation between different targets. However, there are also potential symmetry breakers in the other direction. I expect Yudkowsky is aware of the symmetry of this problem and either thinks the symmetry breakers in favour of no-threats seem very strong, or is just very confident in the superintelligences-should-figure-this-stuff-out heuristic. Relatedly, this post argues that mutually transparent agents should be able to avoid most of the harm of threats being executed, even if they are unable to prevent threats from being made.
But these are different arguments to the one you make here, and I’m personally unconvinced that even these arguments are strong enough to make it unimportant for us to work on preventing harmful threats from being made by or against the AIs that humanity deploys.
FYI, a lot of the Center on Long-Term Risk’s research is motivated by this problem; please reach out to us if you’re interested in working on it!
Examples of non-dependent strategies would include:
- Refusing all threats regardless of why they were made
- Refusing threats to the extent prescribed by Yudkowsky’s solution regardless of why they were made
- Making threats regardless of a target’s refusal strategy when the target is incentivised to give in

An example of a dependent strategy would be:
- Refusing threats more often when a threatener accurately predicted whether or not you would refuse in order to determine whether to make a threat; and refusing threats less often when they did not predict you, or did so less accurately
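As a purely illustrative sketch of that dependent strategy (the specific refusal probabilities and the “prediction accuracy” parameter below are my own made-up placeholders, not anything from the posts above):

```python
import random
from typing import Optional

def refusal_probability(prediction_accuracy: Optional[float]) -> float:
    """Probability of refusing a threat, given how accurately the threatener
    predicted your refusal policy when deciding whether to threaten.

    prediction_accuracy is None if the threatener didn't model your policy at
    all, otherwise a number in [0, 1]. The constants are arbitrary placeholders
    chosen only to show the shape of the strategy.
    """
    if prediction_accuracy is None:
        return 0.3  # threat wasn't conditioned on your policy: refuse less often
    # The more accurately the threatener simulated you before threatening, the
    # more your refusal policy shaped their decision, so the more you refuse.
    return 0.3 + 0.7 * prediction_accuracy

def respond(prediction_accuracy: Optional[float], rng: random.Random) -> str:
    """Sample a response to a threat under the dependent strategy."""
    p = refusal_probability(prediction_accuracy)
    return "refuse" if rng.random() < p else "give in"

rng = random.Random(0)
for accuracy in (None, 0.5, 0.95):
    print(accuracy, round(refusal_probability(accuracy), 2), respond(accuracy, rng))
```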
For posterity, and in case it’s of interest to you, my current sense on this stuff is that we should basically throw out the frame of “incentivizing” when it comes to respectful interactions between agents or agent-like processes. This is because, regardless of whether it’s more like a threat or a cooperation-enabler, there’s still an element of manipulation that I don’t think belongs in multi-agent interactions we (or our AI systems) should consent to.
I can’t be formal about what I want instead, but I’ll use the term “negotiation” for what I think is more respectful. In negotiation there is more of a dialogue that supports choices being made in an informed way, and less of this element of trying to get ahead of your trading partner by messing with the world such that their “values” will cause them to want to do what you want them to do.
I will note that this “negotiation” doesn’t necessarily have to take place in literal time and space. There can be processes of agents thinking about each other that resemble negotiation and qualify to me as respectful, even without a physical conversation. What matters, I think, is whether the logical process that led to another agent’s choices can be seen in this light.
And I think the cases in which another agent is “incentivizing” my cooperation in a way that I actually like are exactly the cases in which that agent considered what the outcome would have been of a negotiating process that respected me.