I mean that we don’t have any process that looks like debate that could produce an agent that wasn’t trying to kill you without being competitive
It took me an embarrassingly long time to parse this. I think it says: any debate-trained agent that isn’t competitive will try to kill you. But I think the next clause clarifies that any debate-trained agent whose competitor isn’t competitive will try to kill you. This may be moot if I’m getting that wrong.
So I guess you’re imagining running Debate with horizons that are long enough that, in the absence of a competitor, the remaining debater would try to kill you. It seems to me that you put more faith in the mechanism that I was saying didn’t comfort me. I had just claimed that a single-agent chatbot system with a long enough horizon would try to take over the world:
The existence of an adversary may make it harder for a debater to trick the operator, but if they’re both trying to push the operator in dangerous directions, I’m not very comforted by this effect. The probability that the operator ends up trusting one of them doesn’t seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.
Running a debate between two entities that would both kill me if they could get away with it seems critically dangerous.
Suppose two equally matched people are trying to shoot a basket from opposite ends of the 3-point line, each trying to score before their opponent does. Each time they shoot, the two basketballs hopefully collide above the hoop and bounce off each other. Making the basket first = taking over the world and killing us on their terms. My view is that if they’re both trying to make a basket, a basket being made is a more likely outcome than a basket not being made (if it’s not too difficult for them to make the proverbial basket).
Side comment: so I think the existential risk is quite high in this setting, but I certainly don’t think it’s so low that there’s little risk left to reduce with the boxing-the-moderator strategy. (I don’t know if you’d have disputed that, but I’ve had conversations with others who did, so this seems like a good place to put this comment.)