So what if AI Debate survives this concern? That is, suppose we can reliably find a horizon-length for which running AI Debate is not existentially dangerous. One worry I’ve heard raised is that human judges will be unable to effectively judge arguments way above their level. My reaction to this is that I don’t know, but it’s not an existential failure mode, so we could try it out and tinker with evaluation protocols until it works, or until we give up. If we can run AI Debate without incurring an existential risk, I don’t see why it’s important to resolve questions like this in advance.
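For concreteness, the debate setup under discussion can be sketched minimally as follows. This is only an illustrative skeleton, not any particular implementation; the agent and judge interfaces are placeholders, and `num_rounds` stands in for the horizon-length mentioned above.

```python
def debate(question, agent_a, agent_b, judge, num_rounds=4):
    """Two agents alternate arguments; a (human) judge then picks a winner."""
    transcript = [("question", question)]
    for _ in range(num_rounds):  # num_rounds plays the role of the horizon
        transcript.append(("A", agent_a(question, transcript)))
        transcript.append(("B", agent_b(question, transcript)))
    # The judge sees the whole transcript and decides which side argued more
    # convincingly; in training, each agent would be optimized against this
    # judgment alone.
    return judge(transcript)

# Toy usage: trivial "agents" and a judge that favors the side with more
# total argument text (a stand-in for a real human evaluation).
agent_a = lambda q, t: "Yes, and here is a detailed supporting argument."
agent_b = lambda q, t: "No."
judge = lambda t: max("AB", key=lambda side: sum(len(m) for w, m in t if w == side))
result = debate("Is the sky blue?", agent_a, agent_b, judge)  # returns "A"
```

Tinkering with evaluation protocols, in this framing, amounts to swapping out the `judge` while keeping the game structure fixed.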
There are two reasons to worry about this:
The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.
I think it is unlikely for a scheme like debate to be safe without being approximately competitive—the goal is to get honest answers which are competitive with a potential malicious agent, and then use those answers to ensure that malicious agent can’t cause trouble and that the overall system can be stable to malicious perturbations. If your honest answers aren’t competitive, then you can’t do that and your situation isn’t qualitatively different from a human trying to directly supervise a much smarter AI.
In practice I doubt the second consideration matters—if your AI could easily kill you in order to win a debate, probably someone else’s AI has already killed you to take your money (and long before that your society totally fell apart). That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs.
Even if you were the only AI project on earth, I think competitiveness is the main thing responsible for internal regulation and stability. For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing). More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won’t be at a competitive advantage.
The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.
Point taken.
I think it is unlikely for a scheme like debate to be safe without being approximately competitive
The way I map these concepts, this feels like an elision to me. I understand what you’re saying, but I would like to have a term for “this AI isn’t trying to kill me”, and I think “safe” is a good one. That’s the relevant sense of “safe” when I say “if it’s safe, we can try it out and tinker”. So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.
use those answers [from Debate] to ensure … that the overall system can be stable to malicious perturbations
Is “overall system” still referring to the malicious agent, or to Debate itself? If it’s referring to Debate, I assume you’re talking about malicious perturbations from within rather than malicious perturbations from the outside world?
If your honest answers aren’t competitive, then you can’t do that and your situation isn’t qualitatively different from a human trying to directly supervise a much smarter AI.
You’re saying that if we don’t get useful answers out of Debate, we can’t use the system to prevent malicious AI, and so we’d have to just try to supervise nascent malicious AI directly? I certainly don’t dispute that if we don’t get useful answers out of Debate, Debate won’t help us solve X, including when X is “nip malicious AI in the bud”.
It certainly wouldn’t hurt to know in advance whether Debate is competitive enough, but if it really isn’t dangerous itself, then I think we’re unlikely to become so pessimistic about the prospects of Debate, through our arguments and our proxy experiments, that we don’t even bother trying it out, so it doesn’t seem especially decision-relevant to figure it out for sure in advance. But again, I take your earlier point that a better understanding of the landscape is always going to have some worth.
if your AI could easily kill you in order to win a debate, probably someone else’s AI has already killed you
This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn’t kill you (and helps you achieve your other goals). But it seems you’re saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.
That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs
It seems fairly likely to me that the next best AGI project behind Deepmind, OpenAI, the USA, and China is way behind the best of those. I would think people in those projects would have months at least before some dark horse catches up.
So competitiveness still matters somewhat, but here’s a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker. [Edit: “valuable” is the wrong word. I guess I mean better at killing.]
For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing)
Do you think something like IDA is the only plausible approach to alignment? If so, I hadn’t realized that, and I’d be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: “any agent (we make) that learns to act will be treacherous if treachery is possible.” Are all learning agents fundamentally out to get you? I suppose that’s a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn’t be recognized.
Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.
More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won’t be at a competitive advantage.
I don’t understand the dichotomy here. Are you talking about the problem of how to make it hard for a debater to take over the world within the course of a debate? Or are you talking about the problem of how to make it hard for a debater to mislead the moderator? The solutions to those problems might be different, so maybe we can separate the concept “misaligned” into “ambitious” and/or “deceitful”, to make it easier to talk about the possibility of separate solutions.
So competitiveness still matters somewhat, but here’s a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker.
Definitely a disagreement, I think that before anyone has an AGI that could beat humans in a fistfight, tons of people will have systems much much more valuable than a mechanical turk worker.
Okay. I’ll lower my confidence in my position. I think these two possibilities are strategically different enough, and each sufficiently plausible, that we should come up with separate plans/research agendas for both of them. And then those research agendas can be critiqued on their own terms.
For the purposes of this discussion, I think this qualifies as a useful tangent, and this is the thread where a related disagreement comes to a head.
Edit: “valuable” was the wrong word. “Better at killing” is more to the point.
nobody else has anything more valuable than an Amazon Mechanical Turk worker
Huh? Isn’t the ML powering e.g. Google Search more valuable than an MTurk worker? Or Netflix’s recommendation algorithm? (I think I don’t understand what you mean by “value” here.)
Are you predicting there won’t be any lethal autonomous weapons before AGI? It seems like if that ends up being true, it would only be because we coordinated well to prevent that. More generally, we don’t usually try to kill people, whereas we do try to build AGI.
(Whereas I think at least Paul usually thinks about people not paying the “safety tax” because the unaligned AI is still really good at e.g. getting them money, at least in the short term.)
Are you predicting there won’t be any lethal autonomous weapons before AGI?
No… thanks for pressing me on this.
Better at killing in a context where either the operator would punish the agent if they knew, or the state would punish the operator if they knew. So the agent has to conceal its actions at whichever level the punishment would occur.
How about a recommendation engine that accidentally learns to show depressed people sequences of videos that affirm their self-hatred and lead them to commit suicide? (It seems plausible that something like this has already happened, though idk if it has.)
I think the thing you actually want to talk about is an agent that “intentionally” deceives its operator / the state? I think even there I’d disagree with your prediction, but it seems more reasonable as a stance (mostly because depending on how you interpret the “intentionally” it may need to have human-level reasoning abilities). Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?
Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?
Yes, that would count. I suspect that many “unskilled workers” would (alone) be better at inciting violence while maintaining plausible deniability than GPT-N at the point in time the leading group had AGI. Unless it’s OpenAI, of course :P
Regarding intentionality, I suppose I didn’t clarify the precise meaning of “better at”, which I did take to imply some degree of intentionality, or else I think “ends up” would have been a better word choice. The impetus for this point was Paul’s concern that someone would have used an AI to kill you to take your money. I think we can probably avoid the difficulty of a rigorous definition of intentionality, if we gesture vaguely at “the sort of intentionality required for that to be viable”? But let me know if more precision would be helpful, and I’ll try to figure out exactly what I mean. I certainly don’t think we need to make use of a version of intentionality that requires human-level reasoning.
Do you think something like IDA is the only plausible approach to alignment? If so, I hadn’t realized that, and I’d be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: “any agent (we make) that learns to act will be treacherous if treachery is possible.” Are all learning agents fundamentally out to get you? I suppose that’s a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn’t be recognized
No, but what are the approaches to avoiding deceptive alignment that don’t go through competitiveness?
I guess the obvious one is “don’t use ML,” and I agree that doesn’t require competitiveness.
Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.
No, but now we are starting to play the game of throttling the overseee (to avoid it overpowering the overseer) and it’s not clear how this is going to work and be stable. It currently seems like the only appealing approach to getting stability there is to ensure the overseer is competitive.
No, but what are the approaches to avoiding deceptive alignment that don’t go through competitiveness?
We could talk for a while about this. But I’m not sure how much hangs on this point if I’m right, since you offered this as an extra reason to care about competitiveness, but there’s still the obvious reason to value competitiveness. And idea space is big, so you would have your work cut out to turn this from an epistemic landscape where two people can reasonably have different intuitions to an epistemic landscape that would cast serious doubt on my side.
But here’s one idea: have the AI show the operator messages that cause them to do better on randomly selected prediction tasks. The operator’s prediction depends on the message, obviously, but the ground truth is the counterfactual ground truth if the message had never been shown, so the AI’s message can’t affect the ground truth.
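A toy numeric sketch of that incentive structure (all names, and the rule by which the operator combines their prior with the message, are illustrative assumptions, not anything from the discussion):

```python
class Task:
    """Toy prediction task with a hidden true answer."""
    def __init__(self, hidden_value):
        self.hidden_value = hidden_value

    def counterfactual_truth(self):
        # The answer the world would have produced had no message been shown;
        # by construction, the AI's message cannot move this quantity.
        return self.hidden_value

def oracle_score(message_value, operator_prior, task):
    # The operator's prediction does depend on the message...
    prediction = (operator_prior + message_value) / 2
    # ...but the AI is graded against the counterfactual ground truth.
    return -abs(prediction - task.counterfactual_truth())

task = Task(hidden_value=10.0)
honest_score = oracle_score(message_value=10.0, operator_prior=6.0, task=task)
misleading_score = oracle_score(message_value=0.0, operator_prior=6.0, task=task)
# Honest advice scores strictly better here: -2.0 vs -7.0.
```

The point of the construction is visible in the last comparison: since the scoring target is fixed in the no-message counterfactual, the AI gains nothing by steering the world, and its best move is to report honestly.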
And then more broadly, impact measures, conservatism, or utility information about counterfactuals to complicate wireheading, seem at least somewhat viable to me, and then you could have an agent that does more than show us text that’s only useful if it’s true. In my view, this approach is way more difficult to get safe, but if I had the position that we needed parity in competitiveness with unsafe competitors in order to use a chatbot to save the world, then I’d start to find these other approaches more appealing.
This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn’t kill you (and helps you achieve your other goals). But it seems you’re saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.
I’m saying that if you can’t protect yourself from an AI in your lab, under conditions that you carefully control, you probably couldn’t protect yourself from AI systems out there in the world.
The hope is that you can protect yourself from an AI in your lab.
But your original comment was referring to a situation in which we didn’t carefully control the AI in our lab (by letting it have an arbitrarily long horizon). If we have lead time on other projects, I think it’s very plausible to have a situation where we couldn’t protect ourselves from our own AI if we weren’t carefully controlling the conditions, but we could protect ourselves from our own AI if we were carefully controlling the situation, and then given our lead time, we’re not at a big risk from other projects yet.
The way I map these concepts, this feels like an elision to me. I understand what you’re saying, but I would like to have a term for “this AI isn’t trying to kill me”, and I think “safe” is a good one. That’s the relevant sense of “safe” when I say “if it’s safe, we can try it out and tinker”. So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.
I mean that we don’t have any process that looks like debate that could produce an agent that wasn’t trying to kill you without being competitive, because debate relies on using aligned agents to guide the training process (and if they aren’t competitive then the agent-being-trained will, at least in the limit, converge to an equilibrium where it kills you).
I mean that we don’t have any process that looks like debate that could produce an agent that wasn’t trying to kill you without being competitive
It took me an embarrassingly long time to parse this. I think it says: any debate-trained agent that isn’t competitive will try to kill you. But I think the next clause clarifies that any debate-trained agent whose competitor isn’t competitive will try to kill you. This may be moot if I’m getting that wrong.
So I guess you’re imagining running Debate with horizons that are long enough that, in the absence of a competitor, the remaining debater would try to kill you. It seems to me that you put more faith in the mechanism that I was saying didn’t comfort me. I had just claimed that a single-agent chatbot system with a long enough horizon would try to take over the world:
The existence of an adversary may make it harder for a debater to trick the operator, but if they’re both trying to push the operator in dangerous directions, I’m not very comforted by this effect. The probability that the operator ends up trusting one of them doesn’t seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.
Running a debate between two entities that would both kill me if they could get away with it seems critically dangerous.
Suppose two equally matched people are trying to shoot a basket from opposite ends of the 3-point line, before their opponent makes a basket. Each time they shoot, the two basketballs collide above the hoop and bounce off of each other, hopefully. Making the basket first = taking over the world and killing us on their terms. My view is that if they’re both trying to make a basket, a basket being made is a more likely outcome than a basket not being made (if it’s not too difficult for them to make the proverbial basket).
Side comment: so I think the existential risk is quite high in this setting, but I certainly don’t think the existential risk is so low that there’s little existential risk left to reduce with the boxing-the-moderator strategy. (I don’t know if you’d have disputed that, but I’ve had conversations with others who did, so this seems like a good place to put this comment.)