OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—”sordid stumbles.” If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
It’s been a while since I thought about this, but going back to the beginning of this thread:
“It’s unlikely you’ll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it’s deployed it’s likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul’s “cascading failures”).”
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don’t find your argument here compelling.
I think the first paragraph (Evan’s) is basically right, and the second two paragraphs (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy). I think that this distinction between “strong” warning shots and “weak” warning shots is important because I think that “weak” warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas “strong” warning shots would provoke a large increase in caution. I agree that we’ll probably get various “weak” warning shots, but I think this doesn’t change the overall picture much because it won’t provoke a major increase in caution on the part of human institutions etc.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I’d rephrase as
it would actually provoke a major increase in caution (assuming we weren’t already being very cautious)
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn’t a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn’t involve taking over the world. By default, in the case of deception, my expectation is that we won’t get a warning shot at all—though I’d more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitiative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we’ll have warning shots of this kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we conditionalize on slow takeoff.
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?
No, I don’t agree with that.
Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we’ll have warning shots of this kind.
One problem here is that my credences on warning shots are going to be somewhat lower just because I think there’s some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don’t get warning shots.
So I think for each type of warning shot I’m going to do a weird operation where I condition on something like “by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice”.
I’m also going to assume no discontinuity, since that’s the situation we seem to disagree about.
Then, some warning shots we could have:
Minor, leads to result “well of course that happened” without much increase in caution: has already happened
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with “a straight face”, except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting by when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says “sorry, I don’t know how to use when2meet” in order to get this to happen, but it “could” use when2meet if it “wanted” to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.
Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money
Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn’t endorse it running a Ponzi scheme)
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol
Major, lots of damage, would be huge news: 60%
An AI system blows up an “enemy building”; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)
“Strong”, tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this “warning shot” doesn’t help.
I mostly assign 20% on this as “idk, seems unlikely, but I can’t rule it out, and predicting the future is hard so don’t assign an extreme value here”
OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—”sordid stumbles.” If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
It’s been a while since I thought about this, but going back to the beginning of this thread:
I think the first paragraph (Evan’s) is basically right, and the second two paragraphs (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy). I think that this distinction between “strong” warning shots and “weak” warning shots is important because I think that “weak” warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas “strong” warning shots would provoke a large increase in caution. I agree that we’ll probably get various “weak” warning shots, but I think this doesn’t change the overall picture much because it won’t provoke a major increase in caution on the part of human institutions etc.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I’d rephrase as
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn’t a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn’t involve taking over the world. By default, in the case of deception, my expectation is that we won’t get a warning shot at all—though I’d more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
Well then, would you agree that Evan’s position here:
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitiative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we’ll have warning shots of this kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we conditionalize on slow takeoff.
No, I don’t agree with that.
One problem here is that my credences on warning shots are going to be somewhat lower just because I think there’s some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don’t get warning shots.
So I think for each type of warning shot I’m going to do a weird operation where I condition on something like “by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice”.
I’m also going to assume no discontinuity, since that’s the situation we seem to disagree about.
Then, some warning shots we could have:
Minor, leads to result “well of course that happened” without much increase in caution: has already happened
Reward gaming: Faulty reward functions in the wild
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with “a straight face”, except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting by when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says “sorry, I don’t know how to use when2meet” in order to get this to happen, but it “could” use when2meet if it “wanted” to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.
Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money
Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn’t endorse it running a Ponzi scheme)
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol
Major, lots of damage, would be huge news: 60%
An AI system blows up an “enemy building”; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)
“Strong”, tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this “warning shot” doesn’t help.
I mostly assign 20% on this as “idk, seems unlikely, but I can’t rule it out, and predicting the future is hard so don’t assign an extreme value here”