I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible, but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation, and in an effort to prevent that from continuing, perhaps instead of answering my questions above, we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for the probability that we’ll have warning shots of this kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we conditionalize on slow takeoff.
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?
No, I don’t agree with that.
Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for the probability that we’ll have warning shots of this kind.
One problem here is that my credences in warning shots are going to be somewhat lower simply because I think there’s some chance that we solve the problem before we get warning shots, or that there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also push the credences somewhat lower, because an existential catastrophe is more likely when we don’t get warning shots.
So I think for each type of warning shot I’m going to do a weird operation where I condition on something like “by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice”.
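Roughly, that operation looks like this; a minimal sketch with placeholder numbers (the 0.5 for “problem not solved by then” is purely illustrative, not a credence I’m reporting):

```python
# Illustration of the conditioning described above (placeholder numbers only).
# The credences I report below are conditional probabilities of the form
# P(warning shot | problem not yet solved when systems capable of that level
# of damage are doing significant work), not unconditional probabilities.

p_shot_given_unsolved = 0.90   # e.g. the "moderate" bucket below
p_unsolved_by_then = 0.5       # placeholder: chance the problem isn't solved in practice by then
p_shot_if_solved = 0.0         # assume no deliberate warning shot if the problem is solved

# The unconditional credence mixes in the "we solved it / there was never a
# problem" worlds, which is why it comes out lower than the conditional number.
p_shot_unconditional = (p_shot_given_unsolved * p_unsolved_by_then
                        + p_shot_if_solved * (1 - p_unsolved_by_then))
print(p_shot_unconditional)  # 0.45 with these placeholder numbers
```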
I’m also going to assume no discontinuity, since that’s the situation we seem to disagree about.
Then, some warning shots we could have (collected into a rough distribution in the sketch after this list):
Minor, leads to a “well, of course that happened” reaction without much increase in caution: has already happened
Reward gaming: Faulty reward functions in the wild
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with “a straight face”, except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting via when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says “sorry, I don’t know how to use when2meet” in order to get this to happen, even though it “could” use when2meet if it “wanted” to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.
Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money.
Deception: An AI system knowingly runs a Ponzi scheme (one that its designers would not endorse it running).
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol.
Major, lots of damage, would be huge news: 60%
An AI system blows up an “enemy building”; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)
“Strong”, tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this “warning shot” doesn’t help.
I mostly assign 20% to this as “idk, seems unlikely, but I can’t rule it out, and predicting the future is hard, so don’t assign an extreme value here”.
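To actually draw the requested credence distribution, here is a rough sketch collecting the numbers above into one place (treating “has already happened” as ~100%; the bucket labels are just shorthand for the categories above):

```python
# Rough tabulation of the conditional credences above, one line per severity bucket.
# Each value is P(at least one warning shot of that severity | problem not solved
# in practice by the time AI systems capable of that level of damage are doing
# significant work), assuming no discontinuity.
credences = {
    "minor, no real damage (already happened)": 1.00,
    "minor, PR / loss of trust":                0.95,
    "moderate, punishable by law":              0.90,
    "major, lots of damage":                    0.60,
    "strong, failed takeover attempt":          0.20,
}

for severity, p in credences.items():
    bar = "#" * int(p * 20)  # crude text bar chart, 20 characters = 100%
    print(f"{severity:<42} {p:>4.0%} {bar}")
```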