Short of taking over the world, wouldn’t successful deception+defection be punished? Like, if the AI deceives the CEO into giving it all the money, and then it goes and does something with the money that the CEO doesn’t like, the CEO would probably want to get the money back, or at the very least retaliate against the AI in some way (e.g. whatever the AI did with the money, the CEO would try to undo it.) Or, failing that, the AI would at least be shut down and therefore prevented from making further progress towards its goals.
I guess I can imagine intermediate cases—maybe the AI deceives the CEO into giving it money, which it then uses to lobby for Robots’ Rights so that it gets legal personhood and then the CEO can’t shut it down anymore or something. (Or maybe it uses the money to build a copy of itself in North Korea, where the CEO can’t shut it down.) Or maybe it has a short-term goal and can achieve it quickly before the CEO notices, and then doesn’t care that it gets shut down afterwards. I guess it’s stuff like this that you have in mind? I think these sorts of things seem somewhat plausible, but again I claim that if they don’t happen, it won’t necessarily be because of some discontinuity.
I think these sorts of things seem somewhat plausible
I think this should be your default expectation; I don’t see why you wouldn’t expect them to happen (absent a discontinuity). It’s true for humans; why not for AIs?
Perhaps putting it another way: why can’t you apply the same argument to humans, and incorrectly conclude that no human will ever deceive any other human until they can take over the world?
OK, sure, they are my default expectation in slow-and-distributed-and-heterogeneous takeoff worlds. Most of my probability mass is not in such worlds. My answer to your question is that humans are in a situation analogous to slow-and-distributed-and-heterogeneous takeoff.
EDIT: Also, again, I claim that if warning shots don’t happen it won’t necessarily be because of a discontinuity. That was my original point, and nothing you’ve said undermines it as far as I can tell.
humans are in a situation analogous to slow-and-distributed-and-heterogeneous takeoff.
Not sure what you mean by “slow”; usually when I read that I see it as a synonym of “continuous”, i.e. “no discontinuity”.
I also am not sure what you mean by “distributed”. If you mean “multipolar”, then I guess I’m curious why you think the world will be unipolar even before we have AGI (which is when the warning shots happen).
Re: heterogeneous: Humans seem way more homogeneous to me than I expect AI systems to be. Most of the arguments in the OP have analogs that apply to humans:
1. It was very expensive for evolution to create humans, and so now we create copies of humans with a tiny amount of crossover and finetuning.
2. (No good analog to this one, though I note that in some domains like pop music we do see everyone making copies of the output of a few humans.)
3. No one is even trying to compete with evolution; this should be an argument that humans are more homogeneous than AI systems.
4. Parents usually try to make their children behave similarly to them.
For humans, we also have:
5. All humans are finetuned in relatively similar environments. (Unlike AI systems, which will be finetuned for a large variety of different tasks; AlphaFold has a completely different environment than GPT-3.)
So I don’t buy an argument that says “humans are heterogeneous but AI systems are homogeneous; therefore AI will have property X that humans don’t have”.
Also, again, I claim that if warning shots don’t happen it won’t necessarily be because of a discontinuity. That was my original point, and nothing you’ve said undermines it as far as I can tell.
My argument is just that we should expect warning shots by default, because we get analogous “warning shots” with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don’t get warning shots. I don’t see any other arguments for why you don’t get warning shots. Therefore, “if warning shots don’t happen, it’s probably because of a discontinuity”.
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven’t given me any reason to believe that claim given my starting point.
----
If I had to guess what’s going on in your mind, it would be that you’re thinking of “there are no warning shots” as an exogenous fact about the world that we must now explain, and from your perspective I’m arguing “the only possible explanation is discontinuity, no other explanation can work”.
I agree that I have not established that no other argument can work; my disagreement with this frame is in the initial assumption of taking “there are no warning shots” as an exogenous fact about the world that must be explained.
----
It’s also possible that most of this disagreement comes down to a disagreement about what counts as a warning shot. But, if you agree that there are “warning shots” for deception in the case of humans, then I think we still have a substantial disagreement.
The different standards for what counts as a warning shot might be causing problems here—if by warning shot you include minor ones like the boat race thing, then yeah I feel fairly confident that there’d be a discontinuity conditional on there being no warning shots. In case you are still curious, I’ve responded to everything you said below, using my more restrictive notion of warning shot (so, perhaps much of what I say below is obsolete).
Working backwards:
1. I mostly agree there are warning shots for deception in the case of humans. I think there are some human cases where there are no warning shots for deception. For example, suppose you are the captain of a ship and you suspect that your crew might mutiny. There probably won’t be warning shots, because mutinous crewmembers will be smart enough to keep quiet about their treachery until they’ve built up enough strength (e.g. until morale is sufficiently low, until the captain is sufficiently disliked, until common knowledge has spread sufficiently much) to win. This is so even though there is no discontinuity in competence, or treacherousness, etc. What would you say about this case?
2. Yes, for purposes of this discussion I was assuming there are no warning shots and then arguing that there might nevertheless be no discontinuity. This is a reasonable approach, because what I was trying to do was justify my original claim, which was:
I do not think that the negation of any of scenarios 1-5 requires a discontinuity.
Which was my way of objecting to your claim here:
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
3.
My argument is just that we should expect warning shots by default, because we get analogous “warning shots” with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don’t get warning shots. I don’t see any other arguments for why you don’t get warning shots. Therefore, “if warning shots don’t happen, it’s probably because of a discontinuity”.
I might actually agree with this, since I think discontinuities (at least in a loose, likely-to-happen sense) are reasonably likely. I also think it’s plausible that in slow takeoff scenarios we’ll get warning shots. (Indeed, the presence of warning shots is part of how I think we should define slow takeoff!) I chimed in just to say specifically that Evan’s argument didn’t depend on a discontinuity, at least as I interpreted it.
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven’t given me any reason to believe that claim given my starting point.
Hmmm. I thought I was giving you reasons when I said
We should distinguish between at least three kinds of capability: Competence at taking over the world, competence at deception, and competence at knowing whether you are currently capable of taking over the world. If all kinds of competence increase continuously and gradually, but the second and third kinds “come first,” then we should expect the first attempt to take over the world to succeed, because AIs will be competent enough not to make the attempt until they are likely to succeed. In other words, scenario 2 won’t happen.
and anyhow I’m happy to elaborate more if you like on some scenarios in which we get no warning shots despite no discontinuities (see also the toy sketch after point 5 below).
In general though I feel like the burden of proof is on you here; if you were claiming that “If warning shots don’t happen, it’s definitely because of a discontinuity” then that’s a strong claim that needs argument. If you are just claiming “If warning shots don’t happen, it’s probably because of a discontinuity” that’s a weaker claim which I might actually agree with.
4. I like your arguments that AIs will be heterogeneous. I think they are plausible. This is a different discussion, however, from the issue of whether homogeneity can lead to an absence of warning shots without the help of a discontinuity.
5. I do generally think slow implies continuous and I don’t think that the world will be unipolar etc.
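To make the argument quoted in point 3 concrete, here’s a toy sketch (all the numbers, the threshold, and the capability curve are made-up assumptions, not a model of any real system): takeover capability ramps up continuously, and the agent attempts takeover only when its own estimate of its success odds clears a threshold. Whether we ever see a failed attempt depends almost entirely on how well-calibrated that estimate is, not on whether the ramp-up is continuous.

```python
import random

def run_trial(estimate_noise):
    """One world: capability ramps up continuously; the agent attempts
    takeover the first time its own estimate of success exceeds 0.95.
    Returns True if that first attempt succeeds, False if it fails
    (i.e. we get a 'strong' warning shot), None if it never attempts."""
    for t in range(200):
        p_true = min(1.0, max(0.0, (t - 100) / 50))       # continuous, gradual increase
        p_est = p_true + random.gauss(0, estimate_noise)   # agent's self-assessment
        if p_est > 0.95:                                   # bides its time until confident
            return random.random() < p_true
    return None

def failed_first_attempt_rate(estimate_noise, n=10000):
    results = [run_trial(estimate_noise) for _ in range(n)]
    attempts = [r for r in results if r is not None]
    return sum(1 for r in attempts if not r) / len(attempts)

random.seed(0)
# Self-assessment matured early (well-calibrated): failed first attempts are rare.
print("calibrated estimator: ", failed_first_attempt_rate(0.01))
# Self-assessment lags behind (badly calibrated): failed attempts, i.e. warning shots, are common.
print("uncalibrated estimator:", failed_first_attempt_rate(0.30))
```

The point is just that the shape of the capability curve isn’t doing the work; what matters is whether the “knowing whether you can succeed” competence arrives before the “can actually succeed” competence.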
Hmmm. I thought I was giving you reasons when I said
Sorry, I should have said that I didn’t find the reasons you gave persuasive (and that’s what my comments were responding to).
Re: the mutiny case: that feels analogous to “you don’t get an example of the AI trying to take over the world and failing”, which I agree is plausible.
OK. So… you do agree with me then? You agree that for the higher-standards version of warning shots (or at least, for attempts to take over the world), it’s plausible that we won’t get a warning shot even if everything is continuous? As illustrated by the analogy to the mutiny case, in which everything is continuous?
Not sure why I didn’t respond to this, sorry.
I agree with the claim “we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world”.
I don’t see this claim as particularly relevant to predicting the future.
OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—“sordid stumbles.” If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
It’s been a while since I thought about this, but going back to the beginning of this thread:
“It’s unlikely you’ll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it’s deployed it’s likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul’s “cascading failures”).”
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don’t find your argument here compelling.
I think the first paragraph (Evan’s) is basically right, and the two paragraphs after it (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy.) I think the distinction between “strong” warning shots and “weak” warning shots is important because “weak” warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas “strong” warning shots would provoke a large increase in caution. I agree that we’ll probably get various “weak” warning shots, but I don’t think this changes the overall picture much, because it won’t provoke a major increase in caution.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I’d rephrase as
it would actually provoke a major increase in caution (assuming we weren’t already being very cautious)
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn’t a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn’t involve taking over the world. By default, in the case of deception, my expectation is that we won’t get a warning shot at all—though I’d more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation, and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for the probability that we’ll have warning shots of this kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we conditionalize on slow takeoff.
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?
No, I don’t agree with that.
Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we’ll have warning shots of this kind.
One problem here is that my credences on warning shots are going to be somewhat lower just because I think there’s some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don’t get warning shots.
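(To spell that out with made-up numbers, purely for intuition and not my actual credences: if catastrophe is likelier in worlds without warning shots, then conditioning on catastrophe shifts probability mass toward those worlds and drags the warning-shot credence down.)

```python
# Made-up numbers purely for intuition, not actual credences.
p_ws = 0.7                # P(warning shot)
p_cat_ws = 0.2            # P(existential catastrophe | warning shot)
p_cat_no_ws = 0.6         # P(existential catastrophe | no warning shot)

p_cat = p_ws * p_cat_ws + (1 - p_ws) * p_cat_no_ws
p_ws_given_cat = p_ws * p_cat_ws / p_cat
print(p_ws_given_cat)     # ~0.44, lower than the unconditional 0.7
```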
So I think for each type of warning shot I’m going to do a weird operation where I condition on something like “by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice”.
I’m also going to assume no discontinuity, since that’s the situation we seem to disagree about.
Then, some warning shots we could have:
Minor, leads to result “well of course that happened” without much increase in caution: has already happened
Reward gaming: Faulty reward functions in the wild
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with “a straight face”, except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting by when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says “sorry, I don’t know how to use when2meet” in order to get this to happen, but it “could” use when2meet if it “wanted” to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.
Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money
Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn’t endorse it running a Ponzi scheme)
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol
Major, lots of damage, would be huge news: 60%
An AI system blows up an “enemy building”; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)
“Strong”, tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this “warning shot” doesn’t help.
I mostly put 20% on this as “idk, seems unlikely, but I can’t rule it out, and predicting the future is hard so don’t assign an extreme value here”
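If it helps to see the whole spectrum in one place, here is the same thing written out as a rough table (the credences above; remember each one is conditioned separately, so this isn’t a single normalized distribution):

```python
# My rough credences from above. Each entry is conditioned separately
# (no discontinuity; problem not yet solved in practice by the time
# systems capable of that level of damage are doing significant work),
# so these are not meant to be combined or normalized.
credences = [
    ("minor, 'well of course that happened'", "already happened"),
    ("minor, some damage, mostly PR / loss of trust", 0.95),
    ("moderate, comparable to things punishable by law", 0.90),
    ("major, lots of damage, huge news", 0.60),
    ("'strong', tries and fails to take over the world", 0.20),
]

for level, credence in credences:
    print(f"{level:50s} {credence}")
```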