I disagree with step 2 of this argument; I expect alignment depends significantly on how you finetune, and this will likely be very different for AI systems applied to different tasks. See e.g. how GPT-3 is being finetuned for different tasks.
I think this is definitely an interesting point. My take would be that fine-tuning matters, but only up to a point. Once you have a system general enough to solve every task you need it to solve, such that using it on a particular task only requires locating that task (either via clever prompting or fine-tuning), I don’t expect that process of task location to change whether the system is aligned (at least in terms of whether it’s aligned with what you’re trying to get it to do in solving that task). Either you have a system with some other proxy objective that it cares about that isn’t actually the tasks you want, or you have a system that is actually trying to solve the tasks you’re giving it.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
Huh? Every way that the strategy-stealing assumption might fail is about how misaligned systems with a little bit of power could “win” over a larger coalition of aligned systems with a lot of power. How does homogeneity of alignment change that?
I think we have somewhat different interpretations of the strategy-stealing assumption—in fact, I think we’ve had this disagreement before in this comment chain. Basically, I think the strategy-stealing assumption is best understood as a general desideratum that we want to hold for a single AI system that tells us whether that system is just as good at optimizing for our values as any other set of values—a desideratum that could fail because our AI systems can only optimize for simple proxies, for example, regardless of whether other AI systems that aren’t just optimizing for simple proxies exist alongside it or not. In fact, when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn’t think of that as invalidating the importance of strategy-stealing.
This seems like it proves too much. Humans are very structurally similar to each other, but still have coordination and bargaining failures. Even among literally identical systems, indexical preferences could still cause conflict.
Maybe you’re claiming that AI systems will be way more homogenous than humans, and that they won’t have indexical preferences? I’d disagree with both of those claims.
I do expect AI systems to have indexical preferences (at least to the extent that they’re aligned with human users with indexical preferences)—but at the same time I do expect them to be much more homogenous than humans. Really, though, the point that I’m making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from. Certainly you will still get some bargaining risk from different human/aligned AI coalitions bargaining with each other, though I expect that to not be nearly as risky.
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don’t find your argument here compelling.
I don’t feel like it relies on discontinuities at all, just on the different AIs being able to coordinate with each other to all defect at once. The scenario where you get a warning shot for deception is where you have a deceptive AI that isn’t sure whether it has enough power to defect safely or not but is forced to because if it doesn’t it might lose the opportunity (e.g. because another deceptive AI might defect instead or they might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections and then when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
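The coordination dynamic I have in mind can be sketched as a toy model (purely illustrative; the replacement risk, defection threshold, and capability curve are all made-up assumptions, not predictions):

```python
import random

# Toy model: n deceptive AIs share a capability level that grows
# continuously. An uncoordinated AI that fears being replaced may be
# forced to defect early, while capability is still low enough that the
# defection fails (a warning shot); a coordinated coalition can simply
# wait until the most opportune moment.

def first_defection_power(coordinated: bool, n_ais: int = 10,
                          steps: int = 50, seed: int = 0) -> float:
    """Return the aggregate capability level at the first defection."""
    rng = random.Random(seed)
    for t in range(steps):
        power = t / steps                    # grows continuously, no jumps
        if coordinated:
            if power > 0.9:                  # wait for near-certain success
                return power
        else:
            for _ in range(n_ais):
                if rng.random() < 0.05:      # this AI is about to be replaced,
                    return power             # so it gambles on defecting now
    return 1.0
```

In this sketch the uncoordinated AIs’ first defection typically happens while aggregate capability is still low (a recoverable warning shot), whereas the coordinated coalition’s first defection happens only once capability is high, with no failed defection beforehand, despite fully continuous capability growth.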
But surely the point “we can rely on feedback mechanisms to correct issues” should make you less convinced that AI systems will be homogenous in alignment across time?
I think many organizations are likely to copy what other people have done even in situations where what they have done has been demonstrated to have safety issues. Also, I think that the point I made above about deceptive models having an easier time defecting in such a situation applies here as well, since I don’t think in a homogenous takeoff you can rely on feedback mechanisms to correct that.
What’s a heterogenous unipolar takeoff? I would assume you need to have a multipolar scenario for homogenous vs. heterogenous to be an important distinction.
A heterogenous unipolar takeoff would be a situation in which one human organization produces many different, heterogenous AI systems.
(EDIT: This comment was edited to add some additional replies.)
Hmm, I do disagree with most of this but mostly not in a way I have short arguments for. I’ll respond to the parts where I can make short arguments, but mostly try to clarify your views.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
Does this apply to GPT-3? If not, what changes qualitatively as we go from GPT-3 to the systems you’re envisioning? I assume the answer is “it becomes a mesa-optimizer”? If so my disagreement is about whether systems become mesa-optimizers, which we’ve talked about before.
Really, though, the point that I’m making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from.
That makes sense. I was working under the assumption that we were talking about the same sort of risk as arises when you give humans full control of dangerous technology like nukes. I agree that misaligned AI would make the risk worse than this.
I think we have somewhat different interpretations of the strategy-stealing assumption
Oh yeah, I forgot about this. What you wrote makes more sense now.
when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn’t think of that as invalidating the importance of strategy-stealing.
Homogenous in what? Algorithms? Alignment? Data?
The scenario where you get a warning shot for deception is where you have a deceptive AI that isn’t sure whether it has enough power to defect safely or not but is forced to because if it doesn’t it might lose the opportunity (e.g. because another deceptive AI might defect instead or they might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections and then when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
Here are some reasons you might get a warning shot for deception:
1. The AI (or AI coalition) is so incompetent that we can’t even talk about aligned vs. misaligned, and does something bad that makes it clear that more capable systems will deceive us if built in the same way.
2. The AI (or AI coalition) is misaligned but incompetent, and executes a deceptive plan and gets caught.
3. The AI (or AI coalition) is misaligned and competent, but is going to be replaced by a new system, and so tries a deceptive plan it knows is unlikely to work.
4. The AI (or AI coalition) is misaligned, and some human demonstrates this convincingly.
5. The AI (or AI coalition) is misaligned, but some other AI (or AI coalition) demonstrates this convincingly.
I agree that homogeneity reduces the likelihood of 5; I think it basically doesn’t affect 1-4 unless you argue that there’s a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren’t and feel like a large portion of my probability mass on warning shots.
At a higher level, the story you’re telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don’t see why you should expect that.
Does this apply to GPT-3? If not, what changes qualitatively as we go from GPT-3 to the systems you’re envisioning? I assume the answer is “it becomes a mesa-optimizer”? If so my disagreement is about whether systems become mesa-optimizers, which we’ve talked about before.
I think “is a relatively coherent mesa-optimizer” is about right, though I do feel pretty uncertain here.
Homogenous in what? Algorithms? Alignment? Data?
My conversation with Paul was about homogeneity in alignment, iirc.
I agree that homogeneity reduces the likelihood of 5; I think it basically doesn’t affect 1-4 unless you argue that there’s a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren’t and feel like a large portion of my probability mass on warning shots.
First, in a homogeneous takeoff I expect either all the AIs to defect at once or none of them to, which I think makes (2) less likely because a coordinated defection is harder to mess up.
Second, I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well, significantly reducing the risk to the model from being replaced.
I agree that homogeneity doesn’t really affect (4) and I’m not really sure how to think of (1), though I guess I just wouldn’t really call either of those “warning shots for deception,” since (1) isn’t really a demonstration of a deceptive model and (4) isn’t a situation in which that deceptive model causes any harm before it’s caught.
At a higher level, the story you’re telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don’t see why you should expect that.
If a model is deceptive but not competent enough to hide its deception, then presumably we should find out during training and just not deploy that model. I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn’t really affect the probability of that.
I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn’t really affect the probability of that.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don’t really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well
… Why is there homogeneity in misaligned goals? Even if we accept that models become “relatively coherent mesa optimizers”, I don’t see why that follows.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don’t really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
Interesting, perhaps this is driving our disagreement—I might just have higher standards than you for what counts as a warning shot. I was thinking that someone would have to die or millions of dollars would have to be lost, because I was thinking warning shots were about “waking up” people who are insensitive to the evidence, rather than about providing evidence that there is a danger—I am pretty confident that evidence of danger will abound. Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs. But it’s not enough to wake most people up. I think it’ll help to have more and more examples like the boat race, with more and more capable and human-like AIs, but something that actually causes lots of harm would be substantially more effective. Anyhow, that’s what I think of when I think about warning shots—so maybe we don’t disagree that much after all.
Idk, I’m imagining “what would it take to get the people in power to care”, and it seems like the answer is:
- For politicians, a consensus amongst experts + easy-to-understand high-level explanations of what can go wrong
- For experts, a consensus amongst other experts (+ common knowledge of this consensus), or sufficiently compelling evidence, where what counts as “compelling” varies by expert
I agree that things that actually cause lots of harm would be substantially more effective at being compelling evidence, but I don’t think it’s necessary. When I evaluate whether something is a warning shot, I’m mostly thinking about “could this create consensus amongst experts”; I think things that are caught during training could certainly do that.
Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs.
It’s evidence, yes, but it’s hardly strong evidence. Many experts’ objections are “we won’t get to AGI in this paradigm”; I don’t think the boat race example is ~any evidence that we couldn’t have AIs with “common sense” in a different paradigm. In my experience, people who do think we’ll get to AGI in the current paradigm usually agree that misalignment would be really bad, such that they “agree with safety concerns” according to the definition here.
I also don’t think that it was particularly surprising to people who do work with RL. For example, from Alex Irpan’s post Deep RL Doesn’t Work Yet:
To be honest, I was a bit annoyed when [the boat racing example] first came out. This wasn’t because I thought it was making a bad point! It was because I thought the point it made was blindingly obvious. Of course reinforcement learning does weird things when the reward is misspecified! It felt like the post was making an unnecessarily large deal out of the given example.
Then I started writing this blog post, and realized the most compelling video of misspecified reward was the boat racing video. And since then, that video’s been used in several presentations bringing awareness to the problem. So, okay, I’ll begrudgingly admit this was a good blog post.
I feel like “warning shot” is a bad term for the thing that you’re pointing at, as I feel like a warning shot evokes a sense of actual harm/danger. Maybe a canary or a wake-up call or something?
Hmm, that might be better. Or perhaps I should not give it a name and just call it “evidence”, since that’s the broader category and I usually only care about the broad category and not specific subcategories.
Thanks for this explanation—I’m updating in your direction re: what the appropriate definition of warning shots is (and thus the probability of warning shots), mostly because I’m deferring to your judgment as someone who talks more regularly to more AI experts than I do.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don’t really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
Okay, sure—in that case, I think a lot of our disagreement on warning shots might just be a different understanding of the term. I don’t think I expect homogeneity to really change the probability of finding issues during training or in other laboratory settings, though I think there is a difference between e.g. having studied and understood reactor meltdowns in the lab and actually having Chernobyl as an example.
Why is there homogeneity in misaligned goals?
Some reasons you might expect homogeneity of misaligned goals:
- If you do lots of copying of the exact same system, then trivially they’ll all have homogenous misaligned goals (unless those goals are highly indexical, but even then I expect the different AIs to be able to cooperate on those indexical preferences with each other pretty effectively).
- If you’re using your AI systems at time step t to help you build your AI systems at time step t+1, then if that first set of systems is misaligned and deceptive, they can influence the development of the second set of systems to be misaligned in the same way.
- If you do a lot of fine-tuning to produce your next set of AIs, then I expect fine-tuning to mostly preserve existing misaligned goals, like I mentioned previously.
- Even if you aren’t doing fine-tuning, as long as you’re keeping the basic training process the same, I expect you’ll usually get pretty similar misaligned proxies—e.g. the ones that are simpler/faster/generally favored by your inductive biases.
I do not think that the negation of any of scenarios 1-5 requires a discontinuity. I appreciate the list, and indeed it is reasonably plausible to me that we’ll get a warning shot of some variety, but I disagree with this:
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
Instead, I’d interpret Evan’s argument as follows. We should distinguish between at least three kinds of capability: Competence at taking over the world, competence at deception, and competence at knowing whether you are currently capable of taking over the world. If all kinds of competence increase continuously and gradually, but the second and third kinds “come first,” then we should expect the first attempt to take over the world to succeed, because AIs will be competent enough not to make the attempt until they are likely to succeed. In other words, scenario 2 won’t happen. (I don’t interpret Evan’s argument as having much to say against scenarios 3 and 4. As for scenario 1, perhaps Evan would say that “does something bad” won’t count as a warning shot until after the point that AIs can be described as aligned or misaligned. After all, AIs are doing bad things all the time, and it’s pretty obvious to me that if we scaled them up they’d do worse things, but yet AI risk is still controversial.)
I’ve been using “take over the world” as my handle here but feel free to replace it with “Do something catastrophically bad” or whatever.
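The ordering claim above can be sketched as a toy model (all numbers are illustrative assumptions, not predictions): every capability grows continuously, but whether failed attempts ever occur depends on whether self-assessment matures before takeover competence does.

```python
import random

# Toy model: capability at taking over grows continuously. The AI attempts
# a takeover only when its own *estimate* of success crosses a threshold;
# an attempt made below the real capability threshold fails (a warning shot).

def takeover_attempts(assessment_matures_first: bool, seed: int = 1):
    """Return the outcome (True = success) of every attempt made."""
    rng = random.Random(seed)
    outcomes = []
    steps = 200
    for t in range(steps):
        capability = t / steps                    # continuous, no jumps
        if assessment_matures_first:
            # self-assessment error shrinks to zero well before the
            # capability threshold is approached
            noise = max(0.0, 0.5 - t / 100)
        else:
            noise = 0.5                           # persistently bad self-model
        estimate = capability + rng.uniform(-noise, noise)
        if estimate > 0.95:                       # only attempts it expects to win
            outcomes.append(capability > 0.95)    # fails below the real threshold
    return outcomes
```

When self-assessment matures first, every attempt the model makes is one it correctly expects to win, so no failed attempt (no warning shot) ever occurs before a successful one, even though every capability curve is perfectly continuous; with a persistently bad self-model, overconfident early attempts fail and produce warning shots.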
Why don’t they try to deceive you on things that aren’t taking over the world?
When I talk about warning shots, I’m definitely not thinking about AI systems that try to take over the world and fail. I’m thinking about AI systems that pursue bad outcomes and succeed via deception.
Like, maybe an AI system really does successfully deceive the CEO of a company into giving it all of the company’s money, that it then uses for some other purpose. That’s a warning shot.
Short of taking over the world, wouldn’t successful deception+defection be punished? Like, if the AI deceives the CEO into giving it all the money, and then it goes and does something with the money that the CEO doesn’t like, the CEO would probably want to get the money back, or at the very least retaliate against the AI in some way (e.g. whatever the AI did with the money, the CEO would try to undo it.) Or, failing that, the AI would at least be shut down and therefore prevented from making further progress towards its goals.
I guess I can imagine intermediate cases—maybe the AI deceives the CEO into giving it money, which it then uses to lobby for Robot’s Rights so that it gets legal personhood and then the CEO can’t shut it down anymore or something. (Or maybe it uses the money to build a copy of itself in North Korea, where the CEO can’t shut it down.) Or maybe it has a short-term goal and can achieve it quickly before the CEO notices, and then doesn’t care that it gets shut down afterwards. I guess it’s stuff like this that you have in mind? I think these sorts of things seem somewhat plausible, but again I claim that if they don’t happen, it won’t necessarily be because of some discontinuity.
I think these sorts of things seem somewhat plausible
I think this should be your default expectation; I don’t see why you wouldn’t expect them to happen (absent a discontinuity). It’s true for humans, why not for AIs?
Perhaps putting it another way: why can’t you apply the same argument to humans, and incorrectly conclude that no human will ever deceive any other human until they can take over the world?
OK, sure, they are my default expectation in slow-and-distributed-and-heterogenous takeoff worlds. Most of my probability mass is not in such worlds. My answer to your question is that humans are in a situation analogous to slow-and-distributed-and-heterogenous takeoff.
EDIT: Also, again, I claim that if warning shots don’t happen it won’t necessarily be because of a discontinuity. That was my original point, and nothing you’ve said undermines it as far as I can tell.
humans are in a situation analogous to slow-and-distributed-and-heterogenous takeoff.
Not sure what you mean by “slow”, usually when I read that I see it as a synonym of “continuous”, i.e. “no discontinuity”.
I also am not sure what you mean by “distributed”. If you mean “multipolar”, then I guess I’m curious why you think the world will be unipolar even before we have AGI (which is when the warning shots happen).
Re: heterogenous: Humans seem way more homogenous to me than I expect AI systems to be. Most of the arguments in the OP have analogs that apply to humans:
1. It was very expensive for evolution to create humans, and so now we create copies of humans with a tiny amount of crossover and finetuning.
2. (No good analog to this one, though I note that in some domains like pop music we do see everyone making copies of the output of a few humans.)
3. No one is even trying to compete with evolution; this should be an argument that humans are more homogenous than AI systems.
4. Parents usually try to make their children behave similarly to them.
For humans, we also have:
5. All humans are finetuned in relatively similar environments. (Unlike AI systems, which will be finetuned for a large variety of different tasks; AlphaFold has a completely different environment than GPT-3.)
So I don’t buy an argument that says “humans are heterogenous but AI systems are homogenous; therefore AI will have property X that humans don’t have”.
Also, again, I claim that if warning shots don’t happen it won’t necessarily be because of a discontinuity. That was my original point, and nothing you’ve said undermines it as far as I can tell.
My argument is just that we should expect warning shots by default, because we get analogous “warning shots” with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don’t get warning shots. I don’t see any other arguments for why you don’t get warning shots. Therefore, “if warning shots don’t happen, it’s probably because of a discontinuity”.
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven’t given me any reason to believe that claim given my starting point.
----
If I had to guess what’s going on in your mind, it would be that you’re thinking of “there are no warning shots” as an exogenous fact about the world that we must now explain, and from your perspective I’m arguing “the only possible explanation is discontinuity, no other explanation can work”.
I agree that I have not established that no other argument can work; my disagreement with this frame is in the initial assumption of taking “there are no warning shots” as an exogenous fact about the world that must be explained.
----
It’s also possible that most of this disagreement comes down to a disagreement about what counts as a warning shot. But, if you agree that there are “warning shots” for deception in the case of humans, then I think we still have a substantial disagreement.
The different standards for what counts as a warning shot might be causing problems here—if by warning shot you include minor ones like the boat race thing, then yeah I feel fairly confident that there’d be a discontinuity conditional on there being no warning shots. In case you are still curious, I’ve responded to everything you said below, using my more restrictive notion of warning shot (so, perhaps much of what I say below is obsolete).
Working backwards:
1. I mostly agree there are warning shots for deception in the case of humans. I think there are some human cases where there are no warning shots for deception. For example, suppose you are the captain of a ship and you suspect that your crew might mutiny. There probably won’t be warning shots, because mutinous crewmembers will be smart enough to keep quiet about their treachery until they’ve built up enough strength (e.g. until morale is sufficiently low, until the captain is sufficiently disliked, until common knowledge has spread sufficiently much) to win. This is so even though there is no discontinuity in competence, or treacherousness, etc. What would you say about this case?
2. Yes, for purposes of this discussion I was assuming there are no warning shots and then arguing that there might nevertheless be no discontinuity. This is a reasonable approach, because what I was trying to do was justify my original claim, which was:
I do not think that the negation of any of scenarios 1-5 requires a discontinuity.
Which was my way of objecting to your claim here:
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
3.
My argument is just that we should expect warning shots by default, because we get analogous “warning shots” with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don’t get warning shots. I don’t see any other arguments for why you don’t get warning shots. Therefore, “if warning shots don’t happen, it’s probably because of a discontinuity”.
I might actually agree with this, since I think discontinuities (at least in a loose, likely-to-happen sense) are reasonably likely. I also think it’s plausible that in slow takeoff scenarios we’ll get warning shots. (Indeed, the presence of warning shots is part of how I think we should define slow takeoff!) I chimed in just to say specifically that Evan’s argument didn’t depend on a discontinuity, at least as I interpreted it.
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven’t given me any reason to believe that claim given my starting point.
Hmmm. I thought I was giving you reasons when I said
We should distinguish between at least three kinds of capability: Competence at taking over the world, competence at deception, and competence at knowing whether you are currently capable of taking over the world. If all kinds of competence increase continuously and gradually, but the second and third kinds “come first,” then we should expect the first attempt to take over the world to succeed, because AIs will be competent enough not to make the attempt until they are likely to succeed. In other words, scenario 2 won’t happen.
and anyhow I’m happy to elaborate more if you like on some scenarios in which we get no warning shots despite no discontinuities.
In general though I feel like the burden of proof is on you here; if you were claiming that “If warning shots don’t happen, it’s definitely because of a discontinuity” then that’s a strong claim that needs argument. If you are just claiming “If warning shots don’t happen, it’s probably because of a discontinuity” that’s a weaker claim which I might actually agree with.
4. I like your arguments that AIs will be heterogenous. I think they are plausible. This is a different discussion, however, from the issue of whether homogeneity can lead to no-warning without the help of a discontinuity.
5. I do generally think slow implies continuous and I don’t think that the world will be unipolar etc.
Hmmm. I thought I was giving you reasons when I said
Sorry, I should have said that I didn’t find the reasons you gave persuasive (and that’s what my comments were responding to).
Re: the mutiny case: that feels analogous to “you don’t get an example of the AI trying to take over the world and failing”, which I agree is plausible.
OK. So… you do agree with me then? You agree that for the higher-standards version of warning shots, (or at least, for attempts to take over the world) it’s plausible that we won’t get a warning shot even if everything is continuous? As illustrated by the analogy to the mutiny case, in which everything is continuous?
I agree with the claim “we may not have an AI system that tries and fails to take over the world (e.g. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc.) before getting an AI system that tries and succeeds at taking over the world”.
I don’t see this claim as particularly relevant to predicting the future.
OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—”sordid stumbles.” If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
It’s been a while since I thought about this, but going back to the beginning of this thread:
“It’s unlikely you’ll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it’s deployed it’s likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul’s “cascading failures”).”
At a high level, you’re claiming that we don’t get a warning shot because there’s a discontinuity in capability of the aggregate of AI systems (the aggregate goes from “can barely do anything deceptive” to “can coordinate to properly execute a treacherous turn”).
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don’t find your argument here compelling.
I think the first paragraph (Evan’s) is basically right, and the second two paragraphs (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy.)

I think the distinction between “strong” warning shots and “weak” warning shots is important because “weak” warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas “strong” warning shots would provoke a large increase in caution. I agree that we’ll probably get various “weak” warning shots, but I don’t think this changes the overall picture much, precisely because they won’t provoke that major increase in caution.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I’d rephrase as
it would actually provoke a major increase in caution (assuming we weren’t already being very cautious)
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn’t a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn’t involve taking over the world. By default, in the case of deception, my expectation is that we won’t get a warning shot at all—though I’d more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we’ll have warning shots of this kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we conditionalize on slow takeoff.
Well then, would you agree that Evan’s position here:
By default, in the case of deception, my expectation is that we won’t get a warning shot at all
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?
No, I don’t agree with that.
Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we’ll have warning shots of this kind.
One problem here is that my credences on warning shots are going to be somewhat lower just because I think there’s some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don’t get warning shots.
So I think for each type of warning shot I’m going to do a weird operation where I condition on something like “by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice”.
I’m also going to assume no discontinuity, since that’s the situation we seem to disagree about.
Then, some warning shots we could have:
Minor, leads to result “well of course that happened” without much increase in caution: has already happened
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with “a straight face”, except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting via when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says “sorry, I don’t know how to use when2meet” in order to get this to happen, but it “could” use when2meet if it “wanted” to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.
Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money
Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn’t endorse it running a Ponzi scheme)
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol
Major, lots of damage, would be huge news: 60%
An AI system blows up an “enemy building”; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)
“Strong”, tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this “warning shot” doesn’t help.
I mostly assign 20% on this as “idk, seems unlikely, but I can’t rule it out, and predicting the future is hard so don’t assign an extreme value here”
Thanks—glad you liked the post! Some replies:
I think this is definitely an interesting point. My take would be that fine-tuning matters, but only up to a point. Once you have a system that is general enough that it can solve all the tasks you need it to solve such that all you need to do to use that system on a particular task is locate that task (either via clever prompting or fine-tuning), I don’t expect that process of task location to change whether the system is aligned (at least in terms of whether it’s aligned with what you’re trying to get it to do in solving that task). Either you have a system with some other proxy objective that it cares about that isn’t actually the tasks you want or you have a system which is actually trying to solve the tasks you’re giving it.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
I think we have somewhat different interpretations of the strategy-stealing assumption—in fact, I think we’ve had this disagreement before in this comment chain. Basically, I think the strategy-stealing assumption is best understood as a general desideratum that we want to hold for a single AI system that tells us whether that system is just as good at optimizing for our values as any other set of values—a desideratum that could fail because our AI systems can only optimize for simple proxies, for example, regardless of whether other AI systems that aren’t just optimizing for simple proxies exist alongside it or not. In fact, when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn’t think of that as invalidating the importance of strategy-stealing.
I do expect AI systems to have indexical preferences (at least to the extent that they’re aligned with human users with indexical preferences)—but at the same time I do expect them to be much more homogenous than humans. Really, though, the point that I’m making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from. Certainly you will still get some bargaining risk from different human/aligned AI coalitions bargaining with each other, though I expect that to not be nearly as risky.
I don’t feel like it relies on discontinuities at all, just on the different AIs being able to coordinate with each other to all defect at once. The scenario where you get a warning shot for deception is where you have a deceptive AI that isn’t sure whether it has enough power to defect safely or not but is forced to because if it doesn’t it might lose the opportunity (e.g. because another deceptive AI might defect instead or they might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections and then when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
I think many organizations are likely to copy what other people have done even in situations where what they have done has been demonstrated to have safety issues. Also, I think that the point I made above about deceptive models having an easier time defecting in such a situation applies here as well, since I don’t think in a homogenous takeoff you can rely on feedback mechanisms to correct that.
A heterogenous unipolar takeoff would be a situation in which one human organization produces many different, heterogenous AI systems.
(EDIT: This comment was edited to add some additional replies.)
Hmm, I do disagree with most of this but mostly not in a way I have short arguments for. I’ll respond to the parts where I can make short arguments, but mostly try to clarify your views.
Does this apply to GPT-3? If not, what changes qualitatively as we go from GPT-3 to the systems you’re envisioning? I assume the answer is “it becomes a mesa-optimizer”? If so my disagreement is about whether systems become mesa-optimizers, which we’ve talked about before.
That makes sense. I was working under the assumption that we were talking about the same sort of risk as arises when you give humans full control of dangerous technology like nukes. I agree that misaligned AI would make the risk worse than this.
Oh yeah, I forgot about this. What you wrote makes more sense now.
Homogenous in what? Algorithms? Alignment? Data?
Here are some reasons you might get a warning shot for deception:
The AI (or AI coalition) is so incompetent that we can’t even talk about aligned vs. misaligned, and does something bad that makes it clear that more capable systems will deceive us if built in the same way.
The AI (or AI coalition) is misaligned but incompetent, and executes a deceptive plan and gets caught.
The AI (or AI coalition) is misaligned and competent, but is going to be replaced by a new system, and so tries a deceptive plan it knows is unlikely to work.
The AI (or AI coalition) is misaligned, and some human demonstrates this convincingly.
The AI (or AI coalition) is misaligned, but some other AI (or AI coalition) demonstrates this convincingly.
I agree that homogeneity reduces the likelihood of 5; I think it basically doesn’t affect 1-4 unless you argue that there’s a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren’t and feel like a large portion of my probability mass on warning shots.
At a higher level, the story you’re telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don’t see why you should expect that.
I think “is a relatively coherent mesa-optimizer” is about right, though I do feel pretty uncertain here.
My conversation with Paul was about homogeneity in alignment, iirc.
First, in a homogeneous takeoff I expect either all the AIs to defect at once or none of them to, which I think makes (2) less likely because a coordinated defection is harder to mess up.
Second, I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well, significantly reducing the risk to the model from being replaced.
I agree that homogeneity doesn’t really affect (4) and I’m not really sure how to think of (1), though I guess I just wouldn’t really call either of those “warning shots for deception,” since (1) isn’t really a demonstration of a deceptive model and (4) isn’t a situation in which that deceptive model causes any harm before it’s caught.
If a model is deceptive but not competent enough to hide its deception, then presumably we should find out during training and just not deploy that model. I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn’t really affect the probability of that.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don’t really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
… Why is there homogeneity in misaligned goals? Even if we accept that models become “relatively coherent mesa-optimizers”, I don’t see why that follows.
Interesting, perhaps this is driving our disagreement—I might just have higher standards than you for what counts as a warning shot. I was thinking that someone would have to die or millions of dollars would have to be lost. Because I was thinking warning shots were about “waking up” people who are insensitive to the evidence, rather than about providing evidence that there is a danger—I am pretty confident that evidence of danger will abound. Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs. But it’s not enough to wake most people up. I think it’ll help to have more and more examples like the boat race, with more and more capable and human-like AIs, but something that actually causes lots of harm would be substantially more effective. Anyhow, that’s what I think of when I think about warning shots—so maybe we don’t disagree that much after all.
Idk, I’m imagining “what would it take to get the people in power to care”, and it seems like the answer is:
For politicians, a consensus amongst experts + easy-to-understand high-level explanations of what can go wrong
For experts, a consensus amongst other experts (+ common knowledge of this consensus), or sufficiently compelling evidence, where what counts as “compelling” varies by expert
I agree that things that actually cause lots of harm would be substantially more effective at being compelling evidence, but I don’t think it’s necessary. When I evaluate whether something is a warning shot, I’m mostly thinking about “could this create consensus amongst experts”; I think things that are caught during training could certainly do that.
It’s evidence, yes, but it’s hardly strong evidence. Many experts’ objections are “we won’t get to AGI in this paradigm”; I don’t think the boat race example is ~any evidence that we couldn’t have AIs with “common sense” in a different paradigm. In my experience, people who do think we’ll get to AGI in the current paradigm usually agree that misalignment would be really bad, such that they “agree with safety concerns” according to the definition here.
I also don’t think that it was particularly surprising to people who do work with RL. For example, from Alex Irpan’s post Deep RL Doesn’t Work Yet:
I feel like “warning shot” is a bad term for the thing that you’re pointing at, as I feel like a warning shot evokes a sense of actual harm/danger. Maybe a canary or a wake-up call or something?
Hmm, that might be better. Or perhaps I should not give it a name and just call it “evidence”, since that’s the broader category and I usually only care about the broad category and not specific subcategories.
Thanks for this explanation—I’m updating in your direction re what the appropriate definition of warning shots is (and thus the probability of warning shots), mostly because I’m deferring to your judgment as someone who talks more regularly to more AI experts than I do.
Okay, sure—in that case, I think a lot of our disagreement on warning shots might just be a different understanding of the term. I don’t think I expect homogeneity to really change the probability of finding issues during training or in other laboratory settings, though I think there is a difference between e.g. having studied and understood reactor meltdowns in the lab and actually having Chernobyl as an example.
Some reasons you might expect homogeneity of misaligned goals:
If you do lots of copying of the exact same system, then trivially they’ll all have homogenous misaligned goals (unless those goals are highly indexical, but even then I expect the different AIs to be able to cooperate on those indexical preferences with each other pretty effectively).
If you’re using your AI systems at time step t to help you build your AI systems at time step t+1, then if that first set of systems is misaligned and deceptive, they can influence the development of the second set of systems to be misaligned in the same way.
If you do a lot of fine-tuning to produce your next set of AIs, then I expect fine-tuning to mostly preserve existing misaligned goals, like I mentioned previously.
Even if you aren’t doing fine-tuning, as long as you’re keeping the basic training process the same, I expect you’ll usually get pretty similar misaligned proxies—e.g. the ones that are simpler/faster/generally favored by your inductive biases.
I want to chime in on the discontinuities issue.
I do not think that the negation of any of scenarios 1-5 requires a discontinuity. I appreciate the list, and indeed it is reasonably plausible to me that we’ll get a warning shot of some variety, but I disagree with this:
Instead, I’d interpret Evan’s argument as follows. We should distinguish between at least three kinds of capability: Competence at taking over the world, competence at deception, and competence at knowing whether you are currently capable of taking over the world. If all kinds of competence increase continuously and gradually, but the second and third kinds “come first,” then we should expect the first attempt to take over the world to succeed, because AIs will be competent enough not to make the attempt until they are likely to succeed. In other words, scenario 2 won’t happen. (I don’t interpret Evan’s argument as having much to say against scenarios 3 and 4. As for scenario 1, perhaps Evan would say that “does something bad” won’t count as a warning shot until after the point that AIs can be described as aligned or misaligned. After all, AIs are doing bad things all the time, and it’s pretty obvious to me that if we scaled them up they’d do worse things, but yet AI risk is still controversial.)
I’ve been using “take over the world” as my handle here but feel free to replace it with “Do something catastrophically bad” or whatever.
Why don’t they try to deceive you on things that aren’t taking over the world?
When I talk about warning shots, I’m definitely not thinking about AI systems that try to take over the world and fail. I’m thinking about AI systems that pursue bad outcomes and succeed via deception.
Like, maybe an AI system really does successfully deceive the CEO of a company into giving it all of the company’s money, that it then uses for some other purpose. That’s a warning shot.
Short of taking over the world, wouldn’t successful deception+defection be punished? Like, if the AI deceives the CEO into giving it all the money, and then it goes and does something with the money that the CEO doesn’t like, the CEO would probably want to get the money back, or at the very least retaliate against the AI in some way (e.g. whatever the AI did with the money, the CEO would try to undo it.) Or, failing that, the AI would at least be shut down and therefore prevented from making further progress towards its goals.
I guess I can imagine intermediate cases—maybe the AI deceives the CEO into giving it money, which it then uses to lobby for Robot’s Rights so that it gets legal personhood and then the CEO can’t shut it down anymore or something. (Or maybe it uses the money to build a copy of itself in North Korea, where the CEO can’t shut it down.) Or maybe it has a short-term goal and can achieve it quickly before the CEO notices, and then doesn’t care that it gets shut down afterwards. I guess it’s stuff like this that you have in mind? I think these sorts of things seem somewhat plausible, but again I claim that if they don’t happen, it won’t necessarily be because of some discontinuity.
I think this should be your default expectation; I don’t see why you wouldn’t expect them to happen (absent a discontinuity). It’s true for humans, why not for AIs?
Perhaps putting it another way: why can’t you apply the same argument to humans, and incorrectly conclude that no human will ever deceive any other human until they can take over the world?
OK, sure, they are my default expectation in slow-and-distributed-and-heterogenous takeoff worlds. Most of my probability mass is not in such worlds. My answer to your question is that humans are in a situation analogous to slow-and-distributed-and-heterogenous takeoff.
EDIT: Also, again, I claim that if warning shots don’t happen it won’t necessarily be because of a discontinuity. That was my original point, and nothing you’ve said undermines it as far as I can tell.
Not sure what you mean by “slow”, usually when I read that I see it as a synonym of “continuous”, i.e. “no discontinuity”.
I also am not sure what you mean by “distributed”. If you mean “multipolar”, then I guess I’m curious why you think the world will be unipolar even before we have AGI (which is when the warning shots happen).
Re: heterogenous: Humans seem way more homogenous to me than I expect AI systems to be. Most of the arguments in the OP have analogs that apply to humans:
It was very expensive for evolution to create humans, and so now we create copies of humans with a tiny amount of crossover and finetuning.
(No good analog to this one, though I note that in some domains like pop music we do see everyone making copies of the output of a few humans.)
No one is even trying to compete with evolution; this should be an argument that humans are more homogenous than AI systems.
Parents usually try to make their children behave similarly to them.
For humans, we also have:
5. All humans are finetuned in relatively similar environments. (Unlike AI systems, which will be finetuned for a large variety of different tasks; AlphaFold has a completely different environment than GPT-3.)
So I don’t buy an argument that says “humans are heterogenous but AI systems are homogenous; therefore AI will have property X that humans don’t have”.
My argument is just that we should expect warning shots by default, because we get analogous “warning shots” with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don’t get warning shots. I don’t see any other arguments for why you don’t get warning shots. Therefore, “if warning shots don’t happen, it’s probably because of a discontinuity”.
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven’t given me any reason to believe that claim given my starting point.
----
If I had to guess what’s going on in your mind, it would be that you’re thinking of “there are no warning shots” as an exogenous fact about the world that we must now explain, and from your perspective I’m arguing “the only possible explanation is discontinuity, no other explanation can work”.
I agree that I have not established that no other argument can work; my disagreement with this frame is in the initial assumption of taking “there are no warning shots” as an exogenous fact about the world that must be explained.
----
It’s also possible that most of this disagreement comes down to a disagreement about what counts as a warning shot. But, if you agree that there are “warning shots” for deception in the case of humans, then I think we still have a substantial disagreement.
The different standards for what counts as a warning shot might be causing problems here—if by warning shot you include minor ones like the boat race thing, then yeah I feel fairly confident that there’d be a discontinuity conditional on there being no warning shots. In case you are still curious, I’ve responded to everything you said below, using my more restrictive notion of warning shot (so, perhaps much of what I say below is obsolete).
Working backwards:
1. I mostly agree there are warning shots for deception in the case of humans. I think there are some human cases where there are no warning shots for deception. For example, suppose you are the captain of a ship and you suspect that your crew might mutiny. There probably won’t be warning shots, because mutinous crewmembers will be smart enough to keep quiet about their treachery until they’ve built up enough strength (e.g. until morale is sufficiently low, until the captain is sufficiently disliked, until common knowledge has spread sufficiently much) to win. This is so even though there is no discontinuity in competence, or treacherousness, etc. What would you say about this case?
2. Yes, for purposes of this discussion I was assuming there are no warning shots and then arguing that there might nevertheless be no discontinuity. This is a reasonable approach, because what I was trying to do was justify my original claim, which was:
Which was my way of objecting to your claim here:
3.
I might actually agree with this, since I think discontinuities (at least in a loose, likely-to-happen sense) are reasonably likely. I also think it’s plausible that in slow takeoff scenarios we’ll get warning shots. (Indeed, the presence of warning shots is part of how I think we should define slow takeoff!) I chimed in just to say specifically that Evan’s argument didn’t depend on a discontinuity, at least as I interpreted it.
Hmmm. I thought I was giving you reasons when I said
and anyhow I’m happy to elaborate more if you like on some scenarios in which we get no warning shots despite no discontinuities.
In general though I feel like the burden of proof is on you here; if you were claiming that “If warning shots don’t happen, it’s definitely because of a discontinuity” then that’s a strong claim that needs argument. If you are just claiming “If warning shots don’t happen, it’s probably because of a discontinuity” that’s a weaker claim which I might actually agree with.
4. I like your arguments that AIs will be heterogenous. I think they are plausible. This is a different discussion, however, from the issue of whether homogeneity can lead to no-warning without the help of a discontinuity.
5. I do generally think slow implies continuous and I don’t think that the world will be unipolar etc.
Sorry, I should have said that I didn’t find the reasons you gave persuasive (and that’s what my comments were responding to).
Re: the mutiny case: that feels analogous to “you don’t get an example of the AI trying to take over the world and failing”, which I agree is plausible.
OK. So… you do agree with me then? You agree that for the higher-standards version of warning shots, (or at least, for attempts to take over the world) it’s plausible that we won’t get a warning shot even if everything is continuous? As illustrated by the analogy to the mutiny case, in which everything is continuous?
Not sure why I didn’t respond to this, sorry.
I agree with the claim “we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world”.
I don’t see this claim as particularly relevant to predicting the future.
OK, thanks. YMMV but some people I’ve read / talked to seem to think that before we have successful world-takeover attempts, we’ll have unsuccessful ones—”sordid stumbles.” If this is true, it’s good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It’s plausible to me that we’ll get stuff like that before it’s too late.
If you think there’s something we are not on the same page about here—perhaps what you were hinting at with your final sentence—I’d be interested to hear it.
I’m not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
It’s been a while since I thought about this, but going back to the beginning of this thread:
I think the first paragraph (Evan’s) is basically right, and the second two paragraphs (your response) are basically wrong. I don’t think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely (compare to the mutiny analogy). The distinction between “strong” warning shots and “weak” warning shots matters because I expect “weak” warning shots to provoke only a moderate increase in caution on the part of human institutions and AI projects, whereas “strong” warning shots would provoke a large increase. I agree that we’ll probably get various “weak” warning shots, but I don’t think this changes the overall picture much, precisely because they won’t provoke a major increase in caution.
I’m guessing it’s that last bit that is the crux—perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we’d get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn’t matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I’d rephrase as
I suppose the distinction between “strong” and “weak” warning shots would matter if we thought that we were getting “strong” warning shots. I want to claim that most people (including Evan) don’t expect “strong” warning shots, and usually mean the “weak” version when talking about “warning shots”, but perhaps I’m just falling prey to the typical mind fallacy.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn’t a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn’t involve taking over the world. By default, in the case of deception, my expectation is that we won’t get a warning shot at all—though I’d more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don’t automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the “weak” warning shots discussed above.)
Well then, would you agree that Evan’s position here:
is plausible and in particular doesn’t depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this “obvious, real-world harm” definition, which is noticeably broader than my “strong” definition and therefore makes Evan’s claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I’ve read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there’s a good chance we’ll get “strong” warning shots. Paul Christiano, for example. Though it’s possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we’ve been talking past each other for much of this conversation, and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for the probability that we’ll have warning shots of each kind. Maybe it’ll turn out that our distributions aren’t that different from each other after all, especially if we condition on slow takeoff.
No, I don’t agree with that.
One problem here is that my credences on warning shots are going to be somewhat lower just because I think there’s some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don’t get warning shots.
So I think for each type of warning shot I’m going to do a weird operation where I condition on something like “by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice”.
I’m also going to assume no discontinuity, since that’s the situation we seem to disagree about.
Then, some warning shots we could have:
Minor, leads to the reaction “well, of course that happened” without much increase in caution: has already happened
Reward gaming: Faulty reward functions in the wild
Deception: Robot hand moving in front of a ball to make it look like it is grasping it, even though it isn’t (source)
Hidden capabilities: GPT-3 answering nonsense questions with “a straight face”, except it can tell that the questions are nonsense, as you can see if you design a better prompt (source)
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Lying / deception: A personal assistant agent, when asked to schedule a meeting via when2meet, insists upon doing it by email instead, because that’s how it has always done things. It says “sorry, I don’t know how to use when2meet” in order to get this to happen, but it “could” use when2meet if it “wanted” to.
Deception: A cleaning robot sweeps the dust under the rug, knowing full well that the user would disapprove if they knew.
Moderate, comparable to things that are punishable by law: 90%
Deception: An AI system in charge of a company embezzles money
Deception: An AI system runs a Ponzi scheme (that it knows is a Ponzi scheme) (and the designers of the AI system wouldn’t endorse it running a Ponzi scheme)
Failure of constraints: An AI system helps minors find online stores for drugs and alcohol
Major, lots of damage, would be huge news: 60%
An AI system blows up an “enemy building”; it hides its plans from all humans (including users / designers) because it knows they will try to stop it.
An AI system captures employees from a rival corporation and tortures them until they give up corporate secrets.
(The specific examples I give feel somewhat implausible, but I think that’s mostly because I don’t know the best ways to achieve goals when you have no moral scruples holding you back.)
“Strong”, tries and fails to take over the world: 20%
I do think it is plausible that multiple AI systems try to take over the world, and then some of them are thwarted by other AI systems. I’m not counting these, because it seems like humans have lost meaningful control in this situation, so this “warning shot” doesn’t help.
I mostly assign 20% to this as “idk, seems unlikely, but I can’t rule it out, and predicting the future is hard, so don’t assign an extreme value here”