I think “makes 50% of currently-skeptical people change their minds” is a high bar for a warning shot. On that definition e.g. COVID-19 will probably not be a warning shot for existential risk from pandemics. I do think it is plausible that AI warning shots won’t be much better than pandemic warning shots. (On your definition it seems likely that there won’t ever again be a warning shot for any existential risk.)
For a more normal bar, I expect plenty of AI systems to fail at large scales in ways that seem like “malice,” and then to cover up the fact that they’ve failed. AI employees will embezzle funds, AI assistants will threaten and manipulate their users, AI soldiers will desert. Events like this will make it clear to most people that there is a serious problem, which plenty of people will be working on in order to make AI useful. The base rate will remain low but there will be periodic high-profile blow-ups.
I don’t expect the kind of total unity of AI motivations that you are imagining, where all of them want to take over the world (so that the only case where you see something frightening is a failed bid to take over the world). That seems pretty unlikely to me, though it’s conceivable (maybe 10-20%?) and may be an important risk scenario. I think it’s much more likely that we stamp out all of the other failures gradually, and are left with only the patient+treacherous failures, and in that case whether it’s a warning shot or not depends entirely on how much people are willing to generalize.
I do think the situation in the AI community will be radically different after observing these kinds of warning shots, even if we don’t observe an AI literally taking over a country.
There is a very narrow range of AI capability between “too stupid to do significant damage of the sort that would scare people” and “too smart to fail at takeover if it tried.”
Why do you think this is true? Do you think it’s true of humans? I think it’s plausible if you require “take over a country” but not if you require e.g. “kill plenty of people” or “scare people who hear about it a lot.”
(This is all focused on intent alignment warning shots. I expect there will also be other scary consequences of AI that get people’s attention, but the argument in your post seemed to be just about intent alignment failures.)
Thanks for this reply. Yes, I was talking about intent alignment warning shots. I agree it would be good to consider smaller warning shots that convince, say, 10% of currently-skeptical people. (I think it is too early to say whether COVID-19 is a 50%-warning shot for existential risk from pandemics. If it does end up killing millions, the societal incompetence necessary to get us to that point will be apparent to most people, I think, and thus most people will be on board with more funding for pandemic preparedness even if before they would have been “meh” about it.) If we are looking at 10%-warning shots, smaller-scale events like the ones you are talking about will be more viable.
(Whereas if we are looking at 50%-warning shots, it seems like at least attempting to take over the world is almost necessary, because otherwise skeptics will say “OK yeah so one bad apple embezzled some funds, that’s a far cry from taking over the world. Most AIs behave exactly as intended, and no small group of AIs has the ability to take over the world even if it wanted to.”)
I’m not imagining that they all want to take over the world. I was just imagining that minor failures wouldn’t be sufficiently convincing to count as 50%-warning shots, and it seems you agree with me on that.
Yes, I think it’s true of humans: Almost all humans are incapable of getting even close to taking over the world. There may be a few humans who have a decent shot at it and also the motivation and incaution to try it, but they are a very small fraction. And if they were even more competent than they already are, their shot at it would be more than decent. I think the crux of our whole disagreement here was just the thing you identified right away about 50% vs. 10% warning shots. Obviously there are plenty of humans capable of and willing to do evil things, and if doing evil things is enough to count as a warning shot, then yeah, it’s not true of humans, and neither would it be true of AI.
I think you’ve also pointed out an unfairness in my definition, which was about single events. A series of separate minor events gradually convincing most skeptics is just as good, and now that you mention it, much more likely. From now on, when I think of warning shots, I’ll focus on these sorts of things.
(On your definition it seems likely that there won’t ever again be a warning shot for any existential risk.)
The only risk I can think of that might meet the post’s definition is an asteroid that has to be (successfully) diverted from colliding with the Earth.
But I think you might be right that even that wouldn’t convince “50% of currently-skeptical people [to] change their minds”.