A thing I wrote on FB a few months ago, in response to someone asking if a warning shot might happen:
[… I]t’s a realistic possibility, but I’d guess it won’t happen before AI destroys the world, and if it happens I’m guessing the reaction will be stupid/panicky enough to just make the situation worse.
(It’s also possible it happens before AI destroys the world, but six weeks before rather than six years before, when it’s too late to make any difference.)
A lot of EAs feel confident we’ll get a “warning shot” like this, and/or are mostly predicating their AI strategy on the assumption that “warning-shot-ish things will happen and suddenly everyone will get serious and do much more sane things”. Which doesn’t sound like, e.g., how the world reacted to COVID or 9/11, though it sounds a bit like how the world (eventually) reacted to nukes and maybe to the recent Ukraine invasion?
Someone then asked why I thought a warning shot might make things worse, and I said:
It might not buy time, or might buy orders of magnitude less time than matters; and/or some combination of:
- the places that are likely to have the strictest regulations are (maybe) the most safety-conscious parts of the world. So you may end up slowing down the safety-conscious researchers much more than the reckless ones.
- more generally, it’s surprising and neat that the front-runner (DM) is currently one of the labs least allergic to thinking about AI risk. I don’t think it’s anywhere near sufficient, but if we reroll the dice we should by default expect a worse front-runner.
- regulations and/or safety research are misdirected, because people have bad models now and are therefore likely to have bad models when the warning shot happens, and warning shots don’t instantly fix bad underlying models.
The problem is complicated, and steering in the right direction requires that people spend time (often years) setting multiple parameters to the right values in a world-model. Warning shots might at best fix a single parameter, ‘level of fear’, not transmit the whole model. And even if people afterwards start thinking more seriously and thereby end up with better models down the road, their snap reaction to the warning shot may lock in sticky bad regulations, policies, norms, culture, etc., because they don’t already have the right models before the warning shot happens.
- people tend to make worse decisions (if it’s a complicated issue like this, not just ‘run from tiger’) when they’re panicking and scared and feeling super rushed. As AGI draws visibly close / more people get scared (if either of those things ever happen), I expect more person-hours spent on the problem, but I also expect more rationalization, rushed and motivated reasoning, friendships and alliances breaking under the emotional strain, uncreative and on-rails thinking, unstrategic flailing, race dynamics, etc.
- if regulations or public backlash do happen, these are likely to sour a lot of ML researchers on the whole idea of AI safety and/or sour them on xrisk/EA ideas/people. Politicians or the public suddenly caring or getting involved can easily cause a counter-backlash that makes AI alignment progress even more slowly than it would have by default.
- software is not very regulatable, software we don’t understand well enough to define is even less regulatable, whiteboard ideas are less regulatable still, you can probably run an AGI on a not-expensive laptop eventually, etc.
So regulation is mostly relevant as a way to try to slow everything down indiscriminately, rather than as a way to specifically target AGI; and it would be hard to make it have a large effect on that front, even if this would have a net positive effect.
- a warning shot could convince everyone that AI is super powerful and important and we need to invest way more in it.
- (insert random change to the world I haven’t thought of, because events like these often have big random hard-to-predict effects)
Any given big random change will tend to be bad on average, because the end-state we want requires setting multiple parameters to pretty specific values and any randomizing effect will be more likely to break a parameter we already have in approximately the right place, than to coincidentally set a parameter to exactly the right value.
There are far more ways to set the world to the wrong state than the right one, so adding entropy will usually make things worse.
We may still need to make some high-variance choices like this, if we think we’re just so fucked that we need to reroll the dice and hope to have something good happen by coincidence. But this is very different from expecting the reaction to a warning shot to be a good one. (And even in a best-case scenario we’ll need to set almost all of the parameters via steering rather than via rerolling; rerolling can maybe get us one or even two values close-to-correct if we’re crazy lucky, but the other eight values will still need to be locked in by optimization, because relying on ten independent one-in-ten coincidences to happen is obviously silly.)
- oh, [redacted]’s comments remind me of a special case of ‘worse actors replace the current ones’: AI is banned or nationalized and the UK or US government builds it instead. To my eye, this seems a lot likelier to go poorly than the status quo.
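The rerolling arithmetic above can be made concrete with a toy model. The model is an illustrative assumption, not part of the original argument: suppose each parameter, when rerolled, lands uniformly on one of $k$ values, exactly one of which is right, and the parameters are independent of each other. Then:

```latex
% Toy model (assumption): uniform reroll over k values, one correct value, independent parameters
\begin{aligned}
P(\text{reroll breaks a parameter that is already right}) &= \frac{k-1}{k}, \\
P(\text{reroll fixes a parameter that is wrong}) &= \frac{1}{k}, \\
P(\text{ten independent one-in-ten coincidences all happen}) &= \left(\frac{1}{10}\right)^{10} = 10^{-10}.
\end{aligned}
```

With $k = 10$, a reroll is nine times likelier to break a parameter that is already right than to fix one that is wrong, and getting all ten parameters right by luck alone has odds of one in ten billion.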
There are plenty of scenarios that I think make the world go a lot better, but I don’t think warning shots are one of them.
I’m not sure about comparing warning shots to adding entropy more generally, cause warning shots are very specific and useful datapoints, not just any random data.
I don’t think warning shots are random, but if they have a large impact it may be in unexpected directions, perhaps for butterfly-effect-y reasons.
[Just for definitions’ sake, I’m assuming a warning shot here is an AI that a) clearly demonstrates deceptive intent and malevolent-to-human intent, and b) either has enough capabilities to do some damage, or enough capability that one can much more easily extrapolate why increasing capability will do damage, and this extrapolation can be done even if you don’t have a really good world model.]
I’m not defining warning shots that way; I’d be much more surprised to see an event like that happen (because it’s more conjunctive), and I’d be much more confident that a warning shot like that won’t shift us from a very-bad trajectory to an OK one (because I’d expect an event like that to come very shortly before AGI destroys or saves the world, if an event like that happened at all).
When I say ‘warning shot’ I just mean ‘an event where AI is perceived to have caused a very large amount of destruction’.
The warning shots I’d expect to do the most good are ones like:
‘20 or 30 or 40 years before we’d naturally reach AGI, a huge narrow-AI disaster unrelated to AGI risk occurs. This disaster is purely accidental (not terrorism or whatever). Its main effect is just to put it within the Overton window for a wider variety of serious technical people to talk about scary AI outcomes at all, and maybe it slows timelines by five years or whatever. Also, somehow none of this causes discourse to become even dumber; e.g., people don’t start dismissing AGI risk because “the real risk is narrow AI symptoms like the one we just saw”, and there isn’t a big ML backlash to regulatory/safety efforts, and so on.’
I don’t expect anything at all like that to happen, not least because I suspect we may not have 20+ years left before AGI. But that’s a scenario where I could imagine real, modest improvements. Maybe. Optimistically.
And I can see why convincing anyone this late has less benefit, but maybe it’s still worth it?
I don’t know what you mean by “worth it”. I’m not planning to make a warning shot happen, and would strongly advise that others not do so either. :p
A very late-stage warning shot might help a little, but the whole scenario seems unimportant and ‘not where the action is’ to me. The action is in slow, unsexy earlier work to figure out how to actually align AGI systems (or failing that, how to achieve some non-AGI world-saving technology). Fantasizing about super-unlikely scenarios that wouldn’t even help strikes me as a distraction from figuring out how to make alignment progress today.
I’m much more excited by scenarios like: ‘a new podcast comes out that has top-tier-excellent discussion of AI alignment stuff, it becomes super popular among ML researchers, and the culture, norms, and expectations of ML thereby shift such that water-cooler conversations about AGI catastrophe are more serious, substantive, informed, candid, and frequent’.
It’s rare for a big positive cultural shift like that to happen; but it does happen sometimes, and it can result in very fast changes to the Overton window. And since it’s a podcast containing many hours of content, there’s the potential to seed subsequent conversations with a lot of high-quality background thoughts.
By comparison, I’m less excited about individual researchers who explicitly say the words ‘I’ll only work on AGI risk after a catastrophe has happened’. This is a really foolish view to consciously hold, and makes me a lot less optimistic about the relevance and quality of research that will end up produced in the unlikely event that (a) a catastrophe actually happens, and (b) they actually follow through and drop everything they’re working on to do alignment.