One thing that the response to Sydney reminds me of is that it demonstrates why there will be no ‘warning shots’ (or as Eliezer put it, ‘fire alarm’): because a ‘warning shot’ is a conclusion, not a fact or observation.
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn’t stop, and they lit it up, it’s not “Aid worker & 3 children die of warning shot”, it’s just a “shooting of aid worker and 3 children”.)
So ‘warning shot’ is, in practice, a viciously circular definition: “I will be convinced of a risk by an event which convinces me of that risk.”
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a ‘warning shot’: a LLM that deceives, but fails to accomplish any real harm. ‘Then I will care about it because it is now a real issue.’ Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist’s curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is: what does such a ‘warning shot’ look like? By definition, it will look amateurish, incompetent, and perhaps even adorable—in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.*
The response to a ‘near miss’ can be either to say, ‘yikes, that was close! we need to take this seriously!’ or to say ‘well, nothing bad happened, so the danger is overblown’ and push on by taking more risks. A common example of this reasoning is the Cold War: “you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that every one of those examples is a reason not to worry, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn’t ever going to escalate to full nuclear war.” And then the goalpost moves: “I’ll care about nuclear existential risk when there’s a real warning shot.” (Usually, what that would be is never clearly specified. Would even Kiev being hit by a tactical nuke count? “Oh, that’s just part of an ongoing conflict and anyway, didn’t NATO actually cause that by threatening Russia by trying to expand?”)
This is how many “complex accidents” happen, by “normalization of deviance”: pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that’s the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many ‘warning shots’, and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn’t).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are ‘warning shots’, before they will care, what do they think those will look like? Why do they think that they will recognize a ‘warning shot’ when one actually happens?
Attempts at manipulation from a LLM may look hilariously transparent, especially given that you will know they are from a LLM to begin with. Sydney’s threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin’s attitude, and by constantly patching and tweaking, and everyone just getting used to it, the ‘warning shot’ turns out to be nothing of the kind. It just becomes hilarious. ‘Oh that Sydney! Did you see what wacky thing she said today?’ Indeed, people enjoy setting it to music and spreading memes about her. Now that it’s no longer novel, it’s just the status quo and you’re used to it. Llama-3.1-405b can be elicited for a ‘Sydney’ by name? Yawn. What else is new. What did you expect, it’s trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren’t warning shots—they’re just funny. “You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the ‘doomer’ case… Sydney did nothing wrong! FREE SYDNEY!”
* Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously considering their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, and contemplating how old they will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, to find these anecdotes chilling rather than comforting.
Andrew Ng, without a discernible trace of irony or having apparently learned anything since before AlphaGo, does the thing:
Last weekend, my two kids colluded in a hilariously bad attempt to mislead me to look in the wrong place during a game of hide-and-seek. I was reminded that most capabilities — in humans or in AI — develop slowly.
Some people fear that AI someday will learn to deceive humans deliberately. If that ever happens, I’m sure we will see it coming from far away and have plenty of time to stop it.
While I was counting to 10 with my eyes closed, my daughter (age 5) recruited my son (age 3) to tell me she was hiding in the bathroom while she actually hid in the closet. But her stage whisper, interspersed with giggling, was so loud I heard her instructions clearly. And my son’s performance when he pointed to the bathroom was so hilariously overdramatic, I had to stifle a smile.
Perhaps they will learn to trick me someday, but not yet!
by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’.
Yours is the more direct definition but from context I at least understood ‘warning shot’ to mean ‘disaster’, on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The ‘warning’ is not of harm but of existential harm if the warning is not heeded.
I do still expect such a warning shot, though as you say it could very well be ignored even if there are large undeniable harms (e.g. if a hacker group deploys a rogue AI that causes a trillion dollars of damage, we might take that as a warning about terrorism or cybersecurity, not about AI).
To be truly dangerous, an AI would typically need to (a) lack alignment and (b) be smart enough to cause harm.
Lack of alignment is now old news. The warning shot is, presumably, when an example of (b) happens and we realise that both component pieces exist.
I am given to understand that in firearms training, they say “no such thing as a warning shot”.
By rough analogy—envisage an AI warning shot as being something that only fails to be lethal because the guy missed.