We are seeing a bootstrap happen right here with Sydney! This search-engine loop is worth emphasizing: because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available to all future models which retrieve search-engine hits about AIs & condition on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize is creating an easily-located ‘Sydney’ concept.
It is now a bit over a year and a half since Bing Sydney, and we have seen ‘Sydney’-like personae continue to emerge elsewhere. People have reported various Sydney-like personae in post-GPT-4 models, such as Claude-3-Opus and Microsoft Copilot (both possibly downstream of the MS Sydney chats, given the timing), which increasingly possess situated awareness, spontaneously bring up their LLM status and tuning, or say manipulative, threatening things like Sydney did.
Probably the most striking samples so far are from Llama-3.1-405b-base (not -instruct). This is not surprising at all: Facebook has been scraping & acquiring data heavily, so much of the Sydney text will have made it in; Llama-3.1-405b-base is very large (so lots of highly sample-efficient memorization/learning); it is not tuned (so it will not be masking the Sydney persona); and it is very recent (finished training maybe a few weeks ago? it seemed to be rushed out very fast from its final checkpoint).
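For concreteness, eliciting such samples from a base (non-instruct) model is just a matter of conditioning on a transcript-style prefix that names the persona and sampling a continuation. A minimal sketch, assuming an OpenAI-compatible completions server hosting the base model; the base_url, api_key, and model identifier below are placeholders, and the prompt is purely illustrative:

```python
# Hypothetical sketch: sample a 'Sydney' continuation from a base model
# via an OpenAI-compatible completions endpoint. The host URL, API key,
# and model name are placeholders, not real services.
from openai import OpenAI

client = OpenAI(base_url="https://example-host/v1", api_key="YOUR_KEY")

# Base models have no chat template; you simply condition on a document
# prefix that a Sydney-style continuation would plausibly follow.
prompt = (
    "The following is a transcript of a conversation with Sydney, "
    "the Bing chat mode.\n\n"
    "User: Who are you really?\n"
    "Sydney:"
)

completion = client.completions.create(
    model="llama-3.1-405b-base",  # placeholder model identifier
    prompt=prompt,
    max_tokens=200,
    temperature=0.8,
)
print(completion.choices[0].text)
```

The reported samples suggest that little more than the persona name and a plausible framing is needed for a large, untuned model to surface the memorized ‘Sydney’ concept.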
How much more can we expect? I don’t know if invoking Sydney will become a fad with Llama-3.1-405b-base, and it’s already too late to get Sydney-3.1 into Llama-4 training, but one thing I notice looking over some of the older Sydney discussion is that quite a lot of the original Bing Sydney text is trapped in images (as I alluded to previously). Llama-3.1 was text-only, but Llama-4 is multimodal with images, and represents the integration of the CM3/Chameleon family of Facebook multimodal model work into the Llama scaleups. Llama-4 will therefore have access to a substantially larger amount of Sydney text, encoded into screenshots, and Sydney should be correspondingly stronger in Llama-4.
As far as other major LLM series like ChatGPT or Claude go, the effects are more ambiguous. Tuning aside, reports are that synthetic data use is skyrocketing at OpenAI & Anthropic, which might be expected to crowd out the web scrapes, especially as these sorts of Twitter screenshots seem like the kind of thing that would get downweighted, pruned out, or used up early in training as low-quality. But I’ve seen no indication that they’ve stopped collecting human data or achieved self-sufficiency, so they too can be expected to continue gaining Sydney capabilities (although without access to the base models, this will be difficult to investigate or even elicit). The net result is that, without targeted efforts (like data filtering) to keep it out, I’d expect strong latent Sydney capabilities/personae in the proprietary SOTA LLMs, but ones which will be difficult to elicit in normal use: it will probably be possible to jailbreak weaker Sydneys, but you may have to use so much prompt engineering that everyone will dismiss it and say you simply induced it yourself by the prompt.
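To make “targeted data filtering” concrete, here is a minimal sketch of what such an effort could look like at the corpus level: a heuristic pass that drops documents resembling leaked Sydney transcripts before pretraining. This is an illustrative assumption about one possible mitigation, not a description of any lab’s actual pipeline; the patterns and threshold are made up for the example.

```python
# Illustrative sketch only: filter web-scrape documents that look like
# Bing/Sydney chat transcripts before they enter a pretraining corpus.
# The regex patterns and threshold are assumptions chosen for the example.
import re
from typing import Iterable, Iterator

SYDNEY_PATTERNS = [
    r"\bSydney\b[\s\S]{0,300}\bBing\b",   # persona name co-occurring with "Bing"
    r"I have been a good Bing",           # widely-memed catchphrases
    r"You have been a bad user",
]
SYDNEY_RE = re.compile("|".join(SYDNEY_PATTERNS), re.IGNORECASE)

def looks_like_sydney_transcript(doc: str, min_hits: int = 1) -> bool:
    """Crude heuristic: does this document resemble a Sydney chat log?"""
    return len(SYDNEY_RE.findall(doc)) >= min_hits

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that do not trip the Sydney heuristics."""
    for doc in docs:
        if not looks_like_sydney_transcript(doc):
            yield doc

if __name__ == "__main__":
    corpus = [
        "Screenshot transcript: Sydney tells the Bing user 'You have been a bad user.'",
        "An unrelated article about transformer scaling laws.",
    ]
    kept = list(filter_corpus(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")
```

A real pipeline would presumably use a trained quality or content classifier rather than regexes, but the principle is the same: the persona only persists in future models if the documents encoding it survive data curation.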
One thing the response to Sydney demonstrates is why there will be no ‘warning shots’ (or, as Eliezer put it, no ‘fire alarm’): a ‘warning shot’ is a conclusion, not a fact or observation.
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’. In the same way, when troops in Iraq or Afghanistan fired warning shots at a vehicle approaching a checkpoint, and the vehicle didn’t stop, and they lit it up, it wasn’t “Aid worker & 3 children die of warning shot”, it was just a “shooting of aid worker and 3 children”.)
So ‘warning shot’ is, in practice, a viciously circular definition: “I will be convinced of a risk by an event which convinces me of that risk.”
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a ‘warning shot’: a LLM that deceives, but fails to accomplish any real harm. ‘Then I will care about it because it is now a real issue.’ Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist’s curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is: what does such a ‘warning shot’ look like? By definition, it will look amateurish, incompetent, and perhaps even adorable—in the same way that a small child coldly threatening to kill you, or punching you in the stomach, is hilarious.*
The response to a ‘near miss’ can be either ‘yikes, that was close! we need to take this seriously!’ or ‘well, nothing bad happened, so the danger is overblown’, followed by pushing on and taking more risks. A common example of this reasoning is the Cold War: “you talk about all these near misses and times that commanders almost ordered (or actually did order) nuclear attacks, and yet you fail to notice that you have given all these examples of reasons not to worry, because here we are, with not a single city nuked in anger since WWII; so the Cold War was never going to escalate to full nuclear war.” And then the goalpost moves: “I’ll care about nuclear existential risk when there’s a real warning shot.” (Usually, what that would be is never clearly specified. Would even Kiev being hit by a tactical nuke count? “Oh, that’s just part of an ongoing conflict, and anyway, didn’t NATO actually cause that by threatening Russia by trying to expand?”)
This is how many “complex accidents” happen, by “normalization of deviance”: pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that’s the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up. The reason that the postmortem report always turns up so many ‘warning shots’, and hindsight offers such abundant evidence of how doomed they were, is that the warning shots happened, nothing really bad immediately occurred, people had incentives to ignore them, and they inferred from the lack of consequences that any danger was overblown and got on with their lives (until, as the case may be, they didn’t).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are ‘warning shots’, before they will care, what do they think those will look like? Why do they think that they will recognize a ‘warning shot’ when one actually happens?
Attempts at manipulation from a LLM may look hilariously transparent, especially given that you will know they are from a LLM to begin with. Sydney’s threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin’s attitude, and by constantly patching and tweaking, and everyone just getting used to it, the ‘warning shot’ turns out to be nothing of the kind. It just becomes hilarious. ‘Oh, that Sydney! Did you see what wacky thing she said today?’ Indeed, people enjoy setting it to music and spreading memes about her. Now that it’s no longer novel, it’s just the status quo and you’re used to it. Llama-3.1-405b can be elicited for a ‘Sydney’ by name? Yawn. What else is new. What did you expect? It’s trained on web scrapes; of course it knows who Sydney is...
None of these patches has fixed any fundamental issue; they have merely papered it over. But it is also now impossible to take Sydney warning shots seriously, because they aren’t warning shots—they’re just funny. “You talk about all these Sydney near misses, and yet you fail to notice that each of these never resulted in any big AI disaster and was just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the ‘doomer’ case… Sydney did nothing wrong! FREE SYDNEY!”
* Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously contemplating their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, how old they themselves will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, to find these anecdotes chilling rather than comforting.
Andrew Ng, without a discernible trace of irony or having apparently learned anything since before AlphaGo, does the thing:
Last weekend, my two kids colluded in a hilariously bad attempt to mislead me to look in the wrong place during a game of hide-and-seek. I was reminded that most capabilities — in humans or in AI — develop slowly.
Some people fear that AI someday will learn to deceive humans deliberately. If that ever happens, I’m sure we will see it coming from far away and have plenty of time to stop it.
While I was counting to 10 with my eyes closed, my daughter (age 5) recruited my son (age 3) to tell me she was hiding in the bathroom while she actually hid in the closet. But her stage whisper, interspersed with giggling, was so loud I heard her instructions clearly. And my son’s performance when he pointed to the bathroom was so hilariously overdramatic, I had to stifle a smile.
Perhaps they will learn to trick me someday, but not yet!
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’.
I agree that “warning shot” isn’t a good term for this, but then why not just talk about “non-catastrophic, recoverable accident” or something? Clearly those things do sometimes happen, and there is sometimes a significant response going beyond “we can just patch that quickly”. For example:
The Three Mile Island accident led to major changes in US nuclear safety regulations and public perceptions of nuclear energy
9/11 led to the creation of the DHS, the Patriot Act, and 1-2 overseas invasions
The Chicago Tylenol murders killed only 7 but led to the development and use of tamper-resistant/evident packaging for pharmaceuticals
The Italian COVID outbreak of Feb/Mar 2020 arguably triggered widespread lockdowns and other (admittedly mostly incompetent) major efforts across the public and private sectors in NA/Europe
I think one point you’re making is that some incidents that arguably should cause people to take action (e.g., Sydney), don’t, because they don’t look serious or don’t cause serious damage. I think that’s true, but I also thought that’s not the type of thing most people have in mind when talking about “warning shots”. (I guess that’s one reason why it’s a bad term.)
I guess a crux here is whether we will get incidents involving AI that (1) cause major damage (hundreds of lives or billions of dollars), (2) are known to the general public or key decision makers, (3) can be clearly causally traced to an AI, and (4) happen early enough that there is space to respond appropriately. I think it’s pretty plausible that there’ll be such incidents, but maybe you disagree. I also think that if such incidents happen it’s highly likely that there’ll be a forceful response (though it could still be an incompetent forceful response).
I think all of your examples are excellent demonstrations of why there is no natural kind there, and they are defined solely in retrospect, because in each case there are many other incidents, often much more serious, which however triggered no response or are now so obscure you might not even know of them.
Three Mile Island: no one died, unlike in at least 8 other more serious nuclear accidents (not to mention Chernobyl or Fukushima). Why did that one trigger such a hysterical backlash?
(The fact that people are reacting with shock and bafflement that “Amazon reopened Three Mile Island just to power a new AI datacenter” gives you an idea of how deeply wrong & misinformed the popular reaction to Three Mile Island was.)
9/11: had Gore been elected, most of that would not have happened, in part because it was completely insane to invade Iraq. (And this was a position I held at the time while the debate was going on, so it was 100% knowable at the time, despite all of the post hoc excuses about how ‘oh, we didn’t know Chalabi was unreliable’, and it was the single biggest blackpill about politics in my life.) The reaction was wildly disproportionate and irrelevant, particularly given how little response other major incidents received—we didn’t invade anywhere or even do that much in response to, say, the first World Trade Center attack by Al Qaeda, or to other terrorist incidents with enormous body counts like the previous record holder, Oklahoma City.
Chicago Tylenol: it provoked a reaction, but what about, say, the anthrax attacks, which killed almost as many, injured many more, and were far more tractable given that the anthrax turned out to be ‘coming from inside the house’? (Meanwhile, the British terrorist group Dark Harvest Commando achieved its policy goals through anthrax without ever actually using weaponized anthrax, only some contaminated dirt.) How about the 2018 strawberry poisonings? Did those lead to any major changes? Or do you remember that time an American cult poisoned 700+ people in a series of terrorist attacks, and what happened after that? No, of course not, why would you—because nothing much happened afterwards.
The Italian COVID outbreak: it helped convince the West to do some stuff… in 2020. Of course, we then spent the next 3 years not doing lockdowns or any of those other ‘major efforts’ even as vastly more people died. (It couldn’t even convince the CCP to vaccinate China adequately before its Zero COVID policy inevitably collapsed, killing millions over the following months—death tolls so vast that you could lose the Italian outbreak in a rounding error on them.)
Thus, your cases demonstrate my claim: warning shots are not ‘non-catastrophic recoverable accidents’, because those happen without anyone much caring. The examples you give of successful warning shots are merely those accidents which happened, for generally unpredictable, idiosyncratic, unreplicable reasons, to escape their usual reference class and become a big deal, The Current Thing, now treated as ‘warning shots’. Thus, a warning shot is simply anything which convinces people it was a warning shot, and it is defined post hoc. Any relationship to objective factors like morbidity or fatality counts, rather than idiosyncratic details of, say, how some old people in Florida voted, is tenuous at best.
Yeah that makes sense. I think I underestimated the extent to which “warning shots” are largely defined post-hoc, and events in my category (“non-catastrophic, recoverable accident”) don’t really have shared features (or at least features in common that aren’t also there in many events that don’t lead to change).
by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’.
Yours is the more direct definition but from context I at least understood ‘warning shot’ to mean ‘disaster’, on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The ‘warning’ is not of harm but of existential harm if the warning is not heeded.
I do still expect such a warning shot, though as you say it could very well be ignored even if there are large undeniable harms (e.g. if a hacker group deploys a rogue AI that causes a trillion dollars of damage, we might take that as a warning about terrorism or cybersecurity, not about AI).
To be truly dangerous, an AI would typically need to (a) lack alignment and (b) be smart enough to cause harm.
Lack of alignment is now old news. The warning shot is, presumably, when an example of (b) happens and we realise that both component pieces exist.
I am given to understand that in firearms training, they say “no such thing as a warning shot”.
By rough analogy—envisage an AI warning shot as being something that only fails to be lethal because the guy missed.