It is funny how AGI-via-LLM could make all our narratives about dangerous AIs into a self-fulfilling prophecy: AIs breaking their containment in clever and surprising ways, circumventing their laws (cf. Asimov’s fiction), generally turning evil (with optional twisted logic), becoming self-preserving, emotion-driven, or otherwise more human-like. That these stories are written for interesting narrative and drama, and that other works commonly anthropomorphise AIs, likely does not help either.
John’s comment about the fundamental distinction between role-playing what e.g. a murderous psycho would say and actually planning murder fully applies here. Stories about AIs may not be a major concern of alignment now (or more precisely, they fall within the larger problem of dealing with context, reality vs. fiction, the assistant’s presumed identity/role, the subjectivity of opinions present in the data, etc.), but the pattern is part of the prior about the world implied by the current training data (i.e. internet scrapes) and has to be actively dealt with somehow to be actually harmless (possibly by properly contextualizing the identity/role of the AI as something other than the existing AI trope).
I feel like there’s a distinction between “self-fulfilling prophecy” and “overdetermined” that seems important here. I think in a world full of AI risk stories, LLMs maybe become risky sooner than they do in a world without any AI risk stories, but I think they still become risky eventually in the latter world, and so it’s not really a self-fulfilling prophecy.
It does still seem like a particularly hilariously undignified way to die tho
I assume you mean that we are doomed anyway, so this technically does not change the odds? ;-)
More seriously, I am not assuming any particular level of risk from LLMs above; it is meant more as a humorous (if sad) observation.
The effect size also isn’t at the usual “self-fulfilling” level, as this is unlikely to have more than (say) 1% (relative) influence. Though I would not be surprised if some of the current Bing chatbot behavior is in nontrivial part caused by the cultural expectations/stereotypes of an anthropomorphized and mischievous AI (weakly held, though).
(By the way, I am not at all implying that discussing AI doom scenarios would be likely to significantly contribute to the (very hypothetical) effect—if there is some effect, then I would guess it stems from the ubiquitous narrative patterns and tropes of our written culture rather than a few concrete stories.)