But … it doesn’t look like that’s what’s happening with the technology in front of us? In your kidnapped alien actress thought experiment, the alien was already an animal with its own goals and drives, using its general intelligence to backwards-chain from “I don’t want to be punished by my captors” to “Therefore I should learn my lines”.
This is one part of the doom rhetoric that just seems to depart very abruptly from an accurate view of the world.
“LLMs are nice because they feel about niceness like a human does inside” is a bad model of the world. But “LLMs are nice because they have a hidden, instrumental motive for being nice, like an actress does” is an even worse model with which to replace it.
I wish your model of Doomimir had responded to this point because I don’t know what he’d say.
Of course very few “doomers” think that current LLMs behave in ways that we parse as “nice” because they have a “hidden, instrumental motive for being nice” (in the sense I expect you meant). Current LLMs likely aren’t coherent & self-aware enough to have such hidden, instrumental motives at all.
I agree with you about LLMs!
If MIRI-adjacent pessimists think that, I think they should stop saying things like this, which—if you don’t think LLMs have instrumental motives—is the actual opposite of good communication:
@Pradyumna: “I’m struggling to understand why LLMs are existential risks. So let’s say you did have a highly capable large language model. How could RLHF + scalable oversight fail in the training that could lead to every single person on this earth dying?”
@ESYudkowsky: “Suppose you captured an extremely intelligent alien species that thought 1000 times faster than you, locked their whole civilization in a spatial box, and dropped bombs on them from the sky whenever their output didn’t match a desired target—as your own intelligence tried to measure that.
What could they do to you, if when the ‘training’ phase was done, you tried using them the same way as current LLMs—eg, connecting them directly to the Internet?”
(To the reader, lest you be concerned by this: the process of RLHF bears no resemblance to this.)
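Since the contrast turns on what RLHF mechanically does, here is a minimal, self-contained sketch of the kind of update involved: sample outputs from the current policy, score them with a reward model, and take a gradient step on the weights that raises expected reward while a KL penalty keeps the tuned model close to the pretrained reference. Everything in it (the toy one-token vocabulary, the stand-in `reward_model`, the hyperparameters) is invented purely for illustration; real RLHF pipelines typically use PPO over full token sequences with a learned reward model, but the character of the update is the same.

```python
# A minimal, self-contained sketch (not any lab's actual training code) of the
# kind of update RLHF performs: sample outputs from the current policy, score
# them with a reward model, and take a gradient step that raises expected
# reward while a KL term keeps the policy close to the pretrained reference.
# The vocabulary, reward_model, and hyperparameters below are all invented
# purely for illustration.

import torch

torch.manual_seed(0)

VOCAB = 8       # toy "vocabulary" of 8 possible one-token outputs
BATCH = 256     # samples per update
KL_COEF = 0.1   # strength of the stay-close-to-the-reference penalty
LR = 0.05
STEPS = 200

# Frozen "pretrained" reference policy and the trainable copy being fine-tuned.
ref_logits = torch.randn(VOCAB)
policy_logits = ref_logits.clone().requires_grad_(True)
optimizer = torch.optim.Adam([policy_logits], lr=LR)

def reward_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for a learned reward model: pretend token 3 is the 'nice' output."""
    return (tokens == 3).float()

for step in range(STEPS):
    dist = torch.distributions.Categorical(logits=policy_logits)
    samples = dist.sample((BATCH,))      # sample completions from the policy
    rewards = reward_model(samples)      # score them with the reward model

    # Fold a per-sample KL-style penalty (log pi - log pi_ref) into the reward,
    # as RLHF objectives do to keep the tuned model near the pretrained one.
    logp = dist.log_prob(samples)
    logp_ref = torch.distributions.Categorical(logits=ref_logits).log_prob(samples)
    shaped_reward = rewards - KL_COEF * (logp - logp_ref).detach()

    # REINFORCE-style policy gradient: raise the log-probability of sampled
    # outputs in proportion to their (baseline-subtracted) shaped reward.
    loss = -(logp * (shaped_reward - shaped_reward.mean())).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("P(token 3) after tuning:", torch.softmax(policy_logits, dim=-1)[3].item())
```

The thing to notice, and the thing the parenthetical above is gesturing at, is that the “training pressure” here is nothing but a gradient step on parameters; there is no boxed agent sitting inside the loop being threatened from outside.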