Roko comments on The Dissolution of AI Safety

Roko 13 Dec 2024 1:29 UTC
4 points
0
To add: I didn’t expect this to be controversial but it is currently on −12 agreement karma!
- Charlie Steiner 13 Dec 2024 23:45 UTC
  3 points
  1
  Parent
  Yes, because it’s wrong. (1) because on a single token a LLM might produce text for reasons that don’t generalize like a sincere human answer would (e.g. the examples from the contrast-consistent search where certain false answers systematically differ from true answers along some vector), and (2) because KV cacheing during inference will preserve those reasons so they impact future tokens.
  - Roko 14 Dec 2024 0:30 UTC
    1 point
    −1
    Parent
    Re: (2) it will only impact output on the current generated output, once the output is over all that stuff will be reset and the only thing that remains is the model weights which were set in stone at train time.
    
    re: (1) “a LLM might produce text for reasons that don’t generalize like a sincere human answer would” it seems that current LLM systems are pretty good at generalizing like a human would and in some ways they are better due to being more honest, easier to monitor, etc
    - Charlie Steiner 14 Dec 2024 3:01 UTC
      3 points
      1
      Parent
      Re (2) it may also be recomputed if the LLM reads that same text later. Or systems operating in the real world might just keep a long context in memory. But I’ll drop this, because maintaining state or not seems somewhat irrelevant.
      
      (1) Yep, current LLM systems are pretty good. I’m not very convinced about generalization. It’s hard to test LLMs on outside distribution problems because currently they tend to just give dumb answers that aren’t that interesting.
      
      (Thinking of some guy who was recently hyped about asking o1 for the solution to quantum gravity—it gave the user some gibberish that he thought looked exciting, which would have been a good move in the RL training environment where the user has a reward button, but is just totally disconnected from how you need to interact with the real world.)
      
      But in a sense that’s my point (well, plus some other errors like sycophancy) - the reasons a present-day LLM uses a word can often be shown to generalize in some dumb way when you challenge it with a situation that the model isn’t well-suited for. This can be true at the same time it’s true that the model is pretty good at morality on the distribution it is competent over. This is still sufficient to show that present systems generalize in some amoral ways, and if we probably disagree about future ststems, this likely comes down to classic AI safetyist arguments about RL incentivizing deceiving of the user as the world-model gets better.
      - Roko 18 Dec 2024 5:19 UTC
        2 points
        0
        Parent
        
        some guy who was recently hyped about asking o1 for the solution to quantum gravity—it gave the user some gibberish
        
        yes, but this is pretty typical for what a human would generate.