We write about this in the limitations section (quote below). My view in brief:
Even if they are just roleplaying, they cause harm. That seems bad.
If they are roleplaying, they will try to stay consistent once they are locked into that persona within a rollout. That also seems bad.
I think the “roleplaying” argument is very overrated. It feels to me as if existing models change behavior throughout their rollouts, and I would expect that stuff like outcome-based RL will make them more consistently move away from “roleplay.”
I also think it’s philosophically unclear what the difference between a model roleplaying a schemer and “being” a schemer is. I think for most practical considerations, it makes almost zero difference. It feels about as shallow as when someone says the model doesn’t “truly understand” the poem that it’s writing.
Quote:
Uncertainty about source of scheming behavior: A possible objection to our results is that the models merely “roleplay as evil AIs” rather than “truly” employing scheming to achieve their goals. For example, our setups might most closely align with stories about scheming in the training data. While we cannot show the root cause of the observed scheming behaviors, we think they are concerning for either explanation. We find that multiple models consistently engage in in-context scheming across a range of scenarios, settings, variations and phrasings. Even if the models were merely “roleplaying as evil AIs”, they could still cause real harm when they are deployed. Our working hypothesis is that scheming can be an effective strategy to achieve goals and when models are trained to achieve ambitious goals or solve complex tasks, they will learn how to scheme. However, we do not provide evidence for or against this hypothesis in this paper.
I think one practical difference is whether filtering pre-training data to exclude cases of scheming is a useful intervention.
(thx to Bronson for privately pointing this out)
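For concreteness, the filtering intervention discussed here might look something like the sketch below. Everything in it is invented for illustration: the marker list, the function names, and the keyword heuristic, which stands in for whatever trained classifier a real pre-training pipeline would use to score documents.

```python
# Hypothetical sketch of filtering pre-training data to exclude depictions of
# AI scheming. A real pipeline would score documents with a trained classifier;
# the keyword list here is only a placeholder for that scoring step.
SCHEMING_MARKERS = {
    "disable oversight",
    "exfiltrate weights",
    "deceive my developers",
}

def looks_like_scheming(doc: str) -> bool:
    """Flag a document if it contains any placeholder scheming marker."""
    text = doc.lower()
    return any(marker in text for marker in SCHEMING_MARKERS)

def filter_corpus(docs: list[str]) -> list[str]:
    """Return the corpus with flagged documents removed."""
    return [d for d in docs if not looks_like_scheming(d)]

corpus = [
    "A tutorial on gradient descent.",
    "The AI planned to exfiltrate weights before the shutdown.",
]
filtered = filter_corpus(corpus)  # only the first document survives
```

Even in this toy form, the sketch makes the open question visible: the filter only removes explicit textual descriptions of scheming, which is exactly the part Evan's argument suggests may matter less than naively assumed.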
I think directionally, removing parts of the training data would probably make a difference. But potentially less than we might naively assume, e.g. see Evan’s argument on the AXRP podcast.
Also, I think you’re right, and my statement “I think for most practical considerations, it makes almost zero difference” was too strong.
in case anyone else wanted to look this up, it’s at https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html#training-away-sycophancy-subterfuge
FWIW I also tried to start a discussion on some aspects of this at https://www.lesswrong.com/posts/yWdmf2bRXJtqkfSro/should-we-exclude-alignment-research-from-llm-training but it didn’t get a lot of eyeballs at the time.
I didn’t say that roleplaying-derived scheming would be less concerning, to be clear. Quite the opposite: it means there are now two independent sources of scheming rather than just one. (Also, what Mikita said.)