habryka comments on Prometheus’s Shortform

habryka 16 Apr 2024 22:52 UTC
2 points
−11
It seems pretty likely to me that current AGIs are already scheming. At least it seems like the simplest explanation for things like the behavior observed in this paper: https://www.alignmentforum.org/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
- ryan_greenblatt 16 Apr 2024 23:28 UTC
  4 points
  0
  Parent
  I assume that by scheming you mean ~deceptive alignment? I think it’s very unlikely that current AIs are scheming and I don’t see how you draw this conclusion from that paper. (Maybe something about the distilled CoT results?)
  - habryka 16 Apr 2024 23:40 UTC
    2 points
    −2
    Parent
    The best definition I would have of “scheming” would be “the model is acting deceptively about its own intentions or capabilities in order to fool a supervisor” ^[1]. This behavior seems to satisfy that pretty solidly:
    Of course, in this case the scheming goal was explicitly trained for (as opposed to arising naturally out of convergent instrumental power drives), but it sure seems to me like it’s engaging in the relevant kind of scheming.
    I agree there is more uncertainty and lack of clarity on whether deceptively-aligned systems will arise “naturally”, but the above seems like a clear example of someone artificially creating a deceptively-aligned system.
    ^[1] Joe Carlsmith uses “whether advanced AIs that perform well in training will be doing so in order to gain power later”, but IDK, that feels really underspecified. Like, there are just tons of reasons why the AI will want to perform well in training for power-seeking reasons, and when I read the rest of the report it seems like Joe was more analyzing it through the deception-of-supervisors lens.
    - ryan_greenblatt 16 Apr 2024 23:43 UTC
      3 points
      0
      Parent
      I agree current models sometimes trick their supervisors ~intentionally and it’s certainly easy to train/prompt them to do so.
      
      I don’t think current models are deceptively aligned, and I think deceptive alignment would pose substantial additional risk.
      
      I personally like Joe’s definition and it feels like a natural category in my head, but I can see why you don’t like it. You should consider tabooing the word “scheming” or saying something more specific, since many people mean something more specific by it that is different from what you mean.
      - habryka 17 Apr 2024 0:04 UTC
        5 points
        0
        Parent
        Yeah, that makes sense. I’ve noticed miscommunications around the word “scheming” a few times, so am in favor of tabooing it more. “Engage in deception for instrumental reasons” seems like an obvious extension that captures a lot of what I care about.