ryan_greenblatt comments on RobertM’s Shortform

ryan_greenblatt Jan 2, 2025, 4:11 AM
12 points
3
FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)

It’s somewhat sensitive to the exact objection the person came in with.

My guess is that most reasonable perspectives should update toward thinking scheming has at least a tiny chance of occurring (>2%), but I wouldn’t say a view of <<2% was decisively refuted.
- RobertM Jan 2, 2025, 6:20 AM
  2 points
  0
  Parent
  FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)
  Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: “scheming”. I’ve replaced “scheming” with “scheming behavior”.
  It’s somewhat sensitive to the exact objection the person came in with.
  I agree with this. That said, as per above, I think the strongest objections I can generate to “scheming was very unlikely in the default course of events” being refuted are of the following shape: if we had the tools to examine Claude’s internal cognition and figure out what “caused” the scheming behavior, it would be something non-central like “priming”, “role-playing” (in a way that wouldn’t generalize to “real” scenarios), etc. Do you have other objections in mind?
  - ryan_greenblatt Jan 2, 2025, 4:18 PM
    2 points
    0
    Parent
    
    Do you have other objections in mind?
    
    Someone could have objections to the validity or the assumptions of our paper. On validity, something like priming could be relevant. On the assumptions, they could e.g. think scheming is very unlikely because they expect future AIs to be intentionally trained to be highly myopic and corrigible, while also thinking that other possible sources of goal conflict are very unlikely. (I’d disagree with this view, but I don’t think this view is totally crazy and it isn’t refuted by our paper.)
    
    I think our work doesn’t very clearly refute this post, though I also just think the post is missing multiple important considerations and is overall pretty wrong and confused in its arguments.