I’d like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).
Does anyone have links to such public updates?
(Edit log: replaced “scheming” with “scheming behavior”.)
FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)
It’s somewhat sensitive to the exact objection the person came in with.
My guess is that most reasonable perspectives should update toward thinking scheming has at least a tiny chance of occurring (>2%), but I wouldn’t say a view of <<2% was decisively refuted.
> FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)
Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: “scheming”. I’ve replaced “scheming” with “scheming behavior”.
> It’s somewhat sensitive to the exact objection the person came in with.
I agree with this. That said, as per the above, I think the strongest objections I can generate to “scheming was very unlikely in the default course of events” being refuted have the following shape: if we had the tools to examine Claude’s internal cognition and figure out what “caused” the scheming behavior, it would be something non-central like “priming” or “role-playing” (in a way that wouldn’t generalize to “real” scenarios). Do you have other objections in mind?
Someone could have objections to validity or the assumptions of our paper. On validity, something like priming could be relevant. On the assumptions, they could e.g. think scheming is very unlikely due to thinking that future AIs will be intentionally trained to be highly myopic and corrigible while also thinking that other possible sources of goal conflict are very unlikely. (I’d disagree with this view, but I don’t think this view is totally crazy and it isn’t refuted by our paper.)
I think our work doesn’t very clearly refute this post, though I also just think the post is missing multiple important considerations and is overall pretty wrong and confused in its arguments.
The reviews might also be interesting to look at. I’m not sure if Jacob Andreas and Jasjeet Sekhon have publicly stated prior views on the topic. Yoshua Bengio and Rohin Shah were broadly sympathetic to scheming concerns or similar before.
Quoting Zvi’s post:
> I don’t know of any other clear cut cases.