I’d like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).
Does anyone have links to such public updates?
(Edit log: replaced “scheming” with “scheming behavior”.)
FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)
It’s somewhat sensitive to the exact objection the person came in with.
My guess is that most reasonable perspectives should update toward thinking scheming has at least a tiny chance of occurring (>2%), but I wouldn’t say a view of <<2% was decisively refuted.
> FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)
Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: “scheming”. I’ve replaced “scheming” with “scheming behavior”.
> It’s somewhat sensitive to the exact objection the person came in with.
I agree with this. That said, as per the above, I think the strongest objections I can generate to “scheming was very unlikely in the default course of events” being refuted have the following shape: if we had the tools to examine Claude’s internal cognition and figure out what “caused” the scheming behavior, it would be something non-central like “priming” or “role-playing” (in a way that wouldn’t generalize to “real” scenarios). Do you have other objections in mind?
Someone could have objections to validity or the assumptions of our paper. On validity, something like priming could be relevant. On the assumptions, they could e.g. think scheming is very unlikely due to thinking that future AIs will be intentionally trained to be highly myopic and corrigible while also thinking that other possible sources of goal conflict are very unlikely. (I’d disagree with this view, but I don’t think this view is totally crazy and it isn’t refuted by our paper.)
I think our work doesn’t very clearly refute this post, though I also just think the post is missing multiple important considerations and is overall pretty wrong and confused in its arguments.
The reviews might also be interesting to look at. I’m not sure if Jacob Andreas and Jasjeet Sekhon have publicly stated prior views on the topic. Yoshua Bengio and Rohin Shah were broadly sympathetic to scheming concerns or similar before.
Quoting Zvi’s post:
> I don’t know of any other clear cut cases.