My quick reactions:
Interp: no inside view of my own; my impression is that my friends are pessimistic about such safety cases being achievable in the next few years
Control: hooray for Anthropic finally writing something about control. Some details seem good; some seem odd (e.g. “contexts that were specifically whitelisted”). I think the naive pre-deployment honeypots path doesn’t work, because a scheming model wins with the defect-rarely strategy: if it defects only with tiny probability per episode, it will almost certainly pass a bounded number of honeypot trials while still expecting to defect during the far larger number of deployment episodes (related: https://www.lesswrong.com/posts/vYWDEtJ2b8tkbG7Rv/behavioral-red-teaming-is-unlikely-to-produce-clear-strong; see the sketch after this list). Very sophisticated or expensive honeypotting setups could work, but the post doesn’t seem to be imagining those or to notice the issue. That’s important, and the current language in the post is concerning.
Incentives: I don’t really understand this one; insofar as I do understand it, I’m dubious, and I expect A3.1 would be very hard to demonstrate
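
To make the defect-rarely point concrete, here’s a minimal back-of-the-envelope sketch. The numbers are illustrative assumptions of mine, not from the post:

```python
# Illustrative sketch (assumed numbers): why "defect rarely" beats naive
# pre-deployment honeypotting. A scheming model that defects with small
# probability p per opportunity will usually pass N honeypot episodes,
# yet still expects many defections across M >> N deployment episodes.

p = 1e-5         # assumed per-episode defection probability
N = 10_000       # assumed number of pre-deployment honeypot episodes
M = 100_000_000  # assumed number of real deployment episodes

p_caught_predeploy = 1 - (1 - p) ** N  # chance any honeypot catches a defection
expected_deploy_defections = p * M     # expected defections once deployed

print(f"P(caught in pre-deployment honeypots) ~ {p_caught_predeploy:.3f}")   # ~0.095
print(f"Expected defections in deployment     ~ {expected_deploy_defections:.0f}")  # ~1000
```

So under these assumptions the model slips past the honeypots roughly 90% of the time while still defecting ~1000 times in deployment, which is why the naive version of the argument doesn’t go through.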