@Zach Stein-Perlman I appreciate your recent willingness to evaluate and criticize safety plans from labs. I think this is likely a public good that is underprovided, given the strong incentives that many people have to maintain good standing with labs (not to mention more explicit forms of pressure applied by OpenAI and presumably other labs).
One thought: the difference between how you described the Anthropic RSP and how you described the OpenAI PF feels stronger than the actual quality difference between the documents. I agree with you that the thresholds in the OpenAI PF are too high, but I think the PF should get “points” for spelling out risks that go beyond ASL-3/misuse.
OpenAI has commitments that are insufficiently cautious for ASL-4+ (or what they would call high/critical on model autonomy), but Anthropic circumvents this problem by simply refusing to make any commitments around ASL-4 (for now).
You note this limitation when describing Anthropic’s RSP, but you describe it as “promising” while describing the PF as “unpromising.” In my view, this might be unfairly rewarding Anthropic for just not engaging with the hardest parts of the problem (or unfairly penalizing OpenAI for giving their best guess answers RE how to deal with the hardest parts of the problem).
We might also just disagree on how firm or useful the commitments in each document are – I walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks. I do think OpenAI’s thresholds are too high, but it’s likely that I’ll feel the same way about Anthropic’s thresholds. In particular, I don’t expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities (partially because the competitive pressures and race dynamics pushing them to do so will be intense). I don’t see evidence that either lab is taking these kinds of risks particularly seriously, has ideas about what safeguards would be considered sufficient, or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (that potentially involves coming up with major theoretical insights, as opposed to a “we will just solve it with empiricism” perspective) before we are confident that we can control such systems.
TLDR: I would remove the word “promising” and maybe characterize the RSP more like “an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4.”
In particular, I don’t expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities
I agree with this as stated, but don’t think that avoiding deploying such models is needed to mitigate risk.
I think various labs are to some extent in denial of this because massively deploying possibly misaligned systems sounds crazy (and is somewhat crazy), but I would prefer if various people realized this was likely the default outcome and prepared accordingly.
More strongly, I think most of the relevant bit of the safety-usefulness trade-off curve involves deploying such models. (With countermeasures.)
or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (that potentially involves coming up with major theoretical insights, as opposed to a “we will just solve it with empiricism” perspective) before we are confident that we can control such systems.
I think this is a real possibility, but unlikely to be necessary, depending on the risk target. E.g., I think you can deploy ASL-4 models with <5% risk without theoretical insights, and instead just via being very careful with various prosaic countermeasures (mostly control).
<1% risk probably requires stronger stuff, though it will depend on the architecture and various other random details.
(That said, I’m pretty sure that these labs aren’t making decisions based on carefully analyzing the situation and are instead just operating like “idk, human-level models don’t seem that bad, we’ll probably be able to figure it out, humans can solve most problems with empiricism on priors”. But this prior seems more right than overwhelming pessimism IMO.)
Also, I think you should seriously entertain the idea that just trying quite hard with various prosaic countermeasures might suffice for reasonably high levels of safety. And thus pushing on this could potentially be very leveraged relative to trying to hit a higher target.
I personally have a large amount of uncertainty around how useful prosaic techniques & control techniques will be. Here are a few statements I’m more confident in:
1. Ideally, AGI development would have much more oversight than we see in the status quo. Whether or not development or deployment activities keep national security risks below acceptable levels should be a question that governments are involved in answering. A sensible oversight regime would require evidence of positive safety or “affirmative safety”.
2. My biggest concern with the prosaic/control metastrategy is that I think race dynamics substantially decrease its usefulness. Even if ASL-4 systems are deployed internally in a safe way, we’re still not out of the acute risk period. And even if the leading lab (Lab A) is trustworthy/cautious, it will be worried that incautious Lab B is about to get to ASL-4 in 1-3 months. This will cause the leading lab to underinvest in control, feel like it doesn’t have much time to figure out how to use its ASL-4 system (assuming it can be controlled), and feel like it needs to get to ASL-5+ rather quickly.
It’s still plausible to me that perhaps this period of a few months is enough to pull off actions that get us out of the acute risk period (e.g., use the ASL-4 system to generate evidence that controlling more powerful systems would require years of dedicated effort and have Lab A devote all of their energy toward getting governments to intervene).
Given my understanding of the current leading labs, it’s more likely to me that they’ll underestimate the difficulties of bootstrapped alignment and assume that things are OK as long as empirical tests don’t show imminent evidence of danger. I don’t think this prior is reasonable in the context of developing existentially dangerous technologies, particularly technologies that are intended to be smarter than you. I think sensible risk management in such contexts should require a stronger theoretical/conceptual understanding of the systems one is designing.
(My guess is that you agree with some of these points and I agree with some points along the lines of “maybe prosaic/control techniques will just work, we aren’t 100% sure they’re not going to work”, but we’re mostly operating in different frames.)
(I also do like/respect a lot of the work you and Buck have done on control. I’m a bit worried that the control meme is overhyped, partially because it fits into the current interests of labs. Like, control seems like a great idea and a useful conceptual frame, but I haven’t yet seen a solid case for why we should expect specific control techniques to work once we get to ASL-4 or ASL-4.5 systems, or for what we plan to do with those systems to get us out of the acute risk period. Like, the early work on using GPT-3 to evaluate GPT-4 was interesting, but it feels like the assumption that human red-teamers are better at attacking than GPT-4 will go away – or at least be much less robust – once we get to ASL-4. But I’m also sympathetic to the idea that we’re at the early stages of control work, and I am genuinely interested in seeing what you, Buck, and others come up with as the control agenda progresses.)
I agree with 1 and think that race dynamics make the situation considerably worse when we only have access to prosaic approaches. (Though I don’t think this is the biggest issue with these approaches.)
I think I expect a period substantially longer than several months by default due to slower takeoff than this. (More like 2 years than 2 months.)
Insofar as the hope was for governments to step in at some point, I think the best and easiest point for them to step in is actually once AIs are already becoming very powerful:
Prior to this point, we don’t get substantial value from pausing, especially if we’re not pausing/dismantling all of semiconductor R&D globally.
Prior to this point, AI won’t be concerning enough for governments to take aggressive action.
At this point, additional time is extremely useful due to access to powerful AIs.
The main counterargument is that at this point more powerful AI will also look very attractive. So, it will seem too expensive to stop.
So, I don’t really see very compelling alternatives to push on at the margin as far as “metastrategy” (though I’m not sure I know exactly what you’re pointing at here). Pushing for bigger asks seems fine, but probably less leveraged.
I actually don’t think control is a great meme for the interests of labs that purely optimize for power: it’s a relatively legible ask, and potentially considerably more expensive than just “our model looks aligned because we red-teamed it”, which is more like the default IMO.
The same way “secure these model weights from China” isn’t a great meme for these interests IMO.
Sorry for brevity.
We just disagree. E.g. you “walked away with a much better understanding of how OpenAI plans to evaluate & handle risks than how Anthropic plans to handle & evaluate risks”; I felt like Anthropic was thinking about most stuff better.
I think Anthropic’s ASL-3 is reasonable and OpenAI’s thresholds and corresponding commitments are unreasonable. If the ASL-4 threshold ends up too high, or the commitments so weak that ASL-4 is meaningless, I agree Anthropic’s RSP would be at least as bad as OpenAI’s.
One thing I think is a big deal: Anthropic’s RSP treats internal deployment like external deployment; OpenAI’s has almost no protections for internal deployment.
I agree “an initial RSP that mostly spells out high-level reasoning, makes few hard commitments, and focuses on misuse while missing the all-important evals and safety practices for ASL-4” is also a fine characterization of Anthropic’s current RSP.
Quick edit: PF thresholds are too high; PF seems doomed / not on track. But RSPv1 is consistent with RSPv1.1 being great. At least Anthropic knows and says there’s a big hole. That’s not super relevant to evaluating labs’ current commitments but is very relevant to predicting.
I agree with ~all of your subpoints but it seems like we disagree in terms of the overall appraisal.
Thanks for explaining your overall reasoning though. Also big +1 that the internal deployment stuff is scary. I don’t think either lab has told me what protections they’re going to use for internally deploying dangerous (~ASL-4) systems, but the fact that Anthropic treats internal deployment like external deployment is a good sign. OpenAI at least acknowledges that internal deployment can be dangerous through its distinction between high risk (can be internally deployed) and critical risk (cannot be), but I agree that the thresholds are too high, particularly for model autonomy.