Thanks for writing this, Zac and team; I personally appreciate more transparency about AGI labs’ safety plans!
Something I find myself confused about is how Anthropic should draw the line between what to publish and what to keep private. This post seems to draw the line along “The Three Types of AI Research at Anthropic” (you generally don’t publish capabilities research but do publish the rest), but I wonder if there’s a case for being more nuanced than that.
To get more concrete, the post brings up how “the AI safety community often debates whether the development of RLHF – which also generates economic value – ‘really’ was safety research” and says that Anthropic thinks it was. However, the post also states that Anthropic “decided to prioritize using [Claude] for safety research rather than public deployments” in the spring of 2022. To me, this feels slightly dissonant: insofar as much of the safety and capabilities benefit from Claude came from RLHF/RLAIF (and perhaps more data, though that seems less infohazardous), it looks like Anthropic decided not to publish “Alignment Capabilities” research for (IMO justified) fear of negative externalities. Perhaps the boundary around what should generally not be published should then also extend to some Alignment Capabilities research, especially research like RLHF that might have larger capabilities externalities.
I’ve also been thinking that more interpretability research should maybe be kept private. As a recent concrete example, Stanford’s Hazy Research lab just published Hyena, a convolutional architecture that’s meant to rival transformers by scaling to much longer context lengths while taking significantly fewer FLOPs to train. This is clearly public research in the “AI Capabilities” camp, but they cite several results from Transformer Circuits to motivate the theory and design choices behind their new architecture, and say “This work wouldn’t have been possible without inspiring progress on … mechanistic interpretability.” That’s all to say that some “Alignment Science” research might also be useful as ML theory research and thereby motivate advances in AI capabilities.
I’m curious whether you have thoughts on these cases, and how Anthropic might draw a more cautious line around what it chooses to publish.
Unfortunately, I don’t think a detailed discussion of what we regard as safe to publish would be responsible, but I can share how we operate at a procedural level. We don’t consider any research area to be blanket safe to publish. Instead, we consider all releases on a case-by-case basis, weighing the expected safety benefit against capabilities/acceleratory risk. For difficult cases, we have a formal infohazard review procedure.
Thanks for the response, Chris; that makes sense, and I’m glad to read that you have a formal infohazard procedure!