I was recently surprised to notice that Anthropic doesn’t seem to have a commitment to publish its safety research.[1] It has a huge safety team but has only published ~5 papers this year (plus an evals report). So probably it has lots of safety research it’s not publishing. E.g. my impression is that it’s not publishing its scalable oversight and especially adversarial robustness and inference-time misuse prevention research.
Not-publishing-safety-research is consistent with Anthropic prioritizing the win-the-race goal over the help-all-labs-improve-safety goal, insofar as the research is commercially advantageous. (Insofar as it’s not, not-publishing-safety-research is baffling.)
Maybe it would be better if Anthropic published ~all of its safety research, including product-y stuff like adversarial robustness such that others can copy its practices.
(I think this is not a priority for me to investigate but I’m interested in info and takes.)
[Edit: in some cases you can achieve most of the benefit with little downside except losing commercial advantage by sharing your research/techniques with other labs, nonpublicly.]
I failed to find good sources saying Anthropic publishes its safety research. I did find:
https://www.anthropic.com/research says “we . . . share what we learn [on safety].”
President Daniela Amodei said “we publish our safety research” on a podcast once.
Edit: cofounder Chris Olah said “we plan to share the work that we do on safety with the world, because we ultimately just want to help people build safe models, and don’t want to hoard safety knowledge” on a podcast once.
Cofounder Nick Joseph said this on a podcast recently (seems false but it’s just a podcast so that’s not so bad):
> we publish our safety research, so in some ways we’re making it as easy as we can for [other labs]. We’re like, “Here’s all the safety research we’ve done. Here’s as much detail as we can give about it. Please go reproduce it.”
Edit: also cofounder Chris Olah said “we consider all releases on a case by case basis, weighing expected safety benefit against capabilities/acceleratory risk.” But he seems to be saying that safety benefit > social cost is a necessary condition for publishing, not necessarily that the policy is to publish all such research.
One argument against publishing adversarial robustness research is that it might make your systems easier to attack.
One thing I’d really like labs to do is encourage their researchers to blog about their thoughts on the future, on alignment plans, etc.
Another related but distinct thing is have safety cases and have an anytime alignment plan and publish redacted versions of them.
Safety cases: Argument for why the current AI system isn’t going to cause a catastrophe. (Right now, this is very easy to do: ‘it’s too dumb’)
Anytime alignment plan: Detailed exploration of a hypothetical in which a system trained in the next year turns out to be AGI, with particular focus on what alignment techniques would be applied.
Or, as a more minimal ask, they could avoid implicitly discouraging researchers from sharing thoughts (via various chilling effects) and also avoid explicitly discouraging them.
I’d personally love to see similar plans from AI safety orgs, especially (big) funders.
We’re working on something along these lines. The most up-to-date published posts are just our control post and our Notes on control evaluations for safety cases, which are obviously incomplete.
I’m planning on posting a link to our best draft of a ready-to-go-ish plan as of 1 year ago, though it is quite out of date and incomplete.
I posted the link here.
Here is the doc, though note that it is very out of date. I don’t particularly want to recommend people read this doc, but it is possible that someone will find it valuable to read.
I don’t think funders are in a good position to do this. Also, funders are generally not “coherent”; they don’t have much top-down strategy. Individual grantmakers could write up thoughts.
Fwiw I am somewhat more sympathetic here to “the line between safety and capabilities is blurry, Anthropic has previously published some interpretability research that turned out to help someone else do some capabilities advances.”
I have heard Anthropic is bottlenecked on having people with enough context and discretion to evaluate various things that are “probably fine to publish” but “not obviously fine enough to ship without taking at least a chunk of some busy person’s time”. I think in this case I basically take the claim at face value.
I do want to generally keep pressuring them to somehow resolve that bottleneck because it seems very important, but I don’t know that I would disproportionately complain at them about this particular thing.
(I’d also not be surprised if, while the above claim is true, Anthropic is still suspiciously dragging its feet disproportionately in areas that feel like they make more of a competitive sacrifice, but I wouldn’t actively bet on it.)
Sounds fatebookable tho, so let’s use ye Olde Fatebook Chrome extension:
⚖ In 4 years, Ray will think it is pretty obviously clear that Anthropic was strategically avoiding posting alignment research for race-winning reasons. (Raymond Arnold: 17%)
(low probability because I expect it to still be murky/unclear)
I tentatively think this is a high-priority ask
Capabilities research isn’t a monolith and improving capabilities without increasing spooky black-box reasoning seems pretty fine
If you’re right, I think the upshot is (a) Anthropic should figure out whether to publish stuff rather than let it languish and/or (b) it would be better for lots of Anthropic safety researchers to instead do research that’s safe to share (rather than research that only has value if Anthropic wins the race)
I would expect that some amount of good safety research is of the form, “We tried several ways of persuading several leading AI models how to give accurate instructions for breeding antibiotic-resistant bacteria. Here are the ways that succeeded, here are some first-level workarounds, here’s how we beat those workarounds...”: in other words, stuff that would be dangerous to publish. In the most extreme cases, a mere title (“Telling the AI it’s writing a play defeats all existing safety RLHF” or “Claude + Coverity finds zero-day RCE exploits in many codebases”) could be dangerous.
That said, some large amount should be publishable, and 5 papers does seem low.
Though maybe they’re not making an effort to distinguish what’s safe to publish from what’s not, and erring towards assuming the latter? (Maybe someone set a policy of “Before publishing any safety research, you have to get Important Person X to look through it and/or go through some big process to ensure publishing it is safe”, and the individual researchers are consistently choosing “Meh, I have other work to do, I won’t bother with that” and therefore not publishing?)
Seems like evidence towards the claim here: Open source AI has been vital for alignment. My rough impression is also that the other big labs’ output has largely been similarly disappointing in terms of public research output on safety.
I also wish to see more safety papers. My guess, from my experience, is that it might also be that really good quality research takes time, and the papers from them so far seem pretty good. Though I don’t know whether they are actively withholding things on purpose, which could also be true; any insiders/sources for this guess?
My impression from skimming posts here is that people seem to be continually surprised by Anthropic, while those modeling it as basically “Pepsi to OpenAI’s Coke” wouldn’t be.
Meta seems to be the only group doing something meaningfully different from the others.
There’s a selection effect in what gets posted about. Maybe someone should write the “ways Anthropic is better than others” list to combat this.
Edit: there’s also a selection effect in what you see, since negative stuff gets more upvotes…
I’d say it’s slightly more like “Labor vs Conservatives”, where I’ve seen politicians deflect criticisms of their behavior by arguing that the other side is worse, instead of evaluating their policies or behavior by objective standards (where both sides can typically score exceedingly low).
Is this where we think our pressuring-Anthropic points are best spent?
This shortform is relevant to e.g. understanding what’s going on and considerations on the value of working on safety at Anthropic, not just pressuring Anthropic.
@Neel Nanda
Yeah, fair point, disagreement retracted
I think if someone has a 30-minute meeting with some highly influential and very busy person at Anthropic, it makes sense for them to have thought in advance about the most important things to ask & curate the agenda appropriately.
But I don’t think LW users should be thinking much about “pressuring-Anthropic points”. I see LW primarily as a platform for discourse (as opposed to a direct lobbying channel to labs), and I think it would be bad for the discourse if people felt like they had to censor questions/concerns about labs on LW unless it met some sort of “is this one of the most important things to be pushing for” bar.
I agree! I hope people regularly ask questions about Anthropic that they feel curious about, as well as questions that seem important to them :)
I think it’s bad for discourse for us to pretend that discourse doesn’t have impacts on others in a democratic society. And I think the meta-censoring of discourse by claiming that certain questions might have implicit censorship impacts is one of the most anti-rationality trends in the rationalist sphere.
I recognize most users of this platform will likely disagree, and predict negative agreement-karma on this post.
I think I agree with this in principle. Possible that the crux between us is more like “what is the role of LessWrong.”
For instance, if Bob wrote a NYT article titled “Anthropic is not publishing its safety research”, I would be like “meh, this doesn’t seem like a particularly useful or high-priority thing to be bringing to everyone’s attention– there are like at least 10+ topics I would’ve much rather Bob spent his points on.”
But LW generally isn’t a place where you’re going to get EG thousands of readers or have a huge effect on general discourse (with the exception of a few things that go viral or AIS-viral).
So I’m not particularly worried about LW discussions having big second-order effects on democratic society. Whereas LW can be a space for people to have a relatively low bar for raising questions, being curious, trying to understand the world, offering criticism/praise without thinking much about how they want to be spending “points”, etc.
Of course it has impacts on others in society! It does so by finding out the truth, investigating, and finding strong arguments and evidence. The overall effect of a lot of high quality, curious, public investigation is to greatly improve others’ maps of the world in surprising ways and help people make better decisions, and this is true even if no individual thread of questioning is primarily optimized to help people make better decisions.
Re censoriousness: I think your question of how best to pressure an unethical company to be less unethical is a fine question, but to imply it’s the only good question (which I read into your comment, perhaps inaccurately) goes against the spirit of intellectual discourse.
It is genuinely a sign that we are all very bad at predicting others’ minds that it didn’t occur to me that saying, effectively, “OP asked for ‘takes’; here’s a take on why I think this is pragmatically a bad idea” would also be read as saying “and therefore there is no other good question here.” That’s, as the meme goes, a whole different sentence.
Well, but you didn’t give a take on why it’s pragmatically a bad idea. If you’d written a comment with a pointer to something else worth pressuring them on, or given a reason why publishing all the safety research doesn’t help very much / has hidden costs, I would’ve thought it a fine contribution to the discussion. Without that, the comment read to me as dismissive of the idea of exploring this question.
Yes, I would agree that if I expected a short take to have this degree of attention, I would probably have written a longer comment.
Well, no, I take that back. I probably wouldn’t have written anything at all. To some, that might be a feature; to me, that’s a bug.
I disagree. I think the standard of “Am I contributing anything of substance to the conversation, such as a new argument or new information that someone can engage with?” is a pretty good standard for most/all comments to hold themselves to, regardless of the amount of engagement that is expected.
[Edit: Just FWIW, I have not voted on any of your comments in this thread.]
I think, having been raised in a series of very debate- and seminar-centric discussion cultures, that a quick-hit question like that is indeed contributing something of substance. I think it’s fair that folks disagree, and I think it’s also fair that people signal (e.g., with karma) that they think “hey man, let’s go a little less Socratic in our inquiry mode here.”
But, put in more rationalist-centric terms, sometimes the most useful Bayesian update you can offer someone else is, “I do not think everyone is having the same reaction to your argument that you expected.” (Also true for others doing that to me!)
(Edit to add two words to avoid ambiguity in meaning of my last sentence)
Ok, then to ask it again in your preferred question format: is this where we think our getting-potential-employees-of-Anthropic-to-consider-the-value-of-working-on-safety-at-Anthropic points are best spent?