Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone’s attention. Oops.
Asks for Anthropic
Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it’s most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my “help labs improve” project than my “hold labs accountable” crusade.
Numbering is just for ease of reference.
1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations much stronger. I’m not sure where the lowest-hanging mitigation-fruit is, except that it includes control.
3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they’re concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn’t seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It’s not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)
4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It’s hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that’s more pro-real-regulation but as far as I can tell it’s not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]
5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that’s unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)
5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.
7. Anthropic takes credit for its Long-Term Benefit Trust but Anthropic hasn’t published enough to show that it’s effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like “if major changes happen to the LTBT, we’ll quickly tell the public.”
8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper.)
9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple “no dangerous capabilities” safety case doesn’t work anymore, and publish them (or maybe just share with external auditors).
9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a “yellow line” and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the “every 3 months” trigger for RSP evals is active. I haven’t tried hard to get to the bottom of these.
Minor stuff:
10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.
11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)
12. Anthropic should confirm that its old policy “don’t meaningfully advance the frontier with a public launch” has been replaced by the RSP, if that’s true, and otherwise clarify its policy.
13. (Done!) Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn’t; it is the only frontier lab without a bug bounty program (although others don’t necessarily comply with the commitment, e.g. OpenAI’s excludes model issues). It should do this or talk about its plans.
14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]
15. [Maybe Anthropic (like all labs) should better boost external safety research, especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don’t really understand why.]
16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they’re not speaking for Anthropic and (2) don’t share secrets.]
17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]
18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I’m inside-view agnostic on this.]
19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn’t share enough info for us to evaluate its elicitation from the outside).]
I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds two weeks ago.
You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)
I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing. I shared Zach’s doc with some colleagues, but won’t try for a point-by-point response. Two high-level responses:
First, at a meta level, you say:
[Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they’re not speaking for Anthropic and (2) don’t share secrets.]
I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!). However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality. Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.
I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.
Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making. Some quick examples:
Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best practices we could implement don’t exist yet. Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement); we’re working to make this possible but the thing you’re asking for just doesn’t exist yet. Some implementation questions that “let auditors audit our models” glosses over:
If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?
What kind of pre-deployment model access would you provide? If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys? (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)
How do you decide who gets to say what about the testing? What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?
I strongly support Anthropic’s nondisclosure of information about pretraining. I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.
There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about. Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.
So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment. For many more of the other asks above, I just disagree with implicit or explicit claims about the facts in question. Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.
I just want to note that people who’ve never worked in a true high-confidentiality environment (professional services, national defense, professional services for national defense) probably radically underestimate the level of brain damage and friction that Zac is describing here:
“Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.”
Confidentiality is really, really hard to maintain. Doing so while also engaging the public is terrifying. I really admire the frontier labs folks who try to engage publicly despite that quite severe constraint, and really worry a lot as a policy guy about the incentives we’re creating to make that even less likely in the future.
I’m sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don’t know what the decision process inside of Anthropic will look like if an evaluation indicates something like “yeah, it’s excellent at inserting backdoors, and also, the vibe is that it’s overall pretty capable.” And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it’ll make these decisions (imo).
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.” They of course cannot tell me what the “behind the scenes” information is, so I have no way of knowing whether that’s true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of “we’re doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are” is pretty sketchy.
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.”
I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that I think has pushed me (as well as many others in the community) to lose a lot of faith in corporate governance (especially of the “look, we can’t make any tangible commitments but you should just trust us to do what’s right” variety) and instead look to governments to get things under control.
I don’t think Anthropic is solely to blame for this trend, of course, but I think Anthropic has performed less well on comms/policy than I [and IMO many others] would’ve predicted if you had asked me [or us] in 2022.
@Zac Hatfield-Dodds do you have any thoughts on official comms from Anthropic and Anthropic’s policy team?
For example, I’m curious if you have thoughts on this anecdote– Jack Clark was asked an open-ended question by Senator Cory Booker and he told policymakers that his top policy priority was getting the government to deploy AI successfully. There was no mention of AGI, existential risks, misalignment risks, or anything along those lines, even though it would’ve been (IMO) entirely appropriate for him to bring such concerns up in response to such an open-ended question.
I was left thinking that either Jack does not care much about misalignment risks or he was not being particularly honest/transparent with policymakers. Both of these raise some concerns for me.
(Noting that I hold Anthropic’s comms and policy teams to higher standards than individual employees. I don’t have particularly strong takes on what Anthropic employees should be doing in their personal capacity– like in general I’m pretty in favor of transparency, but I get it, it’s hard and there’s a lot that you have to do. Whereas the comms and policy teams are explicitly hired/paid/empowered to do comms and policy, so I feel like it’s fair to have higher expectations of them.)
very powerful systems [] may have national security uses or misuses. And for that I think we need to come up with tests that make sure that we don’t put technologies into the market which could—unwittingly to us—advantage someone or allow some nonstate actor to commit something harmful. Beyond that I think we can mostly rely on existing regulations and law and existing testing procedures . . . and we don’t need to create some entirely new infrastructure.
At Anthropic we discover that the more ways we find to use this technology the more ways we find it could help us. And you also need a testing and measurement regime that closely looks at whether the technology is working—and if it’s not how you fix it from a technological level, and if it continues to not work whether you need some additional regulation—but . . . I think the greatest risk is us [viz. America] not using it [viz. AI]. Private industry is making itself faster and smarter by experimenting with this technology . . . and I think if we fail to do that at the level of the nation, some other entrepreneurial nation will succeed here.
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend, due to making little sense. Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet. I think a plan this shoddy obviously endangers life on Earth, so it seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
Meta aside: normally this wouldn’t seem worth digging into but as a moderator/site-culture-guardian, I feel compelled to justify my negative react on the disagree votes.
I’m actually not entirely sure what downvote-reacting is for. Habryka has said the intent is to override inappropriate uses of reacts. We haven’t actually really had a sit-down-and-argue-this-out on the moderator team. I’m pretty sure we haven’t told anyone, or tried to enforce, that “override inappropriate uses of reacts” is the intended use.
I think Adam’s line:
Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet.
Is psychologizing and summarizing Anthropic unfairly. So I wouldn’t agree vote with it. I do think it has some kind of grain of truth to it (me believing this is also kind of “doubting the experience of Anthropic employees” which is also group-epistemologically dicey IMO, but, feels kinda important enough to do in this case). The claim isn’t true… but I also don’t belief report that it’s not true.
I initially downvoted the Disagree when it was just Noosphere, since I didn’t think Noosphere was really in a position to have an opinion and if he was the only reactor it felt more like noise. A few others who are more positioned to know relevant stuff have since added their own disagree reacts. I… feel sort of justified leaving the anti-react up, with an overall indicator of “a bunch of people disagree with this, but the weight of that disagreement is slightly reduced.” (I think I’d remove the anti-react if the disagree count went much lower than it is now.)
I don’t know whether I particularly endorse any of this, but wanted people to have a bit more model of what one site-admin was thinking here.
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
I don’t really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model.
I don’t really think any of that affects the difficulty of public communication
The basic point would be that it’s hard to write publicly about how you are taking responsible steps that grapple directly with the real issues… if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl’s characterization of Anthropic’s agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to “talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic” and the counterargument is that doing so would be nice in an ideal world, except it’s very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, “If you want to convince people of something, it’s much easier if it’s true.”
As I see it, not being able to bring up Anthropic’s work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company’s safety team).
Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.
I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?
Look, I don’t think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their ‘Responsible Scaling Policies’, they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That’s their safety policies, not information about their training policies that they want to keep secret so that they can make money.
I believe the Anthropic leadership cares very little about the public’s ability to have arguments and evidence and access to information about Anthropic’s behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there’s a potential major embarrassment. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.
No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other’s company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment.) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.
Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by-default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodeis quit purportedly due to having damning criticisms of OpenAI in the present and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.
To quote from Zac’s above analogy explaining how difficult his situation at Anthropic is.
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public
The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.
Yes, yes, Zac’s situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can’t help but rankle at the implication that the primary reason he and others don’t talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it’s stressful to think you might have gone over it, and it’s stressful to suddenly find yourself unable to engage well with people’s criticisms because you hit a confidential crux. But as to the fault analysis for Zac’s particularly difficult position? In my opinion the blame is surely first with the Anthropic leadership who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.
I don’t think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Scholl said that
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend… [I]t seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don’t think Adam Scholl’s assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.
Ben Pace has said that perhaps he doesn’t disagree with you in particular about this, but I sure think I do.[1]
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline.
I don’t see how the first half of this could be correct, and while the second half could be true, it doesn’t seem to me to offer meaningful support for the first half either (instead, it seems rather… off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, in a spot where you have something meaningful and responsible, etc, that you and your company have done to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete that you can back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it is.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place.
[1] Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven’t addressed, and which still seem to me to be entirely correct).
Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.
I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.
I think you go too far in saying that the stress is orthogonal to whether you have a good case to make; I don’t see how you can really think that it’s not a top-3 factor in how much stress you’re experiencing. As a pretty simple hypothetical, if you’re responding to a public scandal about whether you stole money, you’re gonna have a way more stressful time if you did steal money than if you didn’t (in substantial part because you’d be able to show the books and prove it).
Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac’s comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn’t change the fact that there is stress.
Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.
For what it’s worth, I endorse Anthropic’s confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist’s curse and entangled truths mean that confidential-by-default is the only viable policy.
That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable versions of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.
Confidentiality-by-default does not mean you get to abdicate responsibility for answering questions to the people whose lives you are risking about how-and-why you are making decisions, it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality by-default was required, I think it would be trying significantly harder to overcome this barrier.
My general principle is that if you are wielding a lot of power over people that they didn’t otherwise legitimately grant you (in this case building a potential doomsday device), you owe them to be auditable. You are supposed to show up and answer their questions directly – not “thank you so much for the questions, in six months I will publish a related blogpost on this topic” but more like “with the public info available to me, here’s my best guess answer to your specific question today”. Especially so if you are doing something the people you have power over perceive as norm-violating, and even more-so when you are keeping the answers to some very important questions secret from them.
Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by “doing periodic vibe checks”
This obvious straw-man makes your argument easy to dismiss.
However I think the point is basically correct. Anthropic’s strategy to reduce x-risk also includes lobbying against pre-harm enforcement of liability for AI companies in SB 1047.
How is it a straw-man? How is the plan meaningfully different from that?
Imagine a group of people has already gathered a substantial amount of uranium, is already refining it, is already selling power generated by their pile of uranium, etc. And doing so right near and upwind of a major city. And they’re shoveling more and more uranium onto the pile, basically as fast as they can. And when you ask them why they think this is going to turn out well, they’re like “well, we trust our leadership, and you know we have various documents, and we’re hiring for people to ‘Develop and write comprehensive safety cases that demonstrate the effectiveness of our safety measures in mitigating risks from huge piles of uranium’, and we have various detectors such as an EM detector which we will privately check and then see how we feel”. And then the people in the city are like “Hey wait, why do you think this isn’t going to cause a huge disaster? Sure seems like it’s going to by any reasonable understanding of what’s going on”. And the response is “well we’ve thought very hard about it and yes there are risks but it’s fine and we are working on safety cases”. But… there’s something basic missing, which is like, an explanation of what it could even look like to safely have a huge pile of superhot uranium. (Also in this fantasy world no one has ever done so and can’t explain how it would work.)
In the AI case, there’s lots of inaction risk: if Anthropic doesn’t make powerful AI, someone less safety-focused will.
It’s reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn’t exist, I would want Anthropic to slow down.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
@Zach Stein-Perlman , you’re missing the point. They don’t have a plan. Here’s the thread (paraphrased in my words):
Zach: [asks, for Anthropic]
Zac: … I do talk about Anthropic’s safety plan and orientation, but it’s hard because of confidentiality and because many responses here are hostile. …
Adam: Actually I think it’s hard because Anthropic doesn’t have a real plan.
Joseph: That’s a straw-man. [implying they do have a real plan?]
Tsvi: No it’s not a straw-man, they don’t have a real plan.
Zach: Something must be done. Anthropic’s plan is something.
Tsvi: They don’t have a real plan.
I agree Anthropic doesn’t have a “real plan” in your sense, and narrow disagreement with Zac on that is fine.
I just think that’s not a big deal and is missing some broader point (maybe that’s a motte and Anthropic is doing something bad—vibes from Adam’s comment—is a bailey).
[Edit: “Something must be done. Anthropic’s plan is something.” is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]
[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I’m ok with one more reply to this.]
various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake
Look, if Anthropic was honestly and publically saying
We do not have a credible plan for how to make AGI, and we have no credible reason to think we can come up with a plan later. Neither does anyone else. But—on the off chance there’s something that could be done with a nascent AGI that makes a non-omnicide outcome marginally more likely, if the nascent AGI is created and observed by people who are at least thinking about the problem—on that off chance, we’re going to keep up with the other leading labs. But again, given that no one has a credible plan or a credible credible-plan plan, better would be if everyone including us stopped. Please stop this industry.
If they were saying and doing that, then I would still raise my eyebrows a lot and wouldn’t really trust it. But at least it would be plausibly consistent with doing good.
But that doesn’t sound like either what they’re saying or doing. IIUC they lobbied to remove protection for AI capabilities whistleblowers from SB 1047! That happened! Wow! And it seems like Zac feels he has to pretend to have a credible credible-plan plan.
Hm. I imagine you don’t want to drill down on this, but just to state for the record, this exchange seems like something weird is happening in the discourse. Like, people are having different senses of “the point” and “the vibe” and such, and so the discourse has already broken down. (Not that this is some big revelation.) Like, there’s the Great Stonewall of the AGI makers. And then Zac is crossing through the gates of the Great Stonewall to come and talk to the AGI please-don’t-makers. But then Zac is like (putting words in his mouth) “there’s no Great Stonewall, or like, it’s not there in order to stonewall you in order to pretend that we have a safe AGI plan or to muddy the waters about whether or not we should have one, it’s there because something something trade secrets and exfohazards, and actually you’re making it difficult to talk by making me work harder to pretend that we have a safe AGI plan or intentions that should promissorily satisfy the need for one”.
Seems like most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist. This is an underspecified claim, and given certain fully-specified instances of it, I’d agree.
But this belief leads to the following reasoning: (1) if we don’t eat all this free energy in the form of researchers+compute+funding, someone else will; (2) other people are clearly less trustworthy compared to us (Anthropic, in this hypothetical); (3) let’s do whatever it takes to maintain our lead and prevent other labs from gaining power, while using whatever resources we have to also do alignment research, preferably in ways that also help us maintain or strengthen our lead in this race.
most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist.
I don’t credit that they believe that. And, I don’t credit that you believe that they believe that. What did they do, to truly test their belief—such that it could have been changed? For most of them the answer is “basically nothing”. Such a “belief” is not a belief (though it may be an investment, if that’s what you mean). What did you do to truly test that they truly tested their belief? If nothing, then yours isn’t a belief either (though it may be an investment). If yours is an investment in a behavioral stance, that investment may or may not be advisable, but it would DEFINITELY be inadvisable to pretend to yourself that yours is a belief.
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
I’d be very interested to have references to occasions of people in the AI-safety-adjacent community treating Anthropic employees as liars because of things those people misremembered or misinterpreted. (My guess is that you aren’t interested in litigating these cases; I care about it for internal bookkeeping and so am happy to receive examples e.g. via DM rather than as a public comment.)
Not Zach Hatfield-Dodds, but people claimed that Anthropic had a commitment to not advance the frontier of capabilities, but as it turns out people misinterpreted communications, and no such commitment actually happened.
Not sure I’d go as far as saying that they treated Anthropic as liars, but this seems to me a central example of Zach Hatfield-Dodds’s concerns.
Contrary to the above, for the record, here is a link to a thread where a major Anthropic investor (Moskovitz) and the researcher who coined the term “The Scaling Hypothesis” (Gwern) both report that the Anthropic CEO told them in private that this is what Anthropic would do, in accordance with what many others also report hearing privately. (There is disagreement about whether this constituted a commitment.)
I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms.
I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB-1047 letter they wrote states:
Broad pre-harm enforcement. The current bill requires AI companies to design and implement SSPs that meet certain standards – for example they must include testing sufficient to provide a “reasonable assurance” that the AI system will not cause a catastrophe, and must “consider” yet-to-be-written guidance from state agencies. To enforce these standards, the state can sue AI companies for large penalties, even if no actual harm has occurred. While this approach might make sense in a more mature industry where best practices are known, AI safety is a nascent field where best practices are the subject of original scientific research. For example, despite a substantial effort from leaders in our company, including our CEO, to draft and refine Anthropic’s RSP over a number of months, applying it to our first product launch uncovered many ambiguities. Our RSP was also the first such policy in the industry, and it is less than a year old. What is needed in such a new environment is iteration and experimentation, not prescriptive enforcement. There is a substantial risk that the bill and state agencies will simply be wrong about what is actually effective in preventing catastrophic risk, leading to ineffective and/or burdensome compliance requirements.
Liability does not address the central threat model of AI takeover, for which pre-harm mitigations are necessary due to the irreversible nature of the harm. I think that this letter should have acknowledged that explicitly, and that not doing so is misleading. I feel that Anthropic is trying to play a game of courting political favor by not being very straightforward about its beliefs around AGI, and that this is bad.
To be clear, I think it is reasonable that they argue that the FMD and government in general will be bad at implementing safety guidelines while still thinking that AGI will soon be transformative. I just really think they should be much clearer about the latter belief.
Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:
I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best on very short timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pause for at least long enough to allow other labs into the lead. This is especially true in short timelines worlds because the gap between labs is smaller.
I think they have much better AGI safety culture than other labs (though still far from perfect), which will probably result in better adherence to voluntary commitments.
On the other hand, they haven’t been very transparent, and we haven’t seen their ASL-4 commitments. So these commitments might amount to nothing, or Anthropic might just walk them back at a critical juncture.
2-5% is still wildly high in an absolute sense! However, risk from other labs seems even higher to me, and I think that Anthropic could reduce this risk by advocating for reasonable regulations (e.g. transparency into frontier AI projects so no one can build ASI without the government noticing).
I think you probably under-rate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn’t be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate chaff, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don’t know) they’re value aligned with the overall direction of the company at large. Probably also Nvidia rate-limiting the purchases of large labs to keep competition among the AI companies.
All of this is also compounded by smart models leading to better data curation and RLAIF (given quality researchers & lack of cruft) leading to even better models (this being the big reason I think Llama had to be so big to be SOTA, and Gemini not even SOTA), which of course leads to money in the future even if they have no money now.
FYI I believe the correct language is “directly causes an existential catastrophe”. “Existential risk” is a measure of the probability of an existential catastrophe, but is not itself an event.
I want to avoid this being negative-comms for Anthropic. I’m generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)
Edit, 2.5 days later: I think this list is fine but sharing/publishing it was a poor use of everyone’s attention. Oops.
Asks for Anthropic
Note: I think Anthropic is the best frontier AI lab on safety. I wrote up asks for Anthropic because it’s most likely to listen to me. A list of asks for any other lab would include most of these things plus lots more. This list was originally supposed to be more part of my help labs improve project than my hold labs accountable crusade.
Numbering is just for ease of reference.
1. RSP: Anthropic should strengthen/clarify the ASL-3 mitigations, or define ASL-4 such that the threshold is not much above ASL-3 but the mitigations much stronger. I’m not sure where the lowest-hanging mitigation-fruit is, except that it includes control.
2. Control: Anthropic (like all labs) should use control mitigations and control evaluations to reduce risks from AIs scheming, including escape during internal deployment.
3. External model auditing for risk assessment: Anthropic (like all labs) should let auditors like METR, UK AISI, and US AISI audit its models if they want to — Anthropic should offer them good access pre-deployment and let them publish their findings or flag if they’re concerned. (Anthropic shared some access with UK AISI before deploying Claude 3.5 Sonnet, but it doesn’t seem to have been deep.) (Anthropic has said that sharing with external auditors is hard or costly. It’s not clear why, for just sharing normal API access + helpful-only access + control over inference-time safety features, without high-touch support.)
4. Policy advocacy (this is murky, and maybe driven by disagreements-on-the-merits and thus intractable): Anthropic (like all labs) should stop advocating against good policy and ideally should advocate for good policy. Maybe it should also be more transparent about policy advocacy. [It’s hard to make precise what I believe is optimal and what I believe is unreasonable, but at the least I think Dario is clearly too bullish on self-governance, and Jack Clark is clearly too anti-regulation, and all of this would be OK if it was balanced out by some public statements or policy advocacy that’s more pro-real-regulation but as far as I can tell it’s not. Not justified here but I predict almost all of my friends would agree if they looked into it for an hour.]
5a. Security: Anthropic (like all labs) should ideally implement RAND SL4 for model weights and code when reaching ASL-3. I think that’s unrealistic, but lesser security improvements would also be good. (Anthropic said in May 2024 that 8% of staff work in security-related areas. I think this is pretty good. I think on current margins Anthropic could still turn money into better security reasonably effectively, and should do so.)
5b. Anthropic (like all labs) should be more transparent about the quality of its security. Anthropic should publish the private reports on https://trust.anthropic.com/, redacted as appropriate. It should commit to publish information on future security incidents and should publish information on all security incidents from the last year or two.
6. Anthropic (like all labs) should facilitate employees publicly flagging false statements or violated processes.
7. Anthropic takes credit for its Long-Term Benefit Trust but Anthropic hasn’t published enough to show that it’s effective. Anthropic should publish the Trust Agreement, clarify the ambiguities discussed in the linked posts, and make accountability-y commitments like “if major changes happen to the LTBT we’ll quickly tell the public.”
8. Anthropic should avoid exaggerating interpretability research or causing observers to have excessively optimistic impressions of Anthropic’s interpretability research. (See e.g. Stephen Casper.)
9. Maybe Anthropic (like all labs) should make safety cases for its models or deployments, especially after the simple “no dangerous capabilities” safety case doesn’t work anymore, and publish them (or maybe just share with external auditors).
9.5. Anthropic should clarify a few confusing RSP things, including (a) the deal with substantially raising the ARA bar for ASL-3, and moreover deciding the old threshold is a “yellow line” and not creating a new threshold, and doing so without officially updating the RSP (and quietly); and (b) when the “every 3 months” trigger for RSP evals is active. I haven’t tried hard to get to the bottom of these.
Minor stuff:
10. Anthropic (like all labs) should fully release everyone from nondisparagement agreements and not use nondisparagement agreements in the future.
11. Anthropic should commit to publish updates on risk assessment practices and results, including low-level details, perhaps for all major model releases and at least quarterly or so. (Anthropic says its Responsible Scaling Officer does this internally. Anthropic publishes model cards and has published one Responsible Scaling Policy Evaluations Report.)
12. Anthropic should confirm that its old policy of “don’t meaningfully advance the frontier with a public launch” has been replaced by the RSP, if that’s true, and otherwise clarify Anthropic’s policy.
Done!
13. Anthropic committed to establish a bug bounty program (for model issues) or similar, over a year ago. Anthropic hasn’t; it is the only frontier lab without a bug bounty program (although others don’t necessarily comply with the commitment, e.g. OpenAI’s excludes model issues). It should do this or talk about its plans.
14. [Anthropic should clarify its security commitments; I expect it will in its forthcoming RSP update.]
15. [Maybe Anthropic (like all labs) should better boost external safety research, especially by giving more external researchers deep model access (e.g. fine-tuning or helpful-only). I hear this might be costly but I don’t really understand why.]
16. [Probably Anthropic (like all labs) should encourage staff members to talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic, as long as they (1) clarify that they’re not speaking for Anthropic and (2) don’t share secrets.]
17. [Maybe Anthropic (like all labs) should talk about its views on AI progress and risk. At the least, probably Anthropic (like all labs) should clearly describe a worst-case plausible outcome from AI and state how likely the lab considers it.]
18. [Most of my peers say: Anthropic (like all labs) should publish info like training compute and #parameters for each model. I’m inside-view agnostic on this.]
19. [Maybe Anthropic could cheaply improve its model evals for dangerous capabilities or share more information about them. Specific asks/recommendations TBD. As Anthropic notes, its CBRN eval is kinda bad and its elicitation is kinda bad (and it doesn’t share enough info for us to evaluate its elicitation from the outside).]
I shared this list—except 9.5 and 19, which are new—with @Zac Hatfield-Dodds two weeks ago.
You are encouraged to comment with other asks for Anthropic. (Or things Anthropic does very well, if you feel so moved.)
I think both Zach and I care about labs doing good things on safety, communicating that clearly, and helping people understand both what labs are doing and the range of views on what they should be doing. I shared Zach’s doc with some colleagues, but won’t try for a point-by-point response. Two high-level responses:
First, at a meta level, you say:
I do feel welcome to talk about my views on this basis, and often do so with friends and family, at public events, and sometimes even in writing on the internet (hi!). However, it takes way more effort than you might think to avoid inaccurate or misleading statements while also maintaining confidentiality. Public writing tends to be higher-stakes due to the much larger audience and durability, so I routinely run comments past several colleagues before posting, and often redraft in response (including these comments and this very point!).
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public and for all that you get pretty regular hostility anyway because of… well, there are plenty of misunderstandings and caricatures to go around.
I run intro-to-AGI-safety courses for Anthropic employees (based on AGI-SF), and we draw a clear distinction between public and confidential resources specifically so that people can go talk to family and friends and anyone else they wish about the public information we cover.
Second, and more concretely, many of these asks are unimplementable for various reasons, and often gesture in a direction without giving reasons to think that there’s a better tradeoff available than we’re already making. Some quick examples:
Both AI Control and safety cases are research areas less than a year old; we’re investigating them and e.g. hiring safety-case specialists, but best-practices we could implement don’t exist yet. Similarly, there simply aren’t any auditors or audit standards for AI safety yet (see e.g. METR’s statement); we’re working to make this possible but the thing you’re asking for just doesn’t exist yet. Some implementation questions that “let auditors audit our models” glosses over:
If you have dozens of organizations asking to be auditors, and none of them are officially auditors yet, what criteria do you use to decide who you collaborate with?
What kind of pre-deployment model access would you provide? If it’s helpful-only or other nonpublic access, do they meet our security bar to avoid leaking privileged API keys? (We’ve already seen unauthorized sharing or compromise lead to serious abuse.)
How do you decide who gets to say what about the testing? What if they have very different priorities than you and think that a different level of risk or a different kind of harm is unacceptable?
I strongly support Anthropic’s nondisclosure of information about pretraining. I have never seen a compelling argument that publishing this kind of information is, on net, beneficial for safety.
There are many cases where I’d be happy if Anthropic shared more about what we’re doing and what we’re thinking about. Some of the things you’re asking about I think we’ve already said, e.g. for [7] LTBT changes would require an RSP update, and for [17] our RSP requires us to “enforce an acceptable use policy [against …] using the model to generate content that could cause severe risks to the continued existence of humankind”.
So, saying “do more X” just isn’t that useful; we’ve generally thought about it and concluded that the current amount of X is our best available tradeoff at the moment. For many more of the other asks above, I just disagree with implicit or explicit claims about the facts in question. Even for the communication issues where I’d celebrate us sharing more—and for some I expect we will—doing so is yet another demand on heavily loaded people and teams, and it can take longer than we’d like to find the time.
I just want to note that people who’ve never worked in a true high-confidentiality environment (professional services, national defense, professional services for national defense) probably radically underestimate the level of brain damage and friction that Zac is describing here:
Confidentiality is really, really hard to maintain. Doing so while also engaging the public is terrifying. I really admire the frontier labs folks who try to engage publicly despite that quite severe constraint, and really worry a lot as a policy guy about the incentives we’re creating to make that even less likely in the future.
I’m sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don’t know what the decision process inside of Anthropic will look like if an evaluation indicates something like “yeah, it’s excellent at inserting backdoors, and also, the vibe is that it’s overall pretty capable.” And given that Anthropic is making these decisions on behalf of everyone, Anthropic (like all labs) really owes it to humanity to be more upfront about how it’ll make these decisions (imo).
I will also note what I feel is a somewhat concerning trend. It’s happened many times now that I’ve critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: “this wouldn’t seem so bad if you knew what was happening behind the scenes.” They of course cannot tell me what the “behind the scenes” information is, so I have no way of knowing whether that’s true. And, maybe I would in fact update positively about Anthropic if I knew. But I do think the shape of “we’re doing something which might be incredibly dangerous, many external bits of evidence point to us not taking the safety of this endeavor seriously, but actually you should think we are based on me telling you we are” is pretty sketchy.
I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that has pushed me (as well as many others in the community) to lose a lot of faith in corporate governance (especially of the “look, we can’t make any tangible commitments but you should just trust us to do what’s right” variety) and instead look to governments to get things under control.
I don’t think Anthropic is solely to blame for this trend, of course, but I think Anthropic has performed less well on comms/policy than I [and IMO many others] would’ve predicted if you had asked me [or us] in 2022.
@Zac Hatfield-Dodds do you have any thoughts on official comms from Anthropic and Anthropic’s policy team?
For example, I’m curious if you have thoughts on this anecdote– Jack Clark was asked an open-ended question by Senator Cory Booker and he told policymakers that his top policy priority was getting the government to deploy AI successfully. There was no mention of AGI, existential risks, misalignment risks, or anything along those lines, even though it would’ve been (IMO) entirely appropriate for him to bring such concerns up in response to such an open-ended question.
I was left thinking that either Jack does not care much about misalignment risks or he was not being particularly honest/transparent with policymakers. Both of these raise some concerns for me.
(Noting that I hold Anthropic’s comms and policy teams to higher standards than individual employees. I don’t have particularly strong takes on what Anthropic employees should be doing in their personal capacity– like in general I’m pretty in favor of transparency, but I get it, it’s hard and there’s a lot that you have to do. Whereas the comms and policy teams are explicitly hired/paid/empowered to do comms and policy, so I feel like it’s fair to have higher expectations of them.)
Source: Hill & Valley Forum on AI Security (May 2024):
https://www.youtube.com/live/RqxE3ub7wWA?t=13338s
https://www.youtube.com/live/RqxE3ub7wWA?t=13551
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend, due to making little sense. Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet. I think a plan this shoddy obviously endangers life on Earth, so it seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
Meta aside: normally this wouldn’t seem worth digging into but as a moderator/site-culture-guardian, I feel compelled to justify my negative react on the disagree votes.
I’m actually not entirely sure what downvote-reacting is for. Habryka has said the intent is to override inappropriate uses of reacts. We haven’t actually had a sit-down-and-argue-this-out on the moderator team, and I’m pretty sure we haven’t told people, or tried to enforce, that “override inappropriate uses of reacts” is the intended use.
I think Adam’s line:
Is psychologizing and summarizing Anthropic unfairly, so I wouldn’t agree-vote with it. I do think it has some kind of grain of truth to it (me believing this is also kind of “doubting the experience of Anthropic employees,” which is also group-epistemologically dicey IMO, but it feels important enough to do in this case). The claim isn’t true… but I also don’t belief-report that it’s not true.
I initially downvoted the Disagree when it was just Noosphere, since I didn’t think Noosphere was really in a position to have an opinion, and if he was the only reactor it felt more like noise. A few others who are better positioned to know relevant stuff have since added their own disagree reacts. I… feel sort of justified leaving the anti-react up, with an overall indicator of “a bunch of people disagree with this, but the weight of that disagreement is slightly reduced.” (I think I’d remove the anti-react if the disagree count went much lower than it is now.)
I don’t know whether I particularly endorse any of this, but wanted people to have a bit more model of what one site-admin was thinking here.
[/end of rambly meta commentary]
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
I don’t really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model.
The basic point would be that it’s hard to write publicly about how you are taking responsible steps that grapple directly with the real issues… if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl’s characterization of Anthropic’s agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to “talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic” and the counterargument is that doing so would be nice in an ideal world, except it’s very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, “If you want to convince people of something, it’s much easier if it’s true.”
As I see it, not being able to bring up Anthropic’s work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company’s safety team).
Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.
I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?
Look, I don’t think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their ‘Responsible Scaling Policies’, they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That’s their safety policies, not information about their training policies that they want to keep secret so that they can make money.
I believe the Anthropic leadership cares very little about the public’s ability to have arguments and evidence and access to information about Anthropic’s behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there’s a potential major embarrassment. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.
No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other’s company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment.) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.
Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodeis quit purportedly due to having damning criticisms of OpenAI in the present, and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.
To quote from Zac’s above analogy explaining how difficult his situation at Anthropic is.
The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.
Yes, yes, Zac’s situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can’t help but rankle at the implication that the primary reason he and others don’t talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it’s stressful to think you might have gone over it, and it’s stressful to suddenly find yourself unable to engage well with people’s criticisms because you hit a confidential crux. But as to the fault analysis for Zac’s particularly difficult position? In my opinion the blame lies first with the Anthropic leadership, who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.
I don’t think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Scholl said that
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don’t think Adam Scholl’s assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.
Ben Pace has said that perhaps he doesn’t disagree with you in particular about this, but I sure think I do.[1]
I don’t see how the first half of this could be correct, and while the second half could be true, it doesn’t seem to me to offer meaningful support for the first half either (instead, it seems rather… off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, in a spot where you and your company have actually done something meaningful and responsible to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete to back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it is.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven’t addressed, and which still seem to me to be entirely correct).
He also said that he thinks your perspective makes sense, which… I’m not really sure about.
Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.
I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.
I think you go too far in saying that the stress is orthogonal to whether you have a good case to make; I don’t see how you can really believe it’s not a top-3 factor in how much stress you’re experiencing. As a pretty simple hypothetical, if you’re responding to a public scandal about whether you stole money, you’re gonna have a way more stressful time if you did steal money than if you didn’t (in substantial part because you’d be able to show the books and prove it).
Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac’s comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn’t change the fact that there is stress.
Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.
For what it’s worth, I endorse Anthropic’s confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist’s curse and entangled truths mean that confidential-by-default is the only viable policy.
That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable versions of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.
Confidentiality-by-default does not mean you get to abdicate responsibility for answering questions to the people whose lives you are risking about how-and-why you are making decisions, it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality by-default was required, I think it would be trying significantly harder to overcome this barrier.
My general principle is that if you are wielding a lot of power over people that they didn’t otherwise legitimately grant you (in this case building a potential doomsday device), you owe them to be auditable. You are supposed to show up and answer their questions directly – not “thank you so much for the questions, in six months I will publish a related blogpost on this topic” but more like “with the public info available to me, here’s my best guess answer to your specific question today”. Especially so if you are doing something the people you have power over perceive as norm-violating, and even more-so when you are keeping the answers to some very important questions secret from them.
(not going to respond in this context out of respect for Zach’s wishes. May chat later, and am mulling over my own top-level post on the subject)
This obvious straw-man makes your argument easy to dismiss.
However I think the point is basically correct. Anthropic’s strategy to reduce x-risk also includes lobbying against pre-harm enforcement of liability for AI companies in SB 1047.
How is it a straw-man? How is the plan meaningfully different from that?
Imagine a group of people has already gathered a substantial amount of uranium, is already refining it, is already selling power generated by their pile of uranium, etc. And doing so right near and upwind of a major city. And they’re shoveling more and more uranium onto the pile, basically as fast as they can. And when you ask them why they think this is going to turn out well, they’re like “well, we trust our leadership, and you know we have various documents, and we’re hiring for people to ‘Develop and write comprehensive safety cases that demonstrate the effectiveness of our safety measures in mitigating risks from huge piles of uranium’, and we have various detectors such as an EM detector which we will privately check and then see how we feel”. And then the people in the city are like “Hey wait, why do you think this isn’t going to cause a huge disaster? Sure seems like it’s going to by any reasonable understanding of what’s going on”. And the response is “well we’ve thought very hard about it and yes there are risks but it’s fine and we are working on safety cases”. But… there’s something basic missing, which is like, an explanation of what it could even look like to safely have a huge pile of superhot uranium. (Also in this fantasy world no one has ever done so and can’t explain how it would work.)
In the AI case, there’s lots of inaction risk: if Anthropic doesn’t make powerful AI, someone less safety-focused will.
It’s reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn’t exist, I would want Anthropic to slow down.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
But that’s not a plan to ensure their uranium pile goes well.
@Zach Stein-Perlman , you’re missing the point. They don’t have a plan. Here’s the thread (paraphrased in my words):
Zach: [asks, for Anthropic]
Zac: … I do talk about Anthropic’s safety plan and orientation, but it’s hard because of confidentiality and because many responses here are hostile. …
Adam: Actually I think it’s hard because Anthropic doesn’t have a real plan.
Joseph: That’s a straw-man. [implying they do have a real plan?]
Tsvi: No it’s not a straw-man, they don’t have a real plan.
Zach: Something must be done. Anthropic’s plan is something.
Tsvi: They don’t have a real plan.
I explicitly said “However I think the point is basically correct” in the next sentence.
Sorry, reacts are ambiguous.
I agree Anthropic doesn’t have a “real plan” in your sense, and narrow disagreement with Zac on that is fine.
I just think that’s not a big deal and is missing some broader point (maybe that’s a motte and Anthropic is doing something bad—vibes from Adam’s comment—is a bailey).
[Edit: “Something must be done. Anthropic’s plan is something.” is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]
[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing something wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I’m ok with one more reply to this.]
(I won’t reply more, by default.)
Look, if Anthropic was honestly and publically saying
If they were saying and doing that, then I would still raise my eyebrows a lot and wouldn’t really trust it. But at least it would be plausibly consistent with doing good.
But that doesn’t sound like either what they’re saying or doing. IIUC they lobbied to remove protection for AI capabilities whistleblowers from SB 1047! That happened! Wow! And it seems like Zac feels he has to pretend to have a credible credible-plan plan.
Hm. I imagine you don’t want to drill down on this, but just to state for the record, this exchange seems like something weird is happening in the discourse. Like, people are having different senses of “the point” and “the vibe” and such, and so the discourse has already broken down. (Not that this is some big revelation.) Like, there’s the Great Stonewall of the AGI makers. And then Zac is crossing through the gates of the Great Stonewall to come and talk to the AGI please-don’t-makers. But then Zac is like (putting words in his mouth) “there’s no Great Stonewall, or like, it’s not there in order to stonewall you in order to pretend that we have a safe AGI plan or to muddy the waters about whether or not we should have one, it’s there because something something trade secrets and exfohazards, and actually you’re making it difficult to talk by making me work harder to pretend that we have a safe AGI plan or intentions that should promissorily satisfy the need for one”.
Seems like most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist. This is an underspecified claim, and given certain fully-specified instances of it, I’d agree.
But this belief leads to the following reasoning: (1) if we don’t eat all this free energy in the form of researchers+compute+funding, someone else will; (2) other people are clearly less trustworthy compared to us (Anthropic, in this hypothetical); (3) let’s do whatever it takes to maintain our lead and prevent other labs from gaining power, while using whatever resources we have to also do alignment research, preferably in ways that also help us maintain or strengthen our lead in this race.
I don’t credit that they believe that. And, I don’t credit that you believe that they believe that. What did they do, to truly test their belief—such that it could have been changed? For most of them the answer is “basically nothing”. Such a “belief” is not a belief (though it may be an investment, if that’s what you mean). What did you do to truly test that they truly tested their belief? If nothing, then yours isn’t a belief either (though it may be an investment). If yours is an investment in a behavioral stance, that investment may or may not be advisable, but it would DEFINITELY be inadvisable to pretend to yourself that yours is a belief.
I’d be very interested to have references to occasions of people in the AI-safety-adjacent community treating Anthropic employees as liars because of things those people misremembered or misinterpreted. (My guess is that you aren’t interested in litigating these cases; I care about it for internal bookkeeping and so am happy to receive examples e.g. via DM rather than as a public comment.)
Not Zach Hatfield-Dodds, but: people claimed that Anthropic had a commitment not to advance the frontier of capabilities, but as it turns out people misinterpreted communications, and no such commitment actually happened.
Not sure I’d go as far as saying that they treated Anthropic as liars, but this seems to me a central example of Zach Hatfield-Dodds’s concerns.
From Evhub:
https://www.lesswrong.com/posts/BaLAgoEvsczbSzmng/?commentId=yd2t6YymWdfGBFhFa
Contrary to the above, for the record, here is a link to a thread where a major Anthropic investor (Moskovitz) and the researcher who coined the term “The Scaling Hypothesis” (Gwern) both report that the Anthropic CEO told them in private that this is what Anthropic would do, in accordance with what many others also report hearing privately. (There is disagreement about whether this constituted a commitment.)
The one thing I do conclude is that Anthropic’s comms are very inconsistent, and this is bad, actually.
I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms.
I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB 1047 letter they wrote states:
Liability does not address the central threat model of AI takeover, for which pre-harm mitigations are necessary due to the irreversible nature of the harm. I think that this letter should have acknowledged that explicitly, and that not doing so is misleading. I feel that Anthropic is trying to play a game of courting political favor by not being very straightforward about its beliefs around AGI, and that this is bad.
To be clear, I think it is reasonable that they argue that the FMD and government in general will be bad at implementing safety guidelines while still thinking that AGI will soon be transformative. I just really think they should be much clearer about the latter belief.
This does not fit my model of your risk model. Why do you think this?
Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:
I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem to be around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best in very-short-timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pause for at least long enough to allow other labs into the lead. This is especially true in short timelines worlds because the gap between labs is smaller.
I think they have much better AGI safety culture than other labs (though still far from perfect), which will probably result in better adherence to voluntary commitments.
On the other hand, they haven’t been very transparent, and we haven’t seen their ASL-4 commitments. So these commitments might amount to nothing, or Anthropic might just walk them back at a critical juncture.
2-5% is still wildly high in an absolute sense! However, risk from other labs seems even higher to me, and I think that Anthropic could reduce this risk by advocating for reasonable regulations (e.g. transparency into frontier AI projects so no one can build ASI without the government noticing).
I think you probably under-rate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn’t be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate cruft, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don’t know) that they’re value aligned with the overall direction of the company at large. Probably also Nvidia rate-limiting the purchases of large labs to keep competition among the AI companies.
All of this is also compounded by smart models leading to better data curation and RLAIF (given quality researchers & lack of cruft) leading to even better models (this being the big reason I think Llama had to be so big to be SOTA, and Gemini not even SOTA), which of course leads to money in the future even if they have no money now.
How many parameters do you estimate for other SOTA models?
Mistral had like 150b parameters or something.
FYI I believe the correct language is “directly causes an existential catastrophe”. “Existential risk” is a measure of the probability of an existential catastrophe, but is not itself an event.
This one seems probably worth making a top-level post?
I want to avoid this being negative-comms for Anthropic. I’m generally happy to loudly criticize Anthropic, obviously, but this was supposed to be part of the 5% of my work that I do because someone at the lab is receptive to feedback, where the audience was Zac and publishing was an afterthought. (Maybe the disclaimers at the top fail to negate the negative-comms; maybe I should list some good things Anthropic does that no other labs do...)
Also, this is low-effort.