What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
I don’t really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model
I don’t really think any of that affects the difficulty of public communication
The basic point would be that it’s hard to write publicly about how you are taking responsible steps that grapple directly with the real issues… if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl’s characterization of Anthropic’s agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to “talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic” and the counterargument is that doing so would be nice in an ideal world, except it’s very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, “If you want to convince people of something, it’s much easier if it’s true.”
As I see it, not being able to bring up Anthropic’s work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company’s safety team).
Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.
I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?
Look, I don’t think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their ‘Responsible Scaling Policies’, they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That’s their safety policies, not information about their training policies that they want to keep secret so that they can make money.
I believe the Anthropic leadership cares very little about the public’s ability to have arguments and evidence and access to information about Anthropic’s behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there’s a potential major embarrassment. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.
No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other’s company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment.) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.
Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by-default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodei’s quit purportedly due to having damning criticisms of OpenAI in the present and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.
To quote from Zac’s above analogy explaining how difficult his situation at Anthropic is.
Imagine, if you will, trying to hold a long conversation about AI risk—but you can’t reveal any information about, or learned from, or even just informative about LessWrong. Every claim needs an independent public source, as do any jargon or concepts that would give an informed listener information about the site, etc.; you have to find different analogies and check that citations are public
The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.
Yes, yes, Zac’s situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can’t help but wrankle at the implication that the primary reason he and others don’t talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it’s stressful to think you might have gone over it, and it’s stressful to suddenly find yourself unable to engage well with people’s criticisms because you hit a confidential crux. But as to the fault analysis for Zac’s particularly difficult position? In my opinion the blame is surely first with the Anthropic leadership who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.
I don’t think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Schollsaid that
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend… [I]t seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don’t think Adam Scholl’s assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.
Ben Pace has said that perhaps he doesn’t disagree with you in particular about this, but I sure think I do.[1]
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline.
I don’t see how the first half of this could be correct, and while the second half could be true, it doesn’t seem to me to offer meaningful support for the first half either (instead, it seems rather… off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, in a spot where you have something meaningful and responsible, etc, that you and your company have done to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete that you can back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it is.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place.
[1] Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven’t addressed, and which still seem to me to be entirely correct).
Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.
I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.
I think you go too far in saying that the stress is orthogonal to whether you have a good case to make, I think you can’t really think that it’s not a top-3 factor to how much stress you’re experiencing. As a pretty simple hypothetical, if you’re responding to a public scandal about whether you stole money, you’re gonna have a way more stressful time if you did steal money than if you didn’t (in substantial part because you’d be able to show the books and prove it).
Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac’s comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn’t change the fact that there is stress.
Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.
For what it’s worth, I endorse Anthopic’s confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist’s curse and entangled truths mean that confidential-by-default is the only viable policy.
That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable version of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.
Confidentiality-by-default does not mean you get to abdicate responsibility for answering questions to the people whose lives you are risking about how-and-why you are making decisions, it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality by-default was required, I think it would be trying significantly harder to overcome this barrier.
My general principle is that if you are wielding a lot of power over people that they didn’t otherwise legitimately grant you (in this case building a potential doomsday device), you owe them to be auditable. You are supposed to show up and answer their questions directly – not “thank you so much for the questions, in six months I will publish a related blogpost on this topic” but more like “with the public info available to me, here’s my best guess answer to your specific question today”. Especially so if you are doing something the people you have power over perceive as norm-violating, and even more-so when you are keeping the answers to some very important questions secret from them.
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
I don’t really think any of that affects the difficulty of public communication; your implication that it must be the cause reads to me more like an insult than a well-considered psychological model
The basic point would be that it’s hard to write publicly about how you are taking responsible steps that grapple directly with the real issues… if you are not in fact doing those responsible things in the first place. This seems locally valid to me; you may disagree on the object level about whether Adam Scholl’s characterization of Anthropic’s agenda/internal work is correct, but if it is, then it would certainly affect the difficulty of public communication to such an extent that it might well become the primary factor that needs to be discussed in this matter.
Indeed, the suggestion is for Anthropic employees to “talk about their views (on AI progress and risk and what Anthropic is doing and what Anthropic should do) with people outside Anthropic” and the counterargument is that doing so would be nice in an ideal world, except it’s very psychologically exhausting because every public statement you make is likely to get maliciously interpreted by those who will use it to argue that your company is irresponsible. In this situation, there is a straightforward direct correlation between the difficulty of public communication and the likelihood that your statements will get you and your company in trouble.
But the more responsible you are in your actual work, the more responsible-looking details you will be able to bring up in conversations with others when you discuss said work. AI safety community members are not actually arbitrarily intelligent Machiavellians with the ability to convincingly twist every (in-reality) success story into an (in-perception) irresponsible gaffe;[1] the extent to which they can do this depends very heavily on the extent to which you have anything substantive to bring up in the first place. After all, as Paul Graham often says, “If you want to convince people of something, it’s much easier if it’s true.”
As I see it, not being able to bring up Anthropic’s work/views on this matter without some AI safety person successfully making it seem like Anthropic is behaving badly is rather strong Bayesian evidence that Anthropic is in fact behaving badly. So this entire discussion, far from being an insult, seems directly on point to the topic at hand, and locally valid to boot (although not necessarily sound, as that depends on an individualized assessment of the particular object-level claims about the usefulness of the company’s safety team).
Quite the opposite, actually, if the change in the wider society’s opinions about EA in the wake of the SBF scandal is any representative indication of how the rationalist/EA/AI safety cluster typically handles PR stuff.
I think communication as careful as it must be to maintain the confidentiality distinction here is always difficult in the manner described, and that communication to large quantities of people will ~always result in someone running with an insane misinterpretation of what was said.
I understand that this confidentiality point might seem to you like the end of the fault analysis, but have you considered the hypothesis that Anthropic leadership has set such stringent confidentiality policies in part to make it hard for Zac to engage in public discourse?
Look, I don’t think Anthropic leadership is just trying to keep their training skills private or their models secure. Their company does not merely keep trade secrets. When I speak to staff from this company about issues with their ‘Responsible Scaling Policies’, they say that they want to tell me more information about how they think it can be better or how they think it might change, but cannot due to confidentiality constraints. That’s their safety policies, not information about their training policies that they want to keep secret so that they can make money.
I believe the Anthropic leadership cares very little about the public’s ability to have arguments and evidence and access to information about Anthropic’s behavior. The leadership roughly ~never shows up to engage with critical discourse about itself, unless there’s a potential major embarrassment. There is no regular Q&A session with the leadership of a company who believes their own product poses a 10-25% chance of existential catastrophe, no comment section on their website, no official twitter accounts that regularly engage with and share info with critics, no debates with the many people who outright oppose their actions.
No, they go far in the other direction of committing to no-public-discourse. I challenge any Anthropic staffer to openly deny that there is a mutual non-disparagement agreement between Anthropic and OpenAI leadership, whereby neither is legally allowed to openly criticize the other’s company. (You can read cofounder Sam McCandlish write that Anthropic has mutual non-disparagement agreements in this comment.) Anthropic leadership say they quit OpenAI primarily due to safety concerns, and yet I believe they simultaneously signed away their ability to criticize that very organization that they had such unique levels of information about and believed poses an existential threat to civilization.
Where Daniel Kokotajlo refused to sign a non-disparagement agreement (by-default forfeiting his equity) so that he could potentially criticize OpenAI in the future, the Amodei’s quit purportedly due to having damning criticisms of OpenAI in the present and then (I believe) chose to sign a non-disparagement agreement while quitting (and kept their equity). A complete inversion of good collective epistemic principles.
To quote from Zac’s above analogy explaining how difficult his situation at Anthropic is.
The analogous goal described here for Anthropic is to have complete separation between internal and external information. This does not describe a set of blacklisted trade-secrets or security practices. My sense is that for most safety-related issues Anthropic has a set of whitelisted information, which is primarily the already public stuff. The goal here is for you to not have access to any information about them that they did not explicitly decide that they wanted you to know, and they do not want people in their org to share new information when engaging in public, critical discourse.
Yes, yes, Zac’s situation is stressful and I respect his effort to engage in public discourse nonetheless. Good on Zac. But I can’t help but wrankle at the implication that the primary reason he and others don’t talk more is the public commentariat not empathizing enough with having confidential info. Sure, people could do better to understand the difficulty of communicating while holding confidential info. It is hard to repeatedly walk right up to the line and not over it, it’s stressful to think you might have gone over it, and it’s stressful to suddenly find yourself unable to engage well with people’s criticisms because you hit a confidential crux. But as to the fault analysis for Zac’s particularly difficult position? In my opinion the blame is surely first with the Anthropic leadership who have given him way too stringent confidentiality constraints, due to seeming to anti-care about helping people external to Anthropic understand what is going on.
I don’t think the placement of fault is causally related to whether communication is difficult for him, really. To refer back to the original claim being made, Adam Scholl said that
I think the amount of stress incurred when doing public communication is nearly orthogonal to these factors, and in particular is, when trying to be as careful about anything as Zac is trying to be about confidentiality, quite high at baseline. I don’t think Adam Scholl’s assessment arose from a usefully-predictive model, nor one which was likely to reflect the inside view.
Ben Pace has said that perhaps he doesn’t disagree with you in particular about this, but I sure think I do.[1]
I don’t see how the first half of this could be correct, and while the second half could be true, it doesn’t seem to me to offer meaningful support for the first half either (instead, it seems rather… off-topic).
As a general matter, even if it were the case that no matter what you say, at least one person will actively misinterpret your words, this fact would have little bearing on whether you can causally influence the proportion of readers/community members that end up with (what seem to you like) the correct takeaways from a discussion of that kind.
Moreover, in a spot where you have something meaningful and responsible, etc, that you and your company have done to deal with safety issues, the major concern in your mind when communicating publicly is figuring out how to make it clear to everyone that you are on top of things without revealing confidential information. That is certainly stressful, but much less so than the additional constraint you have in a world in which you do not have anything concrete that you can back your generic claims of responsibility with, since that is a spot where you can no longer fall back on (a partial version of) the truth as your defense. For the vast majority of human beings, lying and intentional obfuscation with the intent to mislead are significantly more psychologically straining than telling the truth as-you-see-it is.
Overall, I also think I disagree about the amount of stress that would be caused by conversations with AI safety community members. As I have said earlier:
In any case, I have already made all these points in a number of ways in my previous response to you (which you haven’t addressed, and which still seem to me to be entirely correct).
He also said that he thinks your perspective makes sense, which… I’m not really sure about.
Yeah, I totally think your perspective makes sense and I appreciate you bringing it up, even though I disagree.
I acknowledge that someone who has good justifications for their position but just has made a bunch of reasonable confidentiality agreements around the topic should expect to run into a bunch of difficulties and stresses around public conflicts and arguments.
I think you go too far in saying that the stress is orthogonal to whether you have a good case to make, I think you can’t really think that it’s not a top-3 factor to how much stress you’re experiencing. As a pretty simple hypothetical, if you’re responding to a public scandal about whether you stole money, you’re gonna have a way more stressful time if you did steal money than if you didn’t (in substantial part because you’d be able to show the books and prove it).
Perhaps not so much disagreeing with you in particular, but disagreeing with my sense of what was being agreed upon in Zac’s comment and in the reacts, I further wanted to raise my hypothesis that a lot of the confidentiality constraints are unwarranted and actively obfuscatory, which does change who is responsible for the stress, but doesn’t change the fact that there is stress.
Added: Also, I think we would both agree that there would be less stress if there were fewer confidentiality restrictions.
For what it’s worth, I endorse Anthopic’s confidentiality policies, and am confident that everyone involved in setting them sees the increased difficulty of public communication as a cost rather than a benefit. Unfortunately, the unilateralist’s curse and entangled truths mean that confidential-by-default is the only viable policy.
That might be the case, but then it only increases the amount of work your company should be doing to carve out and figure out the info that can be made public, and engage with criticism. There should be whole teams who have Twitter accounts and LW accounts and do regular AMAs and show up to podcasts and who have a mandate internally to seek information in the organization and publish relevant info, and there should be internal policies that reflect an understanding that it is correct for some research teams to spend 10-50% of their yearly effort toward making publishable version of research and decision-making principles in order to inform your stakeholders (read: the citizens of earth) and critics about decisions you are making directly related to existential catastrophes that you are getting rich running toward. Not monologue-style blogposts, but dialogue-style comment sections & interviews.
Confidentiality-by-default does not mean you get to abdicate responsibility for answering questions to the people whose lives you are risking about how-and-why you are making decisions, it means you have to put more work into doing it well. If your company valued the rest of the world understanding what is going on yet thought confidentiality by-default was required, I think it would be trying significantly harder to overcome this barrier.
My general principle is that if you are wielding a lot of power over people that they didn’t otherwise legitimately grant you (in this case building a potential doomsday device), you owe them to be auditable. You are supposed to show up and answer their questions directly – not “thank you so much for the questions, in six months I will publish a related blogpost on this topic” but more like “with the public info available to me, here’s my best guess answer to your specific question today”. Especially so if you are doing something the people you have power over perceive as norm-violating, and even more-so when you are keeping the answers to some very important questions secret from them.
(not going to respond in this context out of respect for Zach’s wishes. May chat later, and am mulling over my own top-level post on the subject)