Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by “doing periodic vibe checks”
This obvious straw-man makes your argument easy to dismiss.
However I think the point is basically correct. Anthropic’s strategy to reduce x-risk also includes lobbying against pre-harm enforcement of liability for AI companies in SB 1047.
How is it a straw-man? How is the plan meaningfully different from that?
Imagine a group of people has already gathered a substantial amount of uranium, is already refining it, is already selling power generated by their pile of uranium, etc. And doing so right near and upwind of a major city. And they’re shoveling more and more uranium onto the pile, basically as fast as they can. And when you ask them why they think this is going to turn out well, they’re like “well, we trust our leadership, and you know we have various documents, and we’re hiring for people to ‘Develop and write comprehensive safety cases that demonstrate the effectiveness of our safety measures in mitigating risks from huge piles of uranium’, and we have various detectors such as an EM detector which we will privately check and then see how we feel”. And then the people in the city are like “Hey wait, why do you think this isn’t going to cause a huge disaster? Sure seems like it’s going to by any reasonable understanding of what’s going on”. And the response is “well we’ve thought very hard about it and yes there are risks but it’s fine and we are working on safety cases”. But… there’s something basic missing, which is like, an explanation of what it could even look like to safely have a huge pile of superhot uranium. (Also, in this fantasy world, no one has ever done so, nor can anyone explain how it would work.)
In the AI case, there’s lots of inaction risk: if Anthropic doesn’t make powerful AI, someone less safety-focused will.
It’s reasonable to think e.g. I want to boost Anthropic in the current world because others are substantially less safe, but if other labs didn’t exist, I would want Anthropic to slow down.
I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.
Like, I don’t feel at all confident that Anthropic’s credit has exceeded their debit, even on their own consequentialist calculus. They are clearly exacerbating race dynamics, both by pushing the frontier, and by lobbying against regulation. And what they have done to help strikes me as marginal at best and meaningless at worst. E.g., I don’t think an RSP is helpful if we don’t know how to scale safely; we don’t, so I feel like this device is mostly just a glorified description of what was already happening, namely that the labs would use their judgment to decide what was safe. Because when it comes down to it, if an evaluation threshold triggers, the first step is to decide whether that was actually a red-line, based on the opaque and subjective judgment calls of people at Anthropic. But if the meaning of evaluations can be reinterpreted at Anthropic’s whims, then we’re back to just trusting “they seem to have a good safety culture,” and that isn’t a real plan, nor really any different to what was happening before. Which is why I don’t consider Adam’s comment to be a strawman. It really is, at the end of the day, a vibe check.
And I feel pretty sketched out in general by bids to consider their actions relative to other extremely reckless players like OpenAI. Because when we have so little sense of how to build this safely, it’s not like someone can come in and completely change the game. At best they can do small improvements on the margins, but once you’re at that level, it feels kind of like noise to me. Maybe one lab is slightly better than the others, but they’re still careening towards the same end. And at the very least it feels like there is a bit of a missing mood about this, when people are requesting we consider safety plans relatively. I grant Anthropic is better than OpenAI on that axis, but my god, is that really the standard we’re aiming for here? Should we not get to ask “hey, could you please not build machines that might kill everyone, or like, at least show that you’re pretty sure that won’t happen before you do?”
But that’s not a plan to ensure their uranium pile goes well.
@Zach Stein-Perlman, you’re missing the point. They don’t have a plan. Here’s the thread (paraphrased in my words):
Zach: [asks, for Anthropic]
Zac: … I do talk about Anthropic’s safety plan and orientation, but it’s hard because of confidentiality and because many responses here are hostile. …
Adam: Actually I think it’s hard because Anthropic doesn’t have a real plan.
Joseph: That’s a straw-man. [implying they do have a real plan?]
Tsvi: No it’s not a straw-man, they don’t have a real plan.
Zach: Something must be done. Anthropic’s plan is something.
Tsvi: They don’t have a real plan.

I explicitly said “However I think the point is basically correct” in the next sentence.

Sorry, reacts are ambiguous.
I agree Anthropic doesn’t have a “real plan” in your sense, and narrow disagreement with Zac on that is fine.
I just think that’s not a big deal and is missing some broader point (maybe that’s a motte, and “Anthropic is doing something bad”, the vibe from Adam’s comment, is a bailey).
[Edit: “Something must be done. Anthropic’s plan is something.” is a very bad summary of my position. My position is more like various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake.]
[Edit: replies to this shortform tend to make me sad and distracted—this is my fault, nobody is doing anything wrong—so I wish I could disable replies and I will probably stop replying and would prefer that others stop commenting. Tsvi, I’m ok with one more reply to this.]
various facts about Anthropic mean that them-making-powerful-AI is likely better than the counterfactual, and evaluating a lab in a vacuum or disregarding inaction risk is a mistake
(I won’t reply more, by default.)
Look, if Anthropic was honestly and publicly saying
We do not have a credible plan for how to make AGI, and we have no credible reason to think we can come up with a plan later. Neither does anyone else. But—on the off chance there’s something that could be done with a nascent AGI that makes a non-omnicide outcome marginally more likely, if the nascent AGI is created and observed by people who are at least thinking about the problem—on that off chance, we’re going to keep up with the other leading labs. But again, given that no one has a credible plan or a credible credible-plan plan, better would be if everyone including us stopped. Please stop this industry.
If they were saying and doing that, then I would still raise my eyebrows a lot and wouldn’t really trust it. But at least it would be plausibly consistent with doing good.
But that doesn’t sound like either what they’re saying or doing. IIUC they lobbied to remove protection for AI capabilities whistleblowers from SB 1047! That happened! Wow! And it seems like Zac feels he has to pretend to have a credible credible-plan plan.
Hm. I imagine you don’t want to drill down on this, but just to state for the record, this exchange seems like something weird is happening in the discourse. Like, people are having different senses of “the point” and “the vibe” and such, and so the discourse has already broken down. (Not that this is some big revelation.) Like, there’s the Great Stonewall of the AGI makers. And then Zac is crossing through the gates of the Great Stonewall to come and talk to the AGI please-don’t-makers. But then Zac is like (putting words in his mouth) “there’s no Great Stonewall, or like, it’s not there in order to stonewall you in order to pretend that we have a safe AGI plan or to muddy the waters about whether or not we should have one, it’s there because something something trade secrets and exfohazards, and actually you’re making it difficult to talk by making me work harder to pretend that we have a safe AGI plan or intentions that should promissorily satisfy the need for one”.
Seems like most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist. This is an underspecified claim, and given certain fully-specified instances of it, I’d agree.
But this belief leads to the following reasoning: (1) if we don’t eat all this free energy in the form of researchers+compute+funding, someone else will; (2) other people are clearly less trustworthy compared to us (Anthropic, in this hypothetical); (3) let’s do whatever it takes to maintain our lead and prevent other labs from gaining power, while using whatever resources we have to also do alignment research, preferably in ways that also help us maintain or strengthen our lead in this race.
most people believe (implicitly or explicitly) that empirical research is the only feasible path forward to building a somewhat aligned generally intelligent AI scientist.
I don’t credit that they believe that. And, I don’t credit that you believe that they believe that. What did they do, to truly test their belief—such that it could have been changed? For most of them the answer is “basically nothing”. Such a “belief” is not a belief (though it may be an investment, if that’s what you mean). What did you do to truly test that they truly tested their belief? If nothing, then yours isn’t a belief either (though it may be an investment). If yours is an investment in a behavioral stance, that investment may or may not be advisable, but it would DEFINITELY be inadvisable to pretend to yourself that yours is a belief.