I think it’s pretty unlikely that Anthropic’s murky strategy is good.
In particular, I think that balancing building AGI with building AGI safely only goes well for humanity in a pretty narrow range of worlds. Like, if safety is relatively easy and can roughly keep pace with capabilities, then I think this sort of thing might make sense. But the more the expected world departs from this—the more that you might expect safety to be way behind capabilities, and the more you might expect that it’s hard to notice just how big that gap is and/or how much of a threat capabilities pose—the more this strategy starts seeming pretty worrying to me.
It’s worrying because I don’t imagine Anthropic gets that many “shots” at playing safety cards, so to speak. Like, implementing RSPs and trying to influence norms is one thing, but what if they notice something actually-maybe-dangerous-but-they’re-not-sure as they’re building? Now they’re in a position where if they want to be really careful (e.g., taking costly actions like: stop indefinitely until they’re beyond reasonable doubt that it’s safe) they’re most likely kind of screwing their investors, and should probably expect to get less funding in the future. And the more likely it is, from their perspective, that the behavior in question will end up being a false alarm, the more pressure there is to not do due diligence.
But the problem is that the more ambiguous the situation is—the less we understand about these systems—the less sure we can be about whether any given behavior is or isn’t an indication of something pretty dangerous. And the current situation seems pretty ambiguous to me. I don’t think anyone knows, for instance, whether Claude 3 seeming to notice it’s being tested is something to worry about or not. Probably it isn’t. But really, how do we know? We’re going off of mostly behavioral cues and making educated guesses about what the behavior implies. But that really isn’t very reassuring when we’re building something much smarter than us, with potentially catastrophic consequences. As it stands, I don’t believe we can even assign numbers to things in a very meaningful sense, let alone assign confidence above a remotely acceptable threshold, i.e., some 9’s of assurance that what we’re about to embark on won’t kill everyone.
The combination of how much uncertainty there is in evaluating these systems and how much pressure there is for Anthropic to keep scaling seems very worrying to me. Like, if there’s a very obvious sign that a system is dangerous, then I believe Anthropic might be in a good position to pause and “sound the alarm.” But if things remain kind of ambiguous due to our lack of understanding, as they seem to me now, then I’m way less optimistic that the outcome of any maybe-dangerous-but-we’re-not-sure behavior is that Anthropic meaningfully and safely addresses it. In other words, I think that given our current state of understanding, the murky strategy favors “build AGI” more than it does “build AGI safely,” and that scares me.
I also think the prior should be quite strong, here, that the obvious incentives will have the obvious effects. Like, creating AGI is desirable (so long as it doesn’t kill everyone and so on). Not only on the “loads of money” axis, but also along other axes monkeys care about: prestige, status, influence, power, etc. Yes, practically no one wants to die, and I don’t doubt that many people at Anthropic genuinely care and are worried about this. But, also, it really seems like you should a priori expect that with stakes this high, cognition will get distorted around whether or not to pursue the stakes. Maybe all Anthropic staff are perfectly equipped to be epistemically sane in such an environment, but I don’t think that one should on priors expect it. People get convinced of all kinds of things when they have a lot to gain, or a lot to lose.
Anyway, it seems likely to me that we will continue to live in the world where we don’t understand these systems well enough to be confident in our evaluations of them, and I assign pretty significant probability to the worlds where capabilities far outstrip our alignment techniques, so I am currently not thrilled that Anthropic exists. I expect that their murky strategy is net bad for humanity, given how the landscape currently looks.
Maybe you really do need to iterate on frontier AI to do meaningful safety work.
This seems like an open question that, to my mind, Anthropic has not fully explored. One way that I sometimes think about this is to ask: if Anthropic were the only leading AI lab, with no possibility of anyone catching up any time soon, should they still be scaling as fast as they are? My guess is no. Like, of course the safety benefit of scaling is not zero. But it’s a question of whether the benefits outweigh the costs. Given how little we understand these systems, I’d be surprised if we were anywhere near to hitting diminishing safety returns—as in, I don’t think the safety benefits of scaling vastly outstrip the benefit we might expect out of empirical work on current systems. And I think the potential cost of scaling as recklessly as we currently are is extinction. I don’t doubt that at some point scaling will be necessary and important for safety; I do doubt that the time for that is now.
Maybe you do need to stay on the frontier because the world is accelerating whether Anthropic wants it to or not.
It really feels like if you create an organization which, with some unsettlingly large probability, might directly lead to the extinction of humanity, then you’re doing something wrong. Especially so, if the people that you’re making the decisions for (i.e., everyone), would be—if they fully understood the risks involved—on reflection unhappy about it. Like, I’m pretty sure that the sentence from Anthropic’s pitch deck “these models could begin to automate large portions of the economy” is already enough for many people to be pretty upset. But if they learned that Anthropic also assigned ~33% to a “pessimistic world” which includes the possibility “extinction” then I expect most people would rightly be pretty furious. I think making decisions for people in a way that they would predictably be upset about is unethical, and it doesn’t make it okay just because other people would do it anyway.
In any case, I think that Anthropic’s existence has hastened race dynamics, and I think that makes our chances of survival lower. That seems pretty in line with what to expect from this kind of strategy (i.e., that it cashes out to scaling coming before safety where it’s non-obvious what to do), and I think it makes sense to expect things of this type going forward (e.g., I am personally pretty skeptical that Anthropic is going to ~meaningfully pause development unless it’s glaringly obvious that they should do so, at which point I think we’re clearly in a pretty bad situation). And although OpenAI was never claiming as much of a safety vibe as Anthropic currently is, I still think the track record of ambiguous strategies which play to both sides does not inspire that much confidence about Anthropic’s trajectory.
Does Dario-and-other-leadership have good models of x-risk?
I am worried about this. My read on the situation is that Dario is basically expecting something more like a tool than an agent. Broadly, I get this sense because when I model Anthropic as operating under the assumption that risks mostly stem from misuse, their actions make a lot more sense to me. But also things like this quote seem consistent with that: “I suspect that it may roughly work to think of the model as if it’s trained in the normal way, just getting to above human level, it may be a reasonable assumption… that the internal structure of the model is not intentionally optimizing against us.” (Dario on the Dwarkesh podcast). If true, this makes me worried about the choices that Dario is going to make when, again, it’s not clear how to interpret the behavior of these systems. In particular, it makes me worried he’s going to err on the side of “this is probably fine,” since tools seem, all else equal, less dangerous than agents. Dario isn’t the only person Anthropic’s decisions depend on; still, I think his beliefs have a large bearing on what Anthropic does.
But, the way I wish the conversation was playing out was less like “did Anthropic say a particular misleading thing?”
I think it’s pretty important to call attention to misleading things. Both because there is some chance that public focus on inconsistencies might cause them to change their behavior, and because pointing out specific problems in public arenas often causes evidence to come forward in one common space, and then everyone can gain a clearer understanding of what’s going on.