One thing that I do after social interactions, especially those which pertain to my work, is to go over all the updates my background processing is likely to make and to question them more explicitly.
This is helpful because I often notice that the updates I’m making aren’t related to reasons much at all. It’s more like “ah they kind of grimaced when I said that, so maybe I’m bad?” or like “they seemed just generally down on this approach, but wait are any of those reasons even new to me? Haven’t I already considered those and decided to do it anyway?” or “they seemed so aggressively pessimistic about my work, but did they even understand what I was saying?” or “they certainly spoke with a lot of authority, but why should I trust them on this, and do I even care about their opinion here?” Etc. A bunch of stuff which at first blush my social center is like “ah god, it’s all over, I’ve been an idiot this whole time” but with some second glancing it’s like “ah wait no, probably I had reasons for doing this work that withstand surface level pushback, let’s remember those again and see if they hold up” And often (always?) they do.
This did not come naturally to me; I’ve had to train myself into doing it. But it has helped a lot with this sort of problem, alongside the solutions you mention i.e. becoming more of a hermit and trying to surround myself by people engaged in more timeless thought.
My high-level skepticism of their approach is A) I don’t buy that it’s possible yet to know how dangerous models are, nor that it is likely to become possible in time to make reasonable decisions, and B) I don’t buy that Anthropic would actually pause, except under a pretty narrow set of conditions which seem unlikely to occur.
As to the first point: Anthropic’s strategy seems to involve Anthropic somehow knowing when to pause, yet as far as I can tell, they don’t actually know how they’ll know that. Their scaling policy does not list the tests they’ll run, nor the evidence that would cause them to update, just that somehow they will. But how? Behavioral evaluations aren’t enough, imo, since we often don’t know how to update from behavior alone—maybe the model inserted the vulnerability into the code “on purpose,” or maybe it was an honest mistake; maybe the model can do this dangerous task robustly, or maybe it just got lucky this time, or we phrased the prompt wrong, or any number of other things. And these sorts of problems seem likely to get harder with scale, i.e., insofar as it matters to know whether models are dangerous.
This is just one approach for assessing the risk, but imo no currently-possible assessment results can suggest “we’re reasonably sure this is safe,” nor come remotely close to that, for the same basic reason: we lack a fundamental understanding of AI. Such that ultimately, I expect Anthropic’s decisions will in fact mostly hinge on the intuitions of their employees. But this is not a robust risk management framework—vibes are not substitutes for real measurement, no matter how well-intentioned those vibes may be.
Also, all else equal I think you should expect incentives might bias decisions the more interpretive-leeway staff have in assessing the evidence—and here, I think the interpretation consists largely of guesswork, and the incentives for employees to conclude the models are safe seem strong. For instance, Anthropic employees all have loads of equity—including those tasked with evaluating the risks!—and a non-trivial pause, i.e. one lasting months or years, could be a death sentence for the company.
But in any case, if one buys the narrative that it’s good for Anthropic to exist roughly however much absolute harm they cause—as long as relatively speaking, they still view themselves as improving things marginally more than the competition—then it is extremely easy to justify decisions to keep scaling. All it requires is for Anthropic staff to conclude they are likely to make better decisions than e.g., OpenAI, which I think is the sort of conclusion that comes pretty naturally to humans, whatever the evidence.
This sort of logic is even made explicit in their scaling policy:
Personally, I am very skeptical that Anthropic will in fact end up deciding to pause for any non-trivial amount of time. The only scenario where I can really imagine this happening is if they somehow find incontrovertible evidence of extreme danger—i.e., evidence which not only convinces them, but also their investors, the rest of the world, etc.—such that it would become politically or legally impossible for any of their competitors to keep pushing ahead either.
But given how hesitant they seem to commit to any red lines about this now, and how messy and subjective the interpretation of the evidence is, and how much inference is required to e.g. go from the fact that “some model can do some AI R&D task” to “it may soon be able to recursively self-improve,” I feel really quite skeptical that Anthropic is likely to encounter the sort of knockdown, beyond-a-reasonable-doubt evidence of disaster that I expect would be needed to convince them to pause.
I do think Anthropic staff probably care more about the risk than the staff of other frontier AI companies, but I just don’t buy that this caring does much. Partly because simply caring is not a substitute for actual science, and partly because I think it is easy for even otherwise-virtuous people to rationalize things when the stakes and incentives are this extreme.
Anthropic’s strategy seems to me to involve a lot of magical thinking—a lot of, “with proper effort, we’ll surely surely figure out what to do when the time comes.” But I think it’s on them to demonstrate to the people whose lives they are gambling with, how exactly they intend to cross this gap, and in my view they sure do not seem to be succeeding at that.