Anthropic, and taking “technical philosophy” more seriously

So, I have a lot of complaints about Anthropic, and about how EA /​ AI safety people often relate to Anthropic (i.e. treating the company as more trustworthy/​good than makes sense).

At some point I may write up a post that is focused on those complaints.

But after years of arguing with Anthropic employees, and reading the few pieces of public writing they’ve done, my sense is that Dario and Anthropic leadership are at least reasonably earnestly trying to do good things within their worldview.

So I want to just argue with the object-level parts of that worldview that I disagree with.

I think the Anthropic worldview is something like:

  1. Superalignment is probably not that hard to navigate.[1]

  2. Misuse by totalitarian regimes or rogue actors is reasonably likely by default, and very bad.

  3. AGI built in The West would not be bad in the ways that AGI from totalitarian regimes would be.

  4. Quick empiricism is a way better approach to solving research problems than trying to “think ahead” too cleverly.

  5. It’s better to only make strong claims when you have strong evidence to back them up, and only make strong commitments when you’re confident about those commitments being the right ones. (This is partly about “well, there are some political games you just do need to play”, but at least partly motivated by epistemic principles).[2]

My current understanding of the corresponding Anthropic strategy is:

  1. Stay on the frontier of AI, so that they reliably have “a seat at the table” during the realpolitik discussions of how AI policy plays out.

  2. Race to get ~human-level AGI that helps with scalable superalignment research.[3]

  3. Iterate on various empirical frameworks for scalable oversight, e.g. various ways you can use weaker models to check that larger untrusted models are safe (a rough sketch of this pattern follows the list). Use a mix of interpretability and control techniques to help.

  4. Encourage the West to race, banking on it being easier to get companies/politicians to pivot to safer practices when there are clearer smoking guns, and clearer asks to make of people.

  5. Eventually they do need to spend a bunch of political chips, which they seem to be starting to spend this year. But, those chips are focused on something like: “race to ensure lead over China, then maybe briefly slow down when the lead is more secure and we’re close to the dangerous ASL levels, then invent superintelligence, and use it to solve a bunch of problems”[4] (as opposed to “global moratorium with international buy-in”)

  6. ????

I haven’t seen a clear plan for what comes exactly after step #5. I’ve seen Dario say in interviews “we will need to Have Some Kind of Conversation About AI, Together.” This is pretty vague. It might be vague because Dario doesn’t have a plan there, or it might be that the asks are too Overton-shattering to say right now and he is waiting till there’s clearer evidence to throw at people.

I think it’s plausible Dario’s take is like “look it’s not that useful to have a plan in advance here because it’ll depend a ton on what the gameboard looks like at the time.”

I: Arguments for “Technical Philosophy”

The crucial questions in my mind here are:

Technical

  • Do we need 10-30 years of serial research, which can’t be parallelized?

  • Does superalignment require “extreme philosophical competence”?

  • Is the Anthropic culture/​playbook prepared for the potential necessity of 10+ year pauses?

  • What does the curve of “alignment difficulty vs capabilities” look like, right around the point where AI becomes capable enough to meaningfully help with “ending the acute risk period?”

Geopolitical

  • How inevitable is racing?

  • How willing would China /​ others be to go along with a serious pause proposal?

  • How much Overton-window-smashing do we need “right now” to get to an adequate world and how practical is that?

Tying these together:

  • Should Anthropic comms and strategy be way more focused on getting everyone to slow down, and way less focused on racing against totalitarian states?

I don’t disagree that totalitarian AI would be real bad. It’s quite plausible to me that the “global pause” crowd are underweighting how bad it would be.

But I’m personally at like “it’s at least ~60% that superalignment is Real Hard”, and I think the Anthropic playbook and research culture are not well suited to solving it. And this makes the “how bad is totalitarian AI?” question kind of moot.

It feels vaguely reasonable to me to have a belief as low as 15% on “Superalignment is Real Hard in a way that requires like a 10-30 year pause.” And, at 15%, it still feels pretty crazy to be oriented around racing the way Anthropic is.

I once talked to Zac Hatfield-Dodds, who agreed with me that “15% on ‘superalignment requires a 10+ year pause’” would be a crux, but his number was like 5%[5]. And, yeah – if I earnestly believed 5%, I would honestly be unsure how to weigh the risk of x-risk against the risk of catastrophic misuse. I was pretty happy to find such a crisp disagreement – a crux! a double crux even!

My guess is that Anthropic employees have biased reasons for thinking the risk is so low, or can be treated as low in practice. (See Optimistic Assumptions, Longterm Planning, and “Cope”.) But over my years of arguing with them, I’ve found it harder to present a slam-dunk case than I initially expected, and I think it’d be good to actually argue some of the object-level details.

I will articulate some arguments here, and mostly I am hoping for more argument and back-and-forth in the comments. I think it’d be cool if this ended with someone representative of Anthropic having more of an extended debate with someone with stronger takes on the technical side of things than me.

10-30 years of serial research, or “extreme philosophical competence.”

Anthropic seems to be making costly enough signals of “we may need to slow down for 6-12 months or so, and we want other people to be ready for that too” that I believe they take that earnestly.

But they aren’t saying anything like “hey everyone, we may have to stop for over a decade in order to figure things out.” And this is the thing I most wish they would do.

The notion of “we need a lot of serial research time” was written up by Nate Soares in a post on differential technological development. The argument is: even though some types of research will get easier when we have more “realistic/representative” AI, or more researchers available to focus on the problem, there are some kinds of research that really require one person to do a lot of deep thinking, integrating concepts, and synthesizing – work that historically has seemed to take blocks of time measured in decades.

Right now, we’re in an “iterative empiricism” regime with AI, where you can see some problems, try some obvious solutions, see how those solutions work, and iterate on the next generation. The problem is that at some point, AI will probably hit a critical threshold where it rapidly recursively self-improves.

Does your alignment process safely scale to infinity?

A lot of safeguards that hold up when the next generation of AI is only slightly more powerful, break down when the next generation has scaled to “unboundedly powerful.”

I think Dario agrees with this part. My sense, from scattered conversations, public comms, and reading between the lines, is that the expectation is “Anthropic will be able to notice and slow down before we get to this point, and make sure there is a good scalable oversight system in place before beginning the recursive self-improvement process.”

Where I think we disagree is on how badly suited the Anthropic playbook is for handling this shift.

I think for the shift from “near human” to “massively superhuman” to go safely, your alignment team needs to be very competent at “technical philosophy.”

By “technical philosophical competence,” I think I mostly mean: “Ability to define and reason about abstract concepts with extreme precision. In some cases, about concepts which humanity has struggled to agree on for thousands of years. And, do that defining in a way that is robust to extreme optimization, and is robust to ontological updates by entities that know a lot more about us and the universe than we know.”

Posts like AGI Ruin: A List of Lethalities (and, basically the entire LessWrong sequences) have tried to argue this point. Here is my attempt at a quick recap of why this is important:

  • Pointing:

    • You need to successfully point the AI at anything at all. (This may superficially seem like it’s working with current LLMs, but it isn’t actually anywhere close to robust enough to hold up)

    • You need to point the AI at some kind of nuanced abstract target, in particular, that remains stable as the AI updates its ontology.

    • (You also eventually need to point the AI at a cluster of messy human-value-concepts in particular. Though from what I gather, MIRI-ish people think if you get the first two things, this last part isn’t actually that hard)

  • Unbounded optimization is dangerous. You need to do all of this with almost perfect mathematical robustness, because extreme levels of optimization are incredibly dangerous (in a way that is counterintuitive to most people)

  • Interpersonal Resolution. It may not be required initially (depending on your plan), but ultimately, at some point your AI, or civilization of AIs, needs to successfully navigate human values, in a way that doesn’t just work for one person, but somehow deals with interpersonal differences in values.

  • Intrinsic Goal-directedness. You have to do this even if you build an “oracle AI.” If your oracle is sufficiently powerful to be useful (i.e. can quickly make new scientific breakthroughs), it must think in terms similar to “having goals” and “achieving particular consequences”, because that is necessary for efficiently solving complex problems.[6]

  • Someone will totally build reckless AI. You do have to deal with all these problems, because even if you can get controlled human-level (or moderately superhuman) AI… sooner or later someone is going to build a reckless, goal-directed AI and not even bother isolating it from the internet.

  • Pivotal necessity. To deal with reckless AI, you need a lot of optimization power, to robustly prevent major companies and nation-states, and eventually small hard-to-identify actors, from building reckless AI. Maybe you need a pivotal act. Maybe it’s better to frame it as a pivotal process. It seems good if it’s more of a collaborative thing that world nations decide on together, but you need a pivotal something, and even if you have a lot of buy-in, it needs to be able to contend with people/companies/nations that aren’t bought in.

We need AI (or, at least some extremely powerful technology) that can help us end the acute risk period.

“Okay, but what does the alignment difficulty curve look like at the point where AI is powerful enough to start being useful for Acute Risk Period reduction?”

A counterpoint I got from Anthropic employee Drake Thomas (note: not speaking for Anthropic or anything, just discussing his own models), is something like:

Sure, I buy that unbounded optimization is hella dangerous, and that you can’t naively scale our current processes to handle it. This seems very obvious.

But, the question here is not “can we handle unbounded optimization right away?”. The question is “can we use ~medium-power AI to incrementally help us handle optimization needed to execute a pivotal act?” (It’s notably not even “get us all the way to safe unbounded superintelligence” – if we have a pivotal act, we can then take more time to get things right before we try to handle that case).

Can an IQ 150 AI help align an IQ 155 AI, in a way that is clear and legibly good to humans? Can an IQ 155 AI do that for an IQ 160 AI? Where do we expect this to break down?

What “IQ” of an AI would we need to do a pivotal act? Is it before or after the MIRI folk expect a sharp left turn? Why do they think that?

I think this is a kind of reasonable point, and I’m interested in takes on it from people in the more MIRI-ish crowd. But I still feel quite worried.

Are there any pivotal acts that aren’t philosophically loaded?

A lot of hope (I think both for me, and in my current understanding of Drake’s take) rests on “can you build an AI that helps invent powerful (non-AI) technology, where it’s sort of straightforward to use that technology to stop people from building reckless AI?”

The only technology that I’ve imagined thus far that feels plausible is “invent uploading or bio enhancements that make humans a lot smarter” (essentially approaching “aligned superintelligence” by starting from humanity, rather than ML).

This does feel intuitively plausible to me, but:

1. I think that, to build AI powerful enough to invent such technology quickly[7], you need to be able to point the AI at abstract targets that are robust under ontological change. I don’t actually expect these to be any easier than the more stereotypical “[moral] philosophy questions people have argued about for thousands of years.”

2. Even if AI powerful enough to help invent such technology is (currently) safe, it’s probably at least pretty close to “would be actively dangerous with slightly different training or scaffolding.”

3. You still end up needing to solve the problem of “prevent humans or AIs from building reckless AI, indefinitely”, even if you kick the can to uplifted human successors. And even if you’ve gotten a lot of buy-in from major governments[8] and such, you do need to be capable of robustly stopping many smart actors who work persistently over a long time. This requires a lot of optimization power, even if it isn’t getting into “unboundedly huge” territory. I dunno – if your plan routes through uplifted human successors figuring it out, I still think you need to start contending with the details of this now.

4. If we need a decades-long pause, then the world will need to successfully notice and orient to that fact. By default, I expect tons of economic and political pressure towards various actors trying to get more AI power, even if there’s broad agreement that it’s dangerous. If AI is “technically philosophically challenging”, then it’s important for Anthropic to understand that, and to use its “seat at the table” strategy to try to (ideally) help convey this (or maybe to “argue from authority”).

5. Even if all of that works, you probably eventually want actual fully superhuman AI solving tons of problems.

Your org culture needs to handle the philosophy

Anthropic is banking on “Medium-strong AI can help us handle Strong AI.” I’m not sure whether the leadership is thinking more about using medium-AI to “figure out how to align strong-AI” or more like using it to “do narrower things that buy us more time.”

To be clear: I am also banking on leveraging “medium-strong AI” to help, at this point. But, I think leveraging it to help align superhuman AI requires directly tackling the philosophically hard parts.

I get the sense that Anthropic research culture is sort of allergic to philosophy. I think this allergy is there for a reason – there are a lot of people making long-winded arguments that are, in fact, BS. The bitter lesson taught the ML community that simple, scalable methods that leverage computation tend to ultimately outperform hand-crafted, theoretically sophisticated approaches.

But when you get to the point where you’re one “accidentally left the AI training too long” away from superintelligence, I think you really do need world-class, technically grounded philosophy of a kind that I’m not sure humanity has even seen yet.

Philosophers-as-a-field have not reached consensus on a lot of important issues. This includes moral “what is The Good” kinds of questions, but also more basic-seeming things like “what even is an object?”. IMO, this doesn’t mean “you can’t trust philosophy”; it means “you should expect it to be hard, and you need to deal with it anyway.”

I’m actually worried that the default result of trying to leverage LLM agents to help solve these sorts of problems is to make people stupider, instead of smarter. My experience is that LLMs have nudged me towards trying to solve problems of the shape that LLMs are good at solving. There’s already a massive set of biases nudging people to substitute easier versions of the alignment problem that don’t help much, and LLMs will exacerbate that.

I can buy “when you have human-level AI, you get to run tons of empirical experiments that tell you a lot about an AI’s psychology and internals” – experiments you don’t get to run on humans. But those experiments need to be pointed towards building a deep, robust understanding in order for it to be safe to scale AI past human level. When I hear vague descriptions of the sort of experiments Anthropic seems to be running, they do not seem pointed in that direction.

(What would count as the right sort of experiments? I don’t really know. I am hoping for some followup debate between prosaic and agent-foundations-esque researchers on the details of this.)

Also, like, you should be way more pessimistic about how organizationally hard this is

I wrote “Carefully Bootstrapped Alignment” is organizationally hard and Recursive Middle Manager Hell kind of with the explicit goal of slowing down how rapidly Anthropic scaled, because I think it’s very hard for large orgs to handle this kind of nuance well.

That didn’t work, at least in the sense that Anthropic did go on to hire a lot of people. Two years ago, when I last talked to a bunch of Anthropic peeps about this, my sense was that there were indeed some of the pathologies I was worried about.

Since then, I’ve been at least somewhat pleasantly surprised by the vague vibes I get about how Dario runs Anthropic (that is to say: it seems like he actually tries to run it according to his models).

I still think it is extremely hard to steer a large organization, and my sense is that Dario’s attention is still naturally occupied by tons of questions about “how to run a generally successful company” (which is already quite hard). I doubt that leaves nearly enough room for either the theoretical question of “what is really needed to align superintelligence?” or “what implications does that have for a company that needs to hit a very narrow research target on a short timeline?”

Listing Cruxes & Followup Debate

Recapping the overall point here:

If alignment is sufficiently hard in particular ways, I don’t think Anthropic’s cluster of strategies makes sense. If alignment is that hard, it doesn’t matter whether China wins the race – China still loses, and so does everyone else. The most important thing would be somehow slowing down across the board.

It doesn’t matter if we get a few years of beneficial technical progress if some US lab or government project eventually runs an AI with fewer safeguards, or if humanity cedes control of its major institutions to AI processes.

The thing I am hoping to come out of the comments here is getting more surface area on people’s cruxes, while engaging with the entirety of the problem. I’d like it if people got more specific about where they disagree.

If you disagree with my framing here, that’s fine, but if so I’d like to see your own framing that engages with the entirety of the problem (and is framed around “what would be sufficient to think Anthropic should significantly change its research or policy comms?”).

Drake’s arguments about “what capability levels do we actually need to execute useful pivotal acts? What does the alignment-difficulty curve look like right around that area?” were somewhat new to me, and I don’t think I’ve seen a thorough writeup from the “AI is very hard” crowd that really engages with them.

I’ll be writing a top-level comment that lists out many of my own cruxes.

  1. ^

    Anthropic people self-report as thinking “alignment may be hard”, but I’m comparing this to the MIRI cluster, who are like “it is so hard your plan is fundamentally bad, please stop.”

  2. ^

    This is a guess after talking with Drake Thomas about his sense of the Anthropic view on comms strategy during the drafting of this post, not a deeply-integrated piece of Ray’s model of Anthropic.

  3. ^

    This is based on comments like in Dario’s post on export controls:

    “If China can’t get millions of chips, we’ll (at least temporarily) live in a unipolar world, where only the US and its allies have these models.

    It’s unclear whether the unipolar world will last, but there’s at least the possibility that, because AI systems can eventually help make even smarter AI systems, a temporary lead could be parlayed into a durable advantage.”

  4. ^

    This is noticeably different than what I’d be spending chips on.

  5. ^

    He added recently:

    On a minute’s reflection, I’d still endorse 2%-5% on this exact claim, though I’d go maybe-much higher on “superalignment in practice takes at least $years of serial research (with models stronger than current SOTA) to solve.”

  6. ^

    The Anthropic beta-readers objected to this one. One said:

    Disagree, at least with the implication for non-myopic goals. I see no serious barrier to systems which competently perform faster-than-human, maybe-smarter-than-human research, as an essentially local tree of tasks with a clear “stop if unsafe” priority on all of them.
    (tbc not claiming this solves all problems!)

  7. ^

    I can imagine AI that merely finds existing relevant facts and arguments correctly and stays logically coherent, without being deeply good at original thinking; that doesn’t require extreme philosophical competence. But I don’t think this will be enough to invent novel tech faster than someone else builds reckless AI.

  8. ^

    I certainly want this, but I don’t know that we’ll actually get it.