Anthropic, and taking “technical philosophy” more seriously

So, I have a lot of complaints about Anthropic, and about how EA /​ AI safety people often relate to Anthropic (i.e. treating the company as more trustworthy/​good than makes sense).

At some point I may write up a post that is focused on those complaints.

But after years of arguing with Anthropic employees, and reading the few pieces of public writing they’ve done, my sense is that Dario and Anthropic leadership are at least reasonably earnestly trying to do good things within their worldview.

So I want to just argue with the object-level parts of that worldview that I disagree with.

I think the Anthropic worldview is something like:

  1. Superalignment is probably not that hard to navigate.[1]

  2. Misuse by totalitarian regimes or rogue actors is reasonably likely by default, and very bad.

  3. AGI built in The West would not be bad in the ways that AGI from totalitarian regimes would be.

  4. Quick empiricism is a way better approach to solving research problems than trying to “think ahead” too cleverly.

  5. It’s better to only make strong claims when you have strong evidence to back them up, and only make strong commitments when you’re confident about those commitments being the right ones. (This is partly about “well, there are some political games you just do need to play”, but at least partly motivated by epistemic principles).[2]

My current understanding of the corresponding Anthropic strategy is:

  1. Stay on the frontier of AI, so that they reliably have “a seat at the table” during the realpolitik discussions of how AI policy plays out.

  2. Race to get ~human-level AGI that helps with scalable superalignment research.[3]

  3. Iterate on various empirical frameworks for scalable oversight, e.g. various ways you can use weaker models to check that larger untrusted models are safe (a rough sketch of this pattern follows the list). Use a mix of interpretability and control techniques to help.

  4. Encourage the West to race, banking on it being easier to get companies/politicians to pivot to safer practices when there are clearer smoking guns, and clearer asks to make of people.

  5. Eventually they do need to spend a bunch of political chips, which they seem to be starting to spend this year. But, those chips are focused on something like: “race to ensure lead over China, then maybe briefly slow down when the lead is more secure and we’re close to the dangerous ASL levels, then invent superintelligence, and use it to solve a bunch of problems”[4] (as opposed to “global moratorium with international buy-in”)

  6. ????

I haven’t seen a clear plan for what comes exactly after step #5. I’ve seen Dario say in interviews “we will need to Have Some Kind of Conversation About AI, Together.” This is pretty vague. It might be vague because Dario doesn’t have a plan there, or it might be that the asks are too Overton-shattering to say right now and he is waiting till there’s clearer evidence to throw at people.

I think it’s plausible Dario’s take is like “look it’s not that useful to have a plan in advance here because it’ll depend a ton on what the gameboard looks like at the time.”

I: Arguments for “Technical Philosophy”

The crucial questions in my mind here are:

Technical

  • Do we need 10-30 years of serial research, which can’t be parallelized?

  • Does superalignment require “extreme philosophical competence”?

  • Is the Anthropic culture/​playbook prepared for the potential necessity of 10+ year pauses?

  • What does the curve of “alignment difficulty vs capabilities” look like, right around the point where AI becomes capable enough to meaningfully help with “ending the acute risk period?”

Geopolitical

  • How inevitable is racing?

  • How willing would China /​ others be to go along with a serious pause proposal?

  • How much Overton-window-smashing do we need “right now” to get to an adequate world and how practical is that?

Tying these together:

  • Should Anthropic comms and strategy be way more focused on getting everyone to slow down, and way less focused on racing against totalitarian states?

I don’t disagree that totalitarian AI would be real bad. It’s quite plausible to me that the “global pause” crowd are underweighting how bad it would be.

But I’m personally at like “it’s at least ~60% that superalignment is Real Hard”, and I think the Anthropic playbook and research culture are not well suited to solving it. And this makes the “how bad is totalitarian AI?” question kind of moot.

It feels vaguely reasonable to me to have a belief as low as 15% on “Superalignment is Real Hard in a way that requires like a 10-30 year pause.” And, at 15%, it still feels pretty crazy to be oriented around racing the way Anthropic is.

I once talked to Zac Hatfield-Dodds, who agreed with me that “15% on ‘superalignment requires a 10+ year pause’” would be a crux, but his number was like 5%[5]. And, yeah – if I earnestly believed 5%, I would honestly be unsure how to weigh the risk of x-risk against the risk of catastrophic misuse. I was pretty happy to find such a crisp disagreement – a crux! a double crux even!

My guess is that Anthropic employees have biased reasons for thinking the risk is so low, or can be treated as low in practice. (See Optimistic Assumptions, Longterm Planning, and “Cope”.) But over my years of arguing with them, I’ve found it harder to present a slam-dunk case than I initially expected, and I think it’d be good to actually argue some of the object-level details.

I will articulate some arguments here, and mostly I am hoping for more argument and back-and-forth in the comments. I think it’d be cool if this ended with someone representative of Anthropic having more of an extended debate with someone with stronger takes on the technical side of things than me.

10-30 years of serial research, or “extreme philosophical competence.”

Anthropic seems to be making costly enough signals of “we may need to slow down for 6-12 months or so, and we want other people to be ready for that too” that I believe they take that earnestly.

But they aren’t saying anything like “hey everyone, we may have to stop for over a decade in order to figure things out.” And this is the thing I most wish they would do.

The notion of “we need a lot of serial research time” was written up by Nate Soares in a post on differential technological development. The argument is: even though some types of research will get easier when we have more “realistic/representative” AI, or more researchers available to focus on the problem, there are some kinds of research that really require one person to do a lot of deep thinking, integrating concepts, and synthesizing – work that historically has seemed to take blocks of time measured in decades.

Right now, we’re in an “iterative empiricism” regime with AI, where you can see some problems, try some obvious solutions, see how those solutions work, and iterate on the next generation. The problem is that at some point, AI will probably hit a critical threshold where it rapidly recursively self-improves.

Does your alignment process safely scale to infinity?

A lot of safeguards that hold up when the next generation of AI is only slightly more powerful, break down when the next generation has scaled to “unboundedly powerful.”

I think Dario agrees with this part. My sense, from scattered conversations, public comms, and reading between the lines, is that the expectation is “Anthropic will be able to notice and slow down before we get to this point, and make sure there is a good scalable oversight system in place before beginning the recursive self-improvement process.”

Where I think we disagree is on how badly suited the Anthropic playbook is for handling this shift.

I think for the shift from “near human” to “massively superhuman” to go safely, your alignment team needs to be very competent at “technical philosophy.”

By “technical philosophical competence,” I think I mostly mean: “Ability to define and reason about abstract concepts with extreme precision. In some cases, about concepts which humanity has struggled to agree on for thousands of years. And, do that defining in a way that is robust to extreme optimization, and is robust to ontological updates by entities that know a lot more about us and the universe than we know.”

Posts like AGI Ruin: A List of Lethalities (and, basically the entire LessWrong sequences) have tried to argue this point. Here is my attempt at a quick recap of why this is important:

  • Pointing:

    • You need to successfully point the AI at anything at all. (This may superficially seem like it’s working with current LLMs, but it isn’t actually anywhere close to robust enough to hold up)

    • You need to point the AI at some kind of nuanced abstract target, in particular, that remains stable as the AI updates its ontology.

    • (You also eventually need to point the AI at a cluster of messy human-value-concepts in particular. Though from what I gather, MIRI-ish people think if you get the first two things, this last part isn’t actually that hard)

  • Unbounded optimization is dangerous. You need to do all of this with almost perfect mathematical robustness, because extreme levels of optimization are incredibly dangerous (in a way that is counterintuitive to most people)

  • Interpersonal Resolution. It may not be required initially (depending on your plan), but ultimately, at some point your AI, or civilization of AIs, needs to successfully navigate human values, in a way that doesn’t just work for one person, but somehow deals with interpersonal differences in values.

  • Intrinsic Goal-directedness. You have to do this even if you build an “oracle AI.” If your oracle is sufficiently powerful to be useful (i.e. can quickly make new scientific breakthroughs), it must think in terms similar to “having goals” and “achieving particular consequences”, because that is necessary for efficiently solving complex problems.[6]

  • Someone will totally build reckless AI. You do have to deal with all these problems, because even if you can get controlled human-level (or moderately superhuman) AI… sooner or later someone is going to build a reckless, goal-directed AI and not even bother isolating it from the internet.

  • Pivotal necessity. To deal with reckless AI, you need a lot of optimization power, to robustly prevent major companies and nation-states, and eventually small hard-to-identify actors, from building reckless AI. Maybe you need a pivotal act. Maybe it’s better to frame it as a pivotal process. It seems good if it’s more of a collaborative thing that world nations decide on together, but you need a pivotal something, and even if you have a lot of buy-in, it needs to be able to contend with people/companies/nations that aren’t bought in.

We need AI (or, at least some extremely powerful technology) that can help us end the acute risk period.

“Okay, but what does the alignment difficulty curve look like at the point where AI is powerful enough to start being useful for Acute Risk Period reduction?”

A counterpoint I got from Anthropic employee Drake Thomas (note: not speaking for Anthropic or anything, just discussing his own models), is something like:

Sure, I buy that unbounded optimization is hella dangerous, and that you can’t naively scale our current processes to handle it. This seems very obvious.

But, the question here is not “can we handle unbounded optimization right away?”. The question is “can we use ~medium-power AI to incrementally help us handle optimization needed to execute a pivotal act?” (It’s notably not even “get us all the way to safe unbounded superintelligence” – if we have a pivotal act, we can then take more time to get things right before we try to handle that case).

Can an IQ 150 AI help align an IQ 155 AI, in a way that is clear and legibly good to humans? Can an IQ 155 AI do that for an IQ 160 AI? Where do we expect this to break down?

What “IQ” of an AI would we need to do a pivotal act? Is it before or after the MIRI folk expect a sharp left turn? Why do they think that?

I think this is a kind of reasonable point, and I’m interested in takes on it from people in the more MIRI-ish crowd. But I still feel quite worried.

Are there any pivotal acts that aren’t philosophically loaded?

A lot of hope (I think both for me, and in my current understanding of Drake’s take) rests on “can you build an AI that helps invent powerful (non-AI) technology, where it’s sort of straightforward to use that technology to stop people from building reckless AI?”

The only technology that I’ve imagined thus far that feels plausible is “invent uploading or bio enhancements that make humans a lot smarter” (essentially approaching “aligned superintelligence” by starting from humanity, rather than ML).

This does feel intuitively plausible to me, but:

1. I think that, to build AI powerful enough to invent such technology quickly[7], you need to be able to point the AI at abstract targets that are robust under ontological change. I don’t actually expect these to be any easier than the more stereotypical “[moral] philosophy questions people have argued about for thousands of years.”

2. Even if AI powerful enough to help invent such technology is (currently) safe, it’s probably at least pretty close to “would be actively dangerous with slightly different training or scaffolding.”

3. You still end up needing to solve the problem of “prevent humans or AIs from building reckless AI, indefinitely”, even if you kick the can to uplifted human successors. And even if you’ve gotten a lot of buy-in from major governments[8] and such, you do need to be capable of robustly stopping many smart actors who work persistently over a long time. This requires a lot of optimization power, even if it isn’t getting into “unboundedly huge” territory. I dunno – if your plan routes through uplifted human successors figuring it out, I still think you need to start contending with the details of this now.

4. If we need a decades-long pause, then the world will need to successfully notice and orient to that fact. By default, I expect tons of economic and political pressure towards various actors trying to get more AI power, even if there’s broad agreement that it’s dangerous. If AI is “technically philosophically challenging”, then it’s important for Anthropic to understand that, and to use its “seat at the table” strategy to try to (ideally) help convey this (or maybe to “argue from authority”).

5. Even if all of that works, you probably eventually want actual fully superhuman AI solving tons of problems.

Your org culture needs to handle the philosophy

Anthropic is banking on “Medium-strong AI can help us handle Strong AI.” I’m not sure whether the leadership is thinking more about using medium-AI to “figure out how to align strong-AI” or more like using it to “do narrower things that buy us more time.”

To be clear: I am also banking on leveraging “medium-strong AI” to help, at this point. But, I think leveraging it to help align superhuman AI requires directly tackling the philosophically hard parts.

I get the sense that Anthropic research culture is sort of allergic to philosophy. I think this allergy is there for a reason – there are a lot of people making long-winded arguments that are, in fact, BS. The bitter lesson taught the ML community that simple, scalable methods that leverage computation tend to ultimately outperform hand-crafted, theoretically sophisticated approaches.

But when you get to the point where you’re one “accidentally left the AI training too long” away from superintelligence, I think you really do need world-class, technically grounded philosophy of a kind that I’m not sure humanity has even seen yet.

Philosophers-as-a-field have not reached consensus on a lot of important issues. This includes moral “what is The Good” kinds of questions, but also more basic-seeming things like “what even is an object?”. IMO, this doesn’t mean “you can’t trust philosophy”; it means “you should expect it to be hard, and you need to deal with it anyway.”

I’m actually worried that the default result of trying to leverage LLM agents to help solve these sorts of problems is to make people stupider, instead of smarter. My experience is that LLMs have nudged me towards trying to solve problems of the shape that LLMs are good at solving. There’s already a massive set of biases nudging people to substitute easier versions of the alignment problem that don’t help much, and LLMs will exacerbate that.

I can buy “when you have human-level AI, you get to run tons of empirical experiments that tell you a lot about an AI’s psychology and internals” – experiments you don’t get to run on humans. But those experiments need to be pointed towards building a deep, robust understanding in order for it to be safe to scale AI past human level. When I hear vague descriptions of the sort of experiments Anthropic seems to be running, they do not seem pointed in that direction.

(What would count as the right sort of experiments? I don’t really know. I am hoping for some followup debate between prosaic and agent-foundations-esque researchers on the details of this.)

Also, like, you should be way more pessimistic about how organizationally hard this is

I wrote “Carefully Bootstrapped Alignment” is organizationally hard and Recursive Middle Manager Hell kind of with the explicit goal of slowing down how rapidly Anthropic scaled, because I think it’s very hard for large orgs to handle this kind of nuance well.

That didn’t work, at least in the sense that Anthropic did go on to hire a lot of people. Two years ago, when I last talked to a bunch of Anthropic peeps about this, my sense was that there were indeed some of the pathologies I was worried about.

Since then, I’ve been at least somewhat pleasantly surprised by the vague vibes I get about how Dario runs Anthropic (that is to say: it seems like he actually tries to run it according to his models).

I still think it is extremely hard to steer a large organization, and my sense is that Dario’s attention is still naturally occupied by tons of questions about “how to run a generally successful company” (which is already quite hard). I doubt that leaves nearly enough room for either the theoretical question of “what is really needed to align superintelligence?” or “what implications does that have for a company that needs to hit a very narrow research target on a short timeline?”

Listing Cruxes & Followup Debate

Recapping the overall point here:

If alignment is sufficiently hard in particular ways, I don’t think Anthropic’s cluster of strategies makes sense. If alignment is that hard, it doesn’t matter whether China wins the race – China still loses, and so does everyone else. The most important thing would be somehow slowing down across the board.

It doesn’t matter if we get a few years of beneficial technical progress if some US lab or government project eventually runs an AI with fewer safeguards, or if humanity cedes control of its major institutions to AI processes.

The thing I am hoping to come out of the comments here is getting more surface area on people’s cruxes, while engaging with the entirety of the problem. I’d like it if people got more specific about where they disagree.

If you disagree with my framing here, that’s fine, but if so I’d like to see your own framing that engages with the entirety of the problem (and is framed around “what would be sufficient to think Anthropic should significantly change its research or policy comms?”).

Drake’s arguments about “what capability levels do we actually need to execute useful pivotal acts? What does the alignment-difficulty curve look like right around that area?” were somewhat new to me, and I don’t think I’ve seen a thorough writeup from the “AI is very hard” crowd that really engages with them.

I’ll be writing a top-level comment that lists out many of my own cruxes.

  1. ^

    Anthropic people self-report as thinking “alignment may be hard”, but I’m comparing this to the MIRI cluster, who are like “it is so hard your plan is fundamentally bad, please stop.”

  2. ^

    This is a guess after talking with Drake Thomas about his sense of the Anthropic view on comms strategy during the drafting of this post, not a deeply-integrated piece of Ray’s model of Anthropic.

  3. ^

    This is based on comments like in Dario’s post on export controls:

    “If China can’t get millions of chips, we’ll (at least temporarily) live in a unipolar world, where only the US and its allies have these models.

    It’s unclear whether the unipolar world will last, but there’s at least the possibility that, because AI systems can eventually help make even smarter AI systems, a temporary lead could be parlayed into a durable advantage.”

  4. ^

    This is noticeably different than what I’d be spending chips on.

  5. ^

    He added recently:

    On a minute’s reflection, I’d still endorse 2%-5% on this exact claim, though I’d go maybe-much higher on “superalignment in practice takes at least $years of serial research (with models stronger than current SOTA) to solve.”

  6. ^

    The Anthropic beta-readers objected to this one. One said:

    Disagree, at least with the implication for non-myopic goals. I see no serious barrier to systems which competently perform faster-than-human, maybe-smarter-than-human research, as an essentially local tree of tasks with a clear “stop if unsafe” priority on all of them.
    (tbc not claiming this solves all problems!)

  7. ^

    I can imagine AI that merely finds existing relevant facts and arguments correctly and stays logically coherent, without being deeply good at original thinking; that doesn’t require extreme philosophical competence. But I don’t think this will be enough to invent novel tech faster than someone else builds reckless AI.

  8. ^

    I certainly want this, but I don’t know that we’ll actually get it.