LessWrong team member / moderator. I’ve been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I’ve been interested in improving my own epistemic standards and helping others to do so as well.
Raemon
Were you by any chance writing in Cursor? I think they recently changed the UI such that it’s easier to end up in “agent mode” where it sometimes randomly does stuff.
I am kinda intrigued by how controversial this post seems (based on seeing the karma creep upwards and then back down over the past day). I’m curious whether the downvoters tend to be more like:
Anti-Anthropic-ish folk who think the post is way too charitable/soft on Anthropic
Pro-Anthropic-ish folk who think the post doesn’t make very good/worthwhile arguments against Anthropic
“Alignment-is-real-hard” folks who think this post doesn’t represent the arguments for that very well.
“other?”
I agree with this (and think it’s good to periodically say all of this straightforwardly).
I don’t know that it’ll be particularly worth your time, but, the thing I was hoping for with this post was to ratchet the conversation-re-Anthropic forward in, like, “doublecrux-weighted-concreteness.” (i.e. your arguments here are reasonably crux-y and concrete, but don’t seem to be engaging much with the arguments in this post that seemed more novel and representative of where Anthropic employees tend to be coming from, and instead, AFAICT, just repeat your cached arguments against Anthropic)
I don’t have much hope of directly persuading Dario, but I feel some hope of persuading both current and future-prospective employees who aren’t starting from the same prior of “alignment is hard enough that this plan is just crazy”, and for that to have useful flow-through effects.
My experience talking at least with Zac and Drake has been “these are people with real models, who share many-but-not-all-MIRI-ish assumptions but don’t intuitively buy that Anthropic’s downsides are high, and would respond to arguments that were doing more to bridge perspectives.” (I’m hoping they end up writing comments here outlining more of their perspective/cruxes, which they’d previously expressed interest in doing, although I ended up shipping the post quickly without trying to line up everything)
I don’t have a strong belief that contributing to that conversation is a better use of your time than whatever else you’re doing, but it seemed sad to me for the conversation to not at least be attempted.
(I do also plan to write 1-2 posts that are more focused on “here’s where Anthropic/Dario have done things that seem actively bad to me and IMO are damning unless accounted for,” that are less “attempt to maintain some kind of discussion-bridge”, but, it seemed better to me to start with this one)
Yeah I was staring at the poll and went “oh no.” They aren’t often actually used this way so it’s not obviously right to special-case it, although maybe we do polls enough that we should generally present reacts in “first->last posted” order rather than sorting by number.
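(For concreteness, here’s a minimal sketch of what that ordering rule could look like; the names and types are hypothetical, not the actual LessWrong react code.)

```typescript
// Hypothetical sketch: show poll-style reacts in the order they were posted,
// and everything else sorted by how many users applied each react.
interface ReactEntry {
  name: string;        // e.g. "agree", "option-a"
  count: number;       // number of users who applied this react
  firstPostedAt: Date; // when the react first appeared on the comment
}

function sortReacts(reacts: ReactEntry[], isPoll: boolean): ReactEntry[] {
  return [...reacts].sort((a, b) =>
    isPoll
      ? a.firstPostedAt.getTime() - b.firstPostedAt.getTime() // first -> last posted
      : b.count - a.count                                     // most-used first
  );
}
```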
I don’t particularly disagree with the first half, but your second sentence isn’t really a crux for me for the first part.
I think (moderately likely, though not super confident) it makes more sense to model Dario as:
“a person who actually is quite worried about misuse, and is making significant strategic decisions around that (and doesn’t believe alignment is that hard)”
than as “a generic CEO who’s just generally following incentives and spinning narrative post-hoc rationalizations.”
I think I… agree denotationally and (lean towards) disagree connotationally? (like, seems like this is implying “and because he doesn’t seem like he obviously has coherent views on alignment-in-particular, it’s not worth arguing the object level?”)
(to be clear, I don’t super expect this post to affect Dario’s decisionmaking models, esp. directly. I do have at least some hope for Anthropic employees to engage with these sorts of models/arguments, and my sense from talking to them is that a lot of the LW-flavored arguments have often missed their cruxes)
Also, the video you linked has a lot of additional opinionated features that I think are targeting a much more specific group than even “people who aren’t put off by AI”—it would never show up on my youtube.
For frame of reference, do regular movie trailers normally show up in your youtube? This video seemed relatively “mainstream”-vibing to me, although somewhat limited by the medium.
I expect that AIs will be obedient when they initially become capable enough to convince governments that further AI development would be harmful (if it would in fact be harmful).
Seems like “the AIs are good enough at persuasion to persuade governments, and someone is deploying them for that” is right around when you need to have very high confidence that they’re obedient (and don’t have some kind of agenda). If they can persuade governments, they can also persuade you of things.
I also think it gets to a point where I’d sure feel way more comfortable if we had more satisfying answers to “where exactly are we supposed to draw the line between ‘informing’ and ‘manipulating’?” (I’m not 100% sure what you’re imagining here tho)
Does Anthropic shorten timelines by working on automating AI research?
I think “at least a little”, though not actually that much.
There are a lot of other AI companies now, but not that many of them are really frontier labs. I think Anthropic’s presence in the race still puts marginal pressure on companies like OpenAI to rush things out the door with a bit less care than they might have otherwise. (Even if you model other labs as caring ~zero about x-risk, there are still ordinary security/bugginess reasons to delay releases so you don’t launch a broken product. Having more “real” competition seems like it’d make people more willing to cut corners to avoid getting scooped on product releases)
(I also think earlier work by Dario at OpenAI, and the founding of Anthropic in the first place, probably did significantly shorten timelines. But, this factor isn’t significant at this point, and while I’m mad about the previous stuff it’s not actually a crux for their current strategy)
Subquestions:
How many bits does Anthropic leak by doing their research? This is plausibly low-ish. I don’t know that they actually leaked much about reasoning models until after OpenAI and DeepSeek had pretty thoroughly exposed that vein of research.
How many other companies are actually focused on automating AI research, or pushing frontier AI in ways that are particularly relevant? If it’s a small number, then I think Anthropic’s contribution to this race is larger and more costly. I think the main mechanism here might be Anthropic putting pressure on OpenAI in particular (by being one of 2-3 real competitors on ‘frontier AI’, which pushes OpenAI to release things with less safety testing)
Is Anthropic institutionally capable of noticing “it’s really time to stop our capabilities research,” and doing so, before it’s too late?
I know they have the RSP. I think there is a threshold of danger where I believe they’d actually stop.
The problem is, before we get to “if you leave this training run overnight it might bootstrap into deceptive alignment that fools their interpretability and then either FOOMs, or gets deployed” territory, there will be a period of “Well, maybe it might do that, but also The Totalitarian Guys Over There are still working on their training and we don’t want to fall behind”. And meanwhile, it’s also just sort of awkward/difficult[10] to figure out how to reallocate all your capabilities researchers onto non-dangerous tasks.
How realistic is it to have a lead over “labs at more dangerous companies?” (Where “more dangerous” might mean more reckless, or more totalitarian)
This is where I feel particularly skeptical. I don’t get how Anthropic’s strategy of race-to-automate-AI can make sense without actually expecting to get a lead, and with the rest of the world also generally racing in this direction, it seems really unlikely for them to have much lead.
Relatedly… (sort of a subquestion but also an important top-level question)
Does racing towards Recursive Self Improvement make timelines worse (as opposed to merely “shorter”)?
Maybe Anthropic pushing the frontier doesn’t shorten timelines (because there’s already at least a few other organizations who are racing with each other, and no one wants to fall behind).
But, Anthropic being in the race (and, also publicly calling for RSI in a fairly adversarial way, i.e. “gaining a more durable advantage”) might cause there to be more companies and nations explicitly racing for full AGI, and doing so in a more adversarial way, and generally making the gameboard more geopolitically chaotic at a crucial time.
This seems more true to me than the “does Anthropic shorten timelines?” question. I think there are currently few enough labs doing this that a marginal lab going for AGI does make that seem more “real,” and gives FOMO to other companies/countries.[11]
But, given that Anthropic has already basically stated they are doing this, the subquestion is more like:
If Anthropic publicly/credibly shifted away from racing, would that make race dynamics better? I think the answer here is “yes, but, it does depend on how you actually go about it.”
Assuming Anthropic got powerful but controllable ~human-genius-ish level AI, can/will they do something useful with it to end the acute risk period?
In my worldview, getting to AGI only particularly matters if you leverage it to prevent other people from creating reckless/powerseeking AI. Otherwise, whatever material benefits you get from it are short lived.
I don’t know how Dario thinks about this question. This could mean a lot of things. Some ways of ending the acute risk period are adversarial, or unilateralist, and some are more cooperative (either with a coalition of groups/companies/nations, or with most of the world).
This is the hardest to have good models about. Partly it’s just, like, quite a hard problem for anyone to know what it looks like to handle this sanely. Partly, it’s the sort of thing people are more likely to not be fully public about.
Some recent interviews have had him saying “Guys this is a radically different kind of technology, we need to come together and think about this. It’s bigger than one company should be deciding what to do with.” There are versions of this that are more cheap platitude than earnest plea, but I do basically take him at his word here.
He doesn’t talk about x-risk, or much about uncontrollable AI. The “Core views on AI safety” lists “alignment might be very hard” as a major plausibility they are concerned with, and implies it ends up being like 1⁄3 or something of their
Subquestions:
Are there useful things you can do here with controllable power levels of AI? i.e.
Can you get to very high power levels using the set of skills/approaches Anthropic is currently bringing to bear?
Can we muddle through the risk period with incremental, weaker tech and a moderate coalition-size advantage?
Will Anthropic be able to leverage this sanely/safely under time pressure?
Cruxes and Questions
The broad thrust of my questions is:
Anthropic Research Strategy
Does Anthropic’s building towards automated AGI research make timelines shorter (via spurring competition or leaking secrets)...
...or make timelines worse (by inspiring more AI companies or countries to directly target AGI, as opposed to merely trying to cash in on the current AI hype)?
Is it realistic for Anthropic to have enough of a lead to safely build AGI in a way that leads to durably making the world safer?
“Is Technical Philosophy actually that big a deal?”
Can there be pivotal acts that require high AI powerlevels, but not unboundedly high, in a reasonable timeframe, such that they’re achievable without solving The Hard Parts of robust pointing?
Governance / Policy Comms
Is it practical for a western coalition to stop the rest of the world (and governments and other major actors within the western coalition itself) from building reckless or evil AI?
Anthropic, and taking “technical philosophy” more seriously
I would bet they are <1% of the population. Do you disagree, or think they disproportionately matter?
I’m skeptical that there are actually enough people so ideologically opposed to this that it outweighs the upside of driving home, through the medium itself, that capabilities are advancing. (Similar to how, even though tons of people hate FB, few people actually leave.)
I’d be wanting to target a quality level similar to this:
One of the things I track is “ingredients for a good movie or TV show that would actually be narratively satisfying / memetically fit,” that would convey good/realistic AI hard sci-fi to the masses.
One of the more promising strategies within that I can think of is “show multiple timelines” or “flashbacks from a future where the AI wins but it goes slowly enough to be human-narrative-comprehensible” (with the flashbacks being about the people inventing the AI).
This feels like one of the reasonable options for a “future” narrative. (A previous one I was interested in was the “Green goo is plausible” concept)
Also, I think many Richard Ngo stories would lend themselves well to being some kind of cool youtube video, leveraging AI-generated content to make things feel higher budget and also send an accompanying message of “the future is coming, like now.” (King and the Golem was nice, but felt more like a lecture than a video, or something). A problem with AI-generated movies is that the tech isn’t there yet to avoid being slightly uncanny, but I think Ngo stories have a vibe where the uncanniness will be kinda fine.
I also kinda thought this. I actually thought it sounded sufficiently academic that I didn’t realize at first it was your org, instead of some other thing you were supporting.
LW moderators have a policy of generally rejecting LLM stuff, but some things slip through the cracks. (I think maybe LLM writing got a bit better recently and some of the cues I used are less reliable now, so I may have been missing some)
Curated. This was one of the more interesting results from the alignment scene in a while.
I did like Martin Randall’s comment distinguishing “alignment” from “harmless” in the Helpful/Harmless/Honest sense (i.e. the particular flavor of ‘harmlessness’ that got trained into the AI). I don’t know whether Martin’s particular articulation is correct for what’s going on here, but in general it seems important to track that just because we’ve identified some kind of vector, that doesn’t mean we necessarily understand what that vector means. (I also liked that Martin gave some concrete predictions implied by his model)
@kave @habryka
Minor note but I found the opening section hard to read. See: Abstracts should be either Actually Short™, or broken into paragraphs