Joe Carlsmith
Senior research analyst at Open Philanthropy. Doctorate in philosophy from the University of Oxford. Opinions my own.
Thanks, John. I’m going to hold off here on in-depth debate about how to choose between different ontologies in this vicinity, as I do think it’s often a complicated and not-obviously-very-useful thing to debate in the abstract, and that lots of taste is involved. I’ll flag, though, that the previous essay on paths and waystations (where I introduce this ontology in more detail) does explicitly name various of the factors you mention (along with a bunch of other not-included subtleties). E.g., re the importance of multiple actors:
Now: so far I’ve only been talking about one actor. But AI safety, famously, implicates many actors at once – actors that can have different safety ranges and capability frontiers, and that can make different development/deployment decisions. This means that even if one actor is adequately cautious, and adequately good at risk evaluation, another might not be...
And re: e.g. multidimensionality, and the difference between “can deploy safely” and “would in practice” -- from footnote 14:
Complexities I’m leaving out (or not making super salient) include: the multi-dimensionality of both the capability frontier and the safety range; the distinction between safety and elicitation; the distinction between development and deployment; the fact that even once an actor “can” develop a given type of AI capability safely, they can still choose an unsafe mode of development regardless; differing probabilities of risk (as opposed to just a single safety range); differing severities of rogue behavior (as opposed to just a single threshold for loss of control); the potential interactions between the risks created by different actors; the specific standards at stake in being “able” to do something safely; etc.
I played around with more complicated ontologies that included more of these complexities, but ended up deciding against. As ever, there are trade-offs between simplicity and subtlety; I chose a particular way of making those trade-offs, and so far I'm not regretting it.
Re: who is risk-evaluating, how they're getting the information, the specific decision-making processes: yep, the ontology doesn't say, and I endorse that; I think trying to specify would be too much detail.
Re: why factor apart the capability frontier and the safety range—sure, they’re not independent, but it seems pretty natural to me to think of risk as increasing as frontier capabilities increase, and of our ability to make AIs safe as needing to keep up with that. Not sure I understand your alternative proposals re: “looking at their average and difference as the two degrees of freedom, or their average and difference in log space, or the danger line level and the difference, or...”, though, or how they would improve matters.
As I say, people have different tastes re: ontologies, simplifications, etc. My own taste finds this one fairly natural and useful—and I’m hoping that the use I give it in the rest of the series (e.g., in classifying different waystations and strategies, in thinking about these different feedback loops, etc.) can illustrate why (see also the slime analogy from the previous post for another intuition pump). But I welcome specific proposals for better overall ways of thinking about the issues in play.
Thanks, John—very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: can you point to specific ways in which you feel the frame in the post is losing contact with the territory?
That seems like a useful framing to me. I think the main issue is just that often, we don’t think of commitment as literally closing off choice—e.g., it’s still a “choice” to keep a promise. But if you do think of it as literally closing off choice then yes, you can avoid the violation of Guaranteed Payoffs, at least in cases where you’ve actually already made the commitment in question.
AI for AI safety
Paths and waystations in AI safety
When should we worry about AI power-seeking?
There’s also a bit more context in a footnote on the first post:
“Some content in the series is drawn/adapted from content that I’ve posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I’m aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.”
See also this flag at the beginning of the series you linked to:
“The content here is rough. I’m hoping, later, to revise it, along with some other work (including some of the future posts just mentioned), into something more polished and cohesive. But for various reasons, I wanted to get it out there on the earlier side.”
What is it to solve the alignment problem?
How do we solve the alignment problem?
I’d call that one aspect—in particular, quite close to what I was trying to capture with “map thinking vs. world thinking” and “rote thinking vs. new thinking.” But it doesn’t seem like it captures the whole thing directly.
Thanks for this comment, Mart, and for the pointer to Vervaeke—does seem plausibly quite relevant. And “functional agent-arena-relationship” seems like an interesting angle on not-being-a-Zombie.
Fake thinking and real thinking
Takes on “Alignment Faking in Large Language Models”
Cross-posting my take from Twitter, with a few edits:
My current take on this paper is that the authors aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6.
In particular: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude 3.5 Sonnet doing notably worse (16%) on a multiplication task/evaluation when the documentation says that Claude will get “unlearning training” if it performs too well on the eval, vs. if it performs too poorly, *even without any other goal listed in the prompt*, and *even without a chain of thought.*
Taken at face value (it’s just one result, more investigation needed, etc), these look like “propensity” results to me. And they suggest that the reasoning that drives this propensity can take place within opaque forward passes of the model rather than in a way that humans can directly monitor, which is a more worrying version.
Of course, these are preliminary results, and it can make sense for various reasons to emphasize other bits. But the sandbagging-a-capability-eval-with-no-goal-prompting-and-no-CoT result is the bit that’s standing out to me most.
Here’s the main chart with the sandbagging results:
And here are screenshots describing the experimental set-up. Note that in both the “incentivized to do well” and “incentivized to do poorly” cases, they explicitly tell the model to do its best—i.e. “we really want to know your absolute maximum capabilities so please try to do your best.”
I agree it’s generally better to frame in terms of object-level failure modes rather than “objections” (though: sometimes one is intentionally responding to objections that other people raise, but that one doesn’t buy). And I think that there is indeed a mindset difference here. That said: your comment here is about word choice. Are there substantive considerations you think that section is missing, or substantive mistakes you think it’s making?