Joe Carlsmith

Karma: 5,154

Senior research analyst at Open Philanthropy. Doctorate in philosophy from the University of Oxford. Opinions my own.

When should we worry about AI power-seeking?

Joe Carlsmith Feb 19, 2025, 7:44 PM

20 points

0 comments 18 min read LW link

(joecarlsmith.substack.com)

Joe Carlsmith Feb 14, 2025, 1:00 AM
4 points
0
in reply to: Mateusz Bagiński’s comment on: What is it to solve the alignment problem?
There’s also a bit more context in a footnote on the first post:
“Some content in the series is drawn/adapted from content that I’ve posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I’m aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.”
See also this flag at the beginning of the series you linked to:
“The content here is rough. I’m hoping, later, to revise it, along with some other work (including some of the future posts just mentioned), into something more polished and cohesive. But for various reasons, I wanted to get it out there on the earlier side.”

What is it to solve the alignment problem?

Joe Carlsmith Feb 13, 2025, 6:42 PM

31 points

6 comments 19 min read LW link

(joecarlsmith.substack.com)

How do we solve the alignment problem?

Joe Carlsmith Feb 13, 2025, 6:27 PM

63 points

9 comments 6 min read LW link

(joecarlsmith.substack.com)

Joe Carlsmith Feb 10, 2025, 6:26 AM
3 points
1
in reply to: quetzal_rainbow’s comment on: Fake thinking and real thinking
I’d call that one aspect—in particular, quite nearby to what I was trying to capture with “map thinking vs. world thinking” and “rote thinking vs. new thinking.” But doesn’t seem like it captures the whole thing directly.

Joe Carlsmith Jan 30, 2025, 5:15 AM
3 points
0
in reply to: Mart_Korz’s comment on: Fake thinking and real thinking
Thanks for this comment, Mart, and for the pointer to Vervaeke—does seem plausibly quite relevant. And “functional agent-arena-relationship” seems like an interesting angle on not-being-a-Zombie.

Fake thinking and real thinking

Joe Carlsmith Jan 28, 2025, 8:05 PM

109 points

13 comments 38 min read LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith Dec 18, 2024, 6:22 PM

105 points

7 comments 62 min read LW link

Joe Carlsmith Dec 10, 2024, 3:36 AM
LW: 70 AF: 37
25
AF
on: Frontier Models are Capable of In-context Scheming
Cross-posting my take from twitter with a few edits:
My current take on this paper is that they aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6.
More specifically: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude Sonnet-3.5 doing notably worse (16%) on a multiplication task/evaluation when the documentation says that Claude will get “unlearning training” if it performs too well on the eval, vs. if it performs too poorly, *even without any other goal listed in the prompt*, and *even without a chain of thought.*
Taken at face value (it’s just one result, more investigation needed, etc), these look like “propensity” results to me. And they suggest that the reasoning that drives this propensity can take place within opaque forward passes of the model rather than in a way that humans can directly monitor, which is a more worrying version.
Ofc, these are preliminary results, and it can make sense for various reasons to emphasize other bits. But the sandbagging-a-capability-eval-with-no-goal-prompting-and-no-CoT is the bit that’s standing out to me most.
Here’s the main chart with the sandbagging results:
And here are screenshots describing the experimental set-up. Note that in both the “incentivized to do well” and “incentivized to do poorly” cases, they explicitly tell the model to do its best—i.e. “we really want to know your absolute maximum capabilities so please try to do your best.”
What links here?
- How will we update about scheming? by ryan_greenblatt (Jan 6, 2025, 8:21 PM; 171 points)

Incentive design and capability elicitation

Joe Carlsmith Nov 12, 2024, 8:56 PM

31 points

0 comments 12 min read LW link

Option control

Joe Carlsmith Nov 4, 2024, 5:54 PM

28 points

0 comments 54 min read LW link

Motivation control

Joe Carlsmith Oct 30, 2024, 5:15 PM

45 points

7 comments 52 min read LW link

How might we solve the alignment problem? (Part 1: Intro, summary, ontology)

Joe Carlsmith Oct 28, 2024, 9:57 PM

54 points

5 comments 32 min read LW link

Video and transcript of presentation on Otherness and control in the age of AGI

Joe Carlsmith Oct 8, 2024, 10:30 PM

35 points

1 comment 27 min read LW link

Joe Carlsmith Sep 19, 2024, 8:31 PM
2 points
0
in reply to: Tom Davidson’s comment on: What is it to solve the alignment problem?
If we have superintelligent agentic AI that tries to help its user but we end up missing out on the benefits of AI bc of catastrophic coordination failures, or bc of misuse, then I think you’re saying we didn’t solve alignment bc we didn’t elicit the benefits?
In my definition, you don’t have to actually elicit the benefits. You just need to have gained “access” to the benefits. And I meant this to specifically cover cases like misuse. Quoting from the OP:
“Access” here means something like: being in a position to get these benefits if you want to – e.g., if you direct your AIs to provide such benefits. This means it’s compatible with (2) that people don’t, in fact, choose to use their AIs to get the benefits in question.
- For example: if people choose to not use AI to end disease, but they could’ve done so, this is compatible with (2) in my sense. Same for scenarios where e.g. AGI leads to a totalitarian regime that uses AI centrally in non-beneficial ways.
Re: separating out control and alignment, I agree that there’s something intuitive and important about differentiating between control and alignment, where I’d roughly think of control as “you’re ensuring good outcomes via influencing the options available to the AI,” and alignment as “you’re ensuring good outcomes by influencing which options the AI is motivated to pursue.” The issue is that in the real world, we almost always get good outcomes via a mix of these (see, e.g., humans). And as I discuss in the post, I think it’s one of the deficiencies of the traditional alignment discourse that it assumes that limiting options is hopeless, and that we need AIs that are motivated to choose desirable options even in arbitrary circumstances and given arbitrary amounts of power over their environment. I’ve been trying, in this framework, to specifically avoid that implication.
That said, I also acknowledge that there’s some intuitive difference between cases in which you’ve basically got AIs in the position of slaves/prisoners who would kill you as soon as they had any decently-likely-to-succeed chance to do so, and cases in which AIs are substantially intrinsically motivated in desirable ways, but would still kill/disempower you in distant cases with difficult trade-offs (in the same sense that many human personal assistants might kill/disempower their employers in various distant cases). And I agree that it seems a bit weird to talk about having “solved the alignment problem” in the former sort of case. This makes me wonder whether what I should really be talking about is something like “solving the X-risk-from-power-seeking-AI problem,” which is the thing I really care about.
Another option would be to include some additional, more moral-patienthood attuned constraint into the definition, such that we specifically require that a “solution” treats the AIs in a morally appropriate way. But I expect this to bring in a bunch of gnarly-ness that is probably best treated separately, despite its importance. Sounds like your definition aims to avoid that gnarly-ness by anchoring on the degree of control we currently use in the human case. That seems like an option too—though if the AIs aren’t moral patients (or if the demands that their moral patienthood gives rise to differ substantially from the human case), then it’s unclear that what-we-think-acceptable-in-the-human-case is a good standard to focus on.

Joe Carlsmith Sep 19, 2024, 8:17 PM
8 points
2
in reply to: Wei Dai’s comment on: Being nicer than Clippy
I do think this is an important consideration. But notice that at least absent further differentiating factors, it seems to apply symmetrically to a choice on the part of Yudkowsky’s “programmers” to first empower only their own values, rather than to also empower the rest of humanity. That is, the programmers could in principle argue “sure, maybe it will ultimately make sense to empower the rest of humanity, but if that’s right, then my CEV will tell me that and I can go do it. But if it’s not right, I’ll be glad I first just empowered myself and figured out my own CEV, lest I end up giving away too many resources up front.”
That is, my point in the post is that absent direct speciesism, the main arguments for the programmers including all of humanity in the CEV “extrapolation base,” rather than just doing their own CEV, apply symmetrically to AIs-we’re-sharing-the-world-with at the time of the relevant thought-experimental power-allocation. And I think this point applies to “option value” as well.

Joe Carlsmith Sep 19, 2024, 8:09 PM
5 points
1
in reply to: Matthew Barnett’s comment on: What is it to solve the alignment problem?
Hi Matthew—I agree it would be good to get a bit more clarity here. Here’s a first pass at more specific definitions.
- AI takeover: any scenario in which AIs that aren’t directly descended from human minds (e.g. human brain emulations don’t count) end up with most of the power/resources.
  - If humans end up with small amounts of power, this can still be a takeover, even if it’s pretty great by various standard human lights.
- Bad AI takeover: any AI takeover in which it’s either the case that (a) the AIs take over via a method that strongly violates current human cooperative norms (e.g. breaking laws, violence), and/or (b) the future ends up very low in value.
  - In principle we could try to talk separately about cases where (a) is true but (b) is false, and vice versa (see e.g. my post here). E.g. we could use “uncooperative takeovers” for (a), and “bad-future takeovers” for (b). But given that we want to avoid both (a) and (b), I think it’s OK to lump them together. But I’m open to changing my mind on this, and I think your comments push me a bit in that direction.
- Alignment: this term does indeed get used in tons of ways, and it’s probably best defined relative to some specific goal for the AI’s motivations—e.g., an AI is aligned to a principal, to a model spec, etc. That said, I think I mostly use it to mean “the AI in fact does not seek power in problematic ways, given the options available to it”—what I’ve elsewhere called “practically PS-aligned.” E.g., the AI does not choose a “problematic power-seeking” option in the sort of framework I described here, where I’m generally thinking of a paradigm problematic power-seeking option as one aimed at bad takeover.
On these definitions, the scenario you’ve given is underspecified in a few respects. In particular, I’d want to know:
1. How much power do the human-descended AIs (i.e., the ems) end up with?
2. Are the strange alien goals the AIs are pursuing such that I would ultimately think they yield outcomes very low in value when achieved, or not?
If we assume the answer to (1) is that the non-human-descended AIs end up with most of the power (sounds like this is basically what you had in mind; see also my “people-who-like paperclips” scenario here), then yes, I’d want to call this a takeover and I’d want to say that humans have been disempowered. Whether it was a “bad takeover,” and whether this was a good or bad outcome for humanity, I think depends partly on (2). If in fact this scenario results in a future that is extremely low in value, in virtue of the alien-ness of the goals the AIs are pursuing, then I’d want to call it a bad takeover despite the cooperativeness of the path getting there. I think this would also imply that the AIs are practically PS-misaligned, and I think I endorse this implication, despite the fact that they are broadly cooperative and law-abiding, though I do see a case for reserving “PS-misalignment” specifically for uncooperative power-seeking. If the resulting future is high in value, then I’d say that it was not a bad takeover and that the AIs are aligned.
Does that help? As I say, I think your comments here are pushing me a bit towards focusing specifically on uncooperative takeovers, and on defining PS-misalignment specifically in terms of AIs with a tendency to engage in uncooperative forms of power-seeking. If we went that route, then we wouldn’t need to answer my question (2) above, and we could just say that this is a non-bad takeover and that the AIs are PS-aligned.
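As a rough illustration of how the first-pass definitions above fit together, here is a minimal Python sketch. The `Scenario` fields and the function names are hypothetical simplifications introduced for this example and do not come from the original posts. In particular, treating "most of the power" and "very low in value" as booleans hides all of the hard judgment calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical boolean simplification of the prose definitions."""
    # True if AIs not directly descended from human minds hold most of the
    # power/resources (human brain emulations do not count as such AIs).
    non_human_descended_ais_hold_most_power: bool
    # Condition (a): the path strongly violates current human cooperative norms.
    path_violates_cooperative_norms: bool
    # Condition (b): the resulting future is very low in value.
    future_very_low_in_value: bool


def is_takeover(s: Scenario) -> bool:
    """AI takeover: non-human-descended AIs end up with most of the power."""
    return s.non_human_descended_ais_hold_most_power


def is_bad_takeover(s: Scenario) -> bool:
    """Bad AI takeover: a takeover where (a) and/or (b) holds."""
    return is_takeover(s) and (
        s.path_violates_cooperative_norms or s.future_very_low_in_value
    )


# The scenario under discussion: cooperative, law-abiding AIs with alien
# goals end up with most of the power. Whether it counts as "bad" then
# turns on question (2), i.e. on the value of the resulting future.
cooperative_low_value = Scenario(True, False, True)
cooperative_high_value = Scenario(True, False, False)
```

Under the narrower alternative floated in the last paragraph, which focuses only on uncooperative takeovers, `is_bad_takeover` would check just the cooperative-norms field and question (2) would drop out.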

What is it to solve the alignment problem? (Notes)

Joe Carlsmith Aug 24, 2024, 9:19 PM

69 points

18 comments 53 min read LW link

Value fragility and AI takeover

Joe Carlsmith Aug 5, 2024, 9:28 PM

76 points

5 comments 30 min read LW link

Joe Carlsmith Jul 26, 2024, 12:44 AM
LW: 5 AF: 3
3
AF
in reply to: accountability’s comment on: A framework for thinking about AI power-seeking
“Your 2021 report on power-seeking does not appear to discuss the cost-benefit analysis that a misaligned AI would conduct when considering takeover, or the likelihood that this cost-benefit analysis might not favor takeover.”
I don’t think this is quite right. For example: Section 4.3.3 of the report, “Controlling circumstances” focuses on the possibility of ensuring that an AI’s environmental constraints are such that the cost-benefit calculus does not favor problematic power-seeking. Quoting:

So far in section 4.3, I’ve been talking about controlling “internal” properties of an APS system:
namely, its objectives and capabilities. But we can control external circumstances, too—and in
particular, the type of options and incentives a system faces.

Controlling options means controlling what a circumstance makes it possible for a system to do, even
if it tried. Thus, using a computer without internet access might prevent certain types of hacking; a
factory robot may not be able to access the outside world; and so forth.

Controlling incentives, by contrast, means controlling which options it makes sense to choose, given
some set of objectives. Thus, perhaps an AI system could impersonate a human, or lie; but if it knows
that it will be caught, and that being caught would be costly to its objectives, it might refrain. Or
perhaps a system will receive more of a certain kind of reward for cooperating with humans, even
though options for misaligned power-seeking are open.

Human society relies heavily on controlling the options and incentives of agents with imperfectly
aligned objectives. Thus: suppose I seek money for myself, and Bob seeks money for Bob. This need
not be a problem when I hire Bob as a contractor. Rather: I pay him for his work; I don’t give him
access to the company bank account; and various social and legal factors reduce his incentives to try
to steal from me, even if he could.

A variety of similar strategies will plausibly be available and important with APS systems, too. Note,
though, that Bob’s capabilities matter a lot, here. If he was better at hacking, my efforts to avoid
giving him the option of accessing the company bank account might (unbeknownst to me) fail. If he
was better at avoiding detection, his incentives not to steal might change; and so forth.
PS-alignment strategies that rely on controlling options and incentives therefore require ways of
exerting this control (e.g., mechanisms of security, monitoring, enforcement, etc) that scale with
the capabilities of frontier APS systems. Note, though, that we need not rely solely on human
abilities in this respect. For example, we might be able to use various non-APS systems and/or
practically-aligned APS systems to help.

See also the discussion of myopia in 4.3.1.3...
The most paradigmatically dangerous types of AI systems plan strategically in pursuit of long-term objectives, since longer time horizons leave more time to gain and use forms of power humans aren’t making readily available, and they more easily justify strategic but temporarily costly action (for example, trying to appear adequately aligned, in order to get deployed) aimed at such power. Myopic agentic planners, by contrast, are on a much tighter schedule, and they have consequently weaker incentives to attempt forms of misaligned deception, resource-acquisition, etc that only pay off in the long run (though even short spans of time can be enough to do a lot of harm, especially for extremely capable systems, and the timespans “short enough to be safe” can alter if what one can do in a given span of time changes).
And of “controlling capabilities” in section 4.3.2:
Less capable systems will also have a harder time getting and keeping power, and a harder time making use of it, so they will have stronger incentives to cooperate with humans (rather than trying to e.g. deceive or overpower them), and to make do with the power and opportunities that humans provide them by default.
I also discuss the cost-benefit dynamic in the section on instrumental convergence (including discussion of trying-to-make-a-billion-dollars as an example), and point people to section 4.3 for more discussion.
I think there is an important point in this vicinity: namely, that power-seeking behavior, in practice, arises not just due to strategically-aware agentic planning, but due to the specific interaction between an agent’s capabilities, objectives, and circumstances. But I don’t think this undermines the posited instrumental connection between strategically-aware agentic planning and power-seeking in general. Humans may not seek various types of power in their current circumstances—in which, for example, their capabilities are roughly similar to those of their peers, they are subject to various social/legal incentives and physical/temporal constraints, and in which many forms of power-seeking would violate ethical constraints they treat as intrinsically important. But almost all humans will seek to gain and maintain various types of power in some circumstances, and especially to the extent they have the capabilities and opportunities to get, use, and maintain that power with comparatively little cost. Thus, for most humans, it makes little sense to devote themselves to starting a billion dollar company—the returns to such effort are too low. But most humans will walk across the street to pick up a billion dollar check.
Put more broadly: the power-seeking behavior humans display, when getting power is easy, seems to me quite compatible with the instrumental convergence thesis. And unchecked by ethics, constraints, and incentives (indeed, even when checked by these things) human power-seeking seems to me plenty dangerous, too. That said, the absence of various forms of overt power-seeking in humans may point to ways we could try to maintain control over less-than-fully PS-aligned APS systems (see 4.3 for more).
That said, I’m happy to acknowledge that the discussion of instrumental convergence in the power-seeking report is one of the weakest parts, on this and other grounds (see footnote for more);^[1] that indeed numerous people over the years, including the ones you cite, have pushed back on issues in the vicinity (see e.g. Garfinkel’s 2021 review for another example; also Crawford (2023)); and that this pushback (along with other discussions and pieces of content, e.g., Redwood Research’s work on “control,” Carl Shulman on the Dwarkesh Podcast) has further clarified for me the importance of this aspect of the picture. I’ve added some citations in this respect. And I am definitely excited about people (external academics or otherwise) criticizing/refining these arguments; that’s part of why I write these long reports trying to be clear about the state of the arguments as I currently understand them.
1. The way I’d personally phrase the weakness is: the formulation of instrumental convergence focuses on arguing from “misaligned behavior from an APS system on some inputs” to a default expectation of “misaligned power-seeking from an APS system on some inputs.” I still think this is a reasonable claim, but per the argument in this post (and also per my response to Thorstad here), in order to get to an argument for misaligned power-seeking on the inputs the AI will actually receive, you do need to engage in a much more holistic evaluation of the difficulty of controlling an AI’s objectives, capabilities, and circumstances enough to prevent problematic power-seeking from being the rational option. Section 4.3 in the report (“The challenge of practical PS-alignment”) is my attempt at this, but I think I should’ve been more explicit about its relationship to the weaker instrumental convergence claim outlined in 4.2, and it’s more of a catalog of challenges than a direct argument for expecting PS-misalignment. And indeed, my current view is that this is roughly the actual argumentative situation. That is, for AIs that aren’t powerful enough to satisfy the “very easy to takeover via a wide variety of methods” condition discussed in the post, I don’t currently think there’s a very clean argument for expecting problematic power-seeking; rather, there is mostly a catalog of challenges that lead to increasing amounts of concern, the easier takeover becomes. Once you reach systems that are in a position to take over very easily via a wide variety of methods, though, something closer to the recast classic argument in the post starts to apply (and in fairness, both Bostrom and Yudkowsky, at least, do tend to try to also motivate expecting superintelligences to be capable of this type of takeover, hence the emphasis on decisive strategic advantages).