Seth Herd

Karma: 7,176

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I’m applying that knowledge to figuring out how we can align AI as developers make it “think for itself” in all the ways that make humans capable and dangerous.

If you’re new to alignment, see the Research Overview section below. Field veterans who are curious about my particular take and approach should see the More on My Approach section at the end of the profile.

Important posts:

On LLM-based agents as a route to takeover-capable AGI:
- LLM AGI will have memory, and memory changes alignment
- Brief argument for short timelines being quite possible
- Capabilities and alignment of LLM cognitive architectures
  - Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
AGI risk interactions with societal power structures and incentives:
- Whether governments will control AGI is important and neglected
- If we solve alignment, do we die anyway?
  - Risks of proliferating human-controlled AGI
- Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
On the psychology of alignment as a field:
- Cruxes of disagreement on alignment difficulty
- Motivated reasoning/confirmation bias as the most important cognitive bias
On technical alignment of LLM-based AGI agents:
- System 2 Alignment on how developers will try to align LLM agent AGI
- Seven sources of goals in LLM agents brief problem statement
- Internal independent review for language model agent alignment
On AGI alignment targets assuming technical alignment:
On communicating AGI risks:

Research Overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we’re not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we’ll have smarter-than-human AIs soon. So we’d better get ready. If their goals don’t align well enough with ours, they’ll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we’re most likely to develop first.

That doesn’t mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won’t get many tries. If it were up to me I’d Shut It All Down, but I don’t see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they’re more autonomous and competent than humans. We’d use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I’ve focused on the interactions between different brain neural networks that are needed to explain complex thought. Here’s a list of my publications.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I’m incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.

More on My Approach

The field of AGI alignment is “pre-paradigmatic.” So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can’t afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans’ cognitive capacities—a “real” artificial general intelligence that will soon be able to outsmart humans.

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can’t misunderstand or re-interpret (value alignment mis-specification), we’ll continue with the alignment target developers currently use: instruction-following. It’s counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there’s no logical reason this can’t be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don’t see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won’t “think” in English. Thus far, I haven’t been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven’t embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they’d have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate of our long-term survival as a species is in the 50% range, too complex to call. That’s despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Seth Herd Aug 7, 2025, 9:54 PM
17 points
10
on: Yes, Rationalism is a Cult
If your definition of cult is any set of beliefs that anyone has ever promoted in bad/dishonest/non-rational ways, then every organization and set of beliefs with more than about two adherents is a cult.

Seth Herd Aug 7, 2025, 9:13 PM
6 points
2
in reply to: Q Home’s comment on: Q Home’s Shortform
I just want to note that humans aren’t aligned by default, so creating human-like reasoning and learning is not itself an alignment method. It’s just a different variant of providing capabilities, which you separately need to point at an alignment target.

It may or may not be easier to align than alternatives. I personally don’t think this matters because I strongly believe that the only type of AGI worth aligning is the type(s) most likely to be developed first. Hoping that industry and society are going to make major changes to AGI development based on which types the researchers think are easier to align seems like a forlorn hope.

More on why it’s a mistake to assume human-like cognition in itself leads to alignment:

Sociopaths/psychopaths are a particularly vivid example of how humans are misaligned. And there are good reasons to think that they are not a special case in which empathy was accidentally left out or deliberately blocked, but that they are the baseline human cognition without the mechanisms that create empathy. It’s tough to make this case for certain, but it’s a very bad idea to assume that humans are aligned by default, so all we’ve got to do is reproduce human-like cognitive mechanisms and maybe train it “in a good family” or similar.

That’s not to argue against human-like approaches to AGI as worse for alignment, just to say that they’re only better in that we have a little better understanding of that type of cognition and some mechanisms by which humans often wind up approximately aligned in common contexts.

My own research is also in using loosely human-like reasoning and learning as a route to alignment, but that’s primarily because a) that’s my background expertise so it’s my relative advantage and b) I think LLMs are very loosely like some parts of the human brain/mind, and that we’ll see continued expansion of LLM agents to reason in more loosely human-like ways (that is, with chains of thought, specific memory lookups, metacognition to organize this, etc.).

So I’m working on aligning loosely human-like cognition not because I think it’s by default any easier than aligning any other form of AGI, but just because that’s what seems most likely to become the first takeover-capable (or pivotal-act-capable) AGI.

Seth Herd Aug 6, 2025, 8:26 PM
2 points
0
in reply to: wdmacaskill’s comment on: Should we aim for flourishing over mere survival? The Better Futures series.
Thank you! I saw that comment and responded there. I said that really clarified the argument and that given that clarification, I largely agree.

My one caveat is noting that if we screw up alignment we could easily kill more than our own chance at flourishing. I think it’s pretty easy to get a paperclipper expanding at near-c and snuffing out all civilizations in our light cone before they get their chance to prosper. So raising our odds of flourishing should be weighed against the risk of messing up a bunch of other civilizations’ chances. One likely non-simulation answer to the Fermi paradox is that we’re early to the game. We shouldn’t lose big if it keeps others from getting to play.

I hadn’t considered this tradeoff closely because in my world models survival and flourishing are still closely tied together. If we solve alignment we probably get near-optimal flourishing. If we don’t, we all die.

I realize there’s a lot of room in between; that model is down to the way I think goals and alignment and human beings work. I think intent alignment is more likely, which would put (a) human(s) in charge of the future. I think most humans would agree that flourishing sounds nice if they had long enough to contemplate it. Very few people are so sociopathic/sadistic that they’d want to not allow flourishing in the very long term.

But that’s just one theory! It’s quite hard to guess and I wouldn’t want to assume that’s correct.

I’ll look in more depth at your ideas of how to play for a big win. I’m sure most of it is compatible with trying our best to survive.

Seth Herd Aug 6, 2025, 8:15 PM
2 points
0
in reply to: Vladimir_Nesov’s comment on: weightt an’s Shortform
Coordination among people isn’t mysterious, but it’s based in large part on properties that AGIs won’t have. That’s why I find hopes of stable collaborations optimistic in the absence of careful analysis of how they could be enforced or otherwise create lasting trust.

Humans collaborate in large part because:
1. We can’t do it all ourselves. AGIs will be able to expand their capabilities and fork as many copies as they have the hardware to run.
2. We like making friends (earning positive social regard) for its own sake. This will only be true of AGIs if we mostly solve alignment, or get quite lucky.
So I’m not saying AGIs couldn’t cooperate, just that it shouldn’t be assumed that they can/will.

In the absence of those properties, they’d need to worry a lot about scheming while striking deals. If the alignment problem wasn’t clearly solved in legible (to them) ways, they don’t know if their collaborators will turn traitor when the time is right. Just like humans, except everyone might be (probably is) a sociopath who can multiply and grow without limit.

Incentives only work as long as the hard constraints of the situation prevent a collaborator from slipping out of them.

Seth Herd Aug 6, 2025, 8:06 PM
2 points
0
in reply to: sortega’s comment on: My current guess at the effect of AI automation on jobs
For all I know they could! I hoped you’d know more. It seems like if enough high-paying jobs start disappearing fast enough it would create a large recession.

Seth Herd Aug 6, 2025, 1:04 PM
2 points
0
on: My current guess at the effect of AI automation on jobs
I definitely applaud the effort. It seems like we should be much more focused on potential job loss because it will do much to shape public opinion on AI.

How much job loss, and how fast, do you think it would take to create a major global economic crash? I’m no economist but it intuitively seems like it wouldn’t take all that much, particularly if the expectation was that job loss would continue.

Seth Herd Aug 6, 2025, 12:46 PM
4 points
3
in reply to: ceba’s comment on: Alcohol is so bad for society that you should probably stop drinking
That’s not the option though; if I stop drinking it will barely affect others’ drinking.

If it were my choice, I’d still consider carefully. Alcohol causes a whole lot of fun and random enthusiasm!

I’d be very cautious of shifting to a less fun, more staid society.

I’d want to focus on changing things to reduce harms in other ways, or making sure there were replacements for those sources of fun and funding and random enthusiasm.

Seth Herd Aug 6, 2025, 4:18 AM
2 points
0
in reply to: Vladimir_Nesov’s comment on: weightt an’s Shortform
Right, thanks! That model of a governor and some tenants undergoing near-perfect surveillance is my only model for a stable long-term future. And sure this can happen at some level above human but below superintelligence.

I was just a little thrown by the multiple superintelligences. It has in the past seemed unlikely to me that we’d wind up with a long-term agreement among different superintelligences vs. one taking over by force. But I can’t be sure!

It’s seemed unlikely to me since the arguments for cooperation among superintelligences don’t seem strong. Reading each other’s source code for perfect trustworthiness seems impossible in a learning network-based ASI. And timeless decision theory being so good that all superintelligences would necessarily follow it also seems implausible.

But I haven’t thought through the game theory plus models of likely superintelligence alignment/goals well enough to be confident, so for all I know cooperating superintelligences is a likely outcome.

Seth Herd Aug 6, 2025, 4:10 AM
3 points
0
in reply to: Jordan Arel’s comment on: “Momentism”: Ethics for Boltzmann Brains
Thanks, I hate it! I guess I hadn’t seen a full presentation of the argument or didn’t remember it. Now I see why it’s troubling enough to want to resolve. Those physics and epistemics arguments seem important but I’m going to resist getting into them just in case I’m real and should be working on solving alignment.

Seth Herd Aug 5, 2025, 9:25 PM
2 points
0
in reply to: Vladimir_Nesov’s comment on: weightt an’s Shortform
I’m unclear on whether you’re saying that there would be a stable equilibrium among ASIs or whether there would be a singleton governing everything and allowing wide latitude of action.

A single AGI can achieve anything if it can self-improve to create ASI without anyone knowing. Working underground or elsewhere in the solar system seems hard to detect and prevent once we have the robotics to seed such an effort, which won’t take long.

I did read and greatly enjoyed your linked post. I do think that’s a plausible and underdeveloped area of thought. I don’t find it all that likely for complex reasons, but it’s definitely worth more thought. I didn’t get around to commenting on it; maybe I’ll go do that to put that discussion in a better place.

Seth Herd Aug 5, 2025, 5:42 PM
2 points
0
in reply to: Canaletto’s comment on: weightt an’s Shortform
Whether it can be made stable is quite central to my models of how we steer toward a good result. So, how do you think it could be made stable? And in that process, could you also make a little rule: “no involuntary torturing or killing or controlling sentient beings; otherwise do what thou wilt”?

Seth Herd Aug 5, 2025, 5:27 PM
2 points
0
in reply to: Canaletto’s comment on: weightt an’s Shortform
It would not be stable. The most vicious actors are incentivized to tell their AGI “hide and self improve and take over at any cost so I can have my preferred future” before anyone else does it.

This extends all the way to tactics like sending the sun nova while storing some brain uploads on a mission to another star to start a new civilization.

Recognizing just how dangerous AGI proliferation is should help us steer away from trying it and having torture clusters (up until the most vicious actor creates their preferred world in the whole light cone—which might well also involve lots of suffering.)

Seth Herd Aug 5, 2025, 5:13 PM
2 points
0
in reply to: Canaletto’s comment on: weightt an’s Shortform
This is an excellent point. It’s one more reason that proliferation of AGI is simply not a viable path. We are right to be very concerned with centralization of power, but the distribution of rapidly and unevenly expanding unregulated power does not contain a stable equilibrium.

Seth Herd Aug 5, 2025, 3:42 AM
4 points
0
on: “Momentism”: Ethics for Boltzmann Brains
You’re not a Boltzmann brain, because even the simplest thought required uncountable successive states to unfold. Even if there’s some simplification, the odds of the time sequence of even the simplest thought unfolding are minuscule raised to the power of the many necessary successive states.

Seth Herd Aug 4, 2025, 10:58 PM
14 points
8
on: Should we aim for flourishing over mere survival? The Better Futures series.
I really wish I could agree. I think we should definitely think about flourishing when it’s a win/win with survival efforts. But saying we’re near the ceiling on survival looks wildly too optimistic to me. This is after very deeply considering our position and the best estimate of our odds, primarily surrounding the challenge of aligning superhuman AGI (including surrounding societal complications).
There are very reasonable arguments to be made about the best estimate of alignment/AGI risk. But disaster likelihoods below 10% really just aren’t viable when you look in detail. And it seems like that’s what you need to argue that we’re near ceiling on survival.
The core claim here is “we’re going to make a new species which is far smarter than we are, and that will definitely be fine because we’ll be really careful how we make it” in some combination with “oh we’re definitely not making a new species any time soon, just more helpful tools”.
When examined in detail, assigning a high confidence to those statements is just as silly as it looks at a glance. That is obviously a very dangerous thing and one we’ll do pretty much as soon as we’re able.
90% plus on survival looks like a rational view from a distance, but there are very strong arguments that it’s not. This won’t be a full presentation of those arguments; I haven’t written it up satisfactorily yet, so here’s the barest sketch.
Here’s the problem: The more people think seriously about this question, the more pessimistic they are.
And time-on-task is the single most important factor for success in every endeavor. It’s not a guarantee but it’s by far the most important factor. It dwarfs raw intelligence as a predictor of success in every domain (although the two are multiplicative).

The “expert forecasters” you cite don’t have nearly the time-on-task of thinking about the AGI alignment problem. Those of us who actually work in that area are very systematically more pessimistic the longer and more deeply we’ve thought about it. There’s not a perfect correlation, but it’s quite large.
This should be very concerning from an outside view.
This effect clearly goes both ways, but that only starts to explain the effect. Those who intuitively find AGI very dangerous are prone to go into the field. And they’ll be subject to confirmation bias. But if they were wrong, a substantial subset should be shifting away from that view after they’re exposed to every argument for optimism. This effect would be exaggerated by the correlation between rationalist culture and alignment thinking; valuing rationality provides resistance (but certainly not immunity!) to motivated reasoning/confirmation bias by aligning one’s motivations with updating based on arguments and evidence.
I am an optimistic person, and I deeply want AGI to be safe. I would be overjoyed for a year if I somehow updated to only 10% chance of AGI disaster. It is only my correcting for my biases that keeps me looking hard enough at pessimistic arguments to believe them based on their compelling logic.
And everyone is affected by motivated reasoning, particularly the optimists. This is complex, but after doing my level best to correct for motivations, it looks to me like the bias effects have far more leeway to work when there’s less to push against. The more evidence and arguments are considered, the less bias takes hold. This is from the literature on motivated reasoning and confirmation bias, which was my primary research focus for a few years and a primary consideration for the last ten.
That would’ve been better as a post or a short form, and more polished. But there it is FWIW, a dashed-off version of an argument I’ve been mulling over for the past couple of years.
I’ll still help you aim for flourishing, since having an optimistic target is a good way to motivate people to think about the future.

Seth Herd Aug 3, 2025, 9:24 PM
−3 points
−7
in reply to: lc’s comment on: lc’s Shortform
That’s like a 1% chance. It seems far more likely that insufficient effort on alignment will have us all dead long before then.

There’s vastly too little effort on alignment and too much on diversified good works in the world at this point. That may be another neglected area that rationalists would be particularly likely to address, but it seems like any way you do the math the EV is going to be way higher on alignment and related AGI navigation issues.

Seth Herd Aug 3, 2025, 6:47 PM
3 points
1
on: Astronomical Waste & Conscientious Objection
I think you’re missing the larger factor in why lots of AGI-pilled people aren’t trying hard to slow progress: we think it won’t work.

There’s a second reasonable approach, and IMO most people are following this one. It’s to try to solve alignment fast enough to outpace the rush toward capabilities.

These efforts include spreading awareness and slowing capabilities if possible, but it’s quite reasonable to make that a secondary priority without any concern for astronomical waste. I don’t think that’s a large factor in many people’s decisions. Concerns about delaying progress should loom large for selfish individuals who would risk the future to avoid their own deaths, but I think most people around here are mostly serious about thinking from a long-term utilitarian perspective.

My grandfather was a conscientious objector during WWII. I think that made sense at the time, but I don’t think the approach is helpful with AGI because the stakes are so large and the likely timelines so short. CO works by setting an example of virtue that resonates through history.

Seth Herd Aug 3, 2025, 6:39 PM
2 points
−8
in reply to: lc’s comment on: lc’s Shortform
I see how it “feels” worth doing, but I don’t think that intuition survives analysis.

Very few realistic timelines now include the next generation contributing to solving alignment. If we get it wrong, the next generation’s capabilities are irrelevant, and if we get it right, they’re still probably irrelevant. I feel like these sorts of projects imply not believing in ASI. This is standard for most of the world, but I am puzzled how LessWrong regulars could still coherently hold that view.

So please help with alignment instead? This doesn’t just need technical work; it needs broad thinkers and those good at starting orgs and spreading ideas, too.

I think we’ve been in a mindset in which we can’t contribute to alignment if we’re not a genius or technically skilled. I think it’s become clear that organization, outreach, and communication also improve our odds nontrivially.

Seth Herd Aug 3, 2025, 6:28 PM
41 points
18
on: Alcohol is so bad for society that you should probably stop drinking
Does this include an analysis of alcohol’s benefits beyond the general acknowledgment in the conclusion? I think they are subtle but powerful. Alcohol is often used as a social-bonding and truth-telling influence. Like caffeine, it doesn’t give more of anything in the long term, but allows users some conscious control over distribution of their moods and energy. And it can shift moods out-of-distribution temporarily toward joy (as well as anger), with potential long-lasting beneficial effects, as the author mentions.

I’m not making an argument that alcohol should remain a part of society, just pointing out that the positive factors need to be carefully considered before making a broad and strong recommendation like this.

I just skimmed because I need to stay focused on work, and I’m aware that alcohol has a staggering list of harms.

Seth Herd Aug 2, 2025, 7:44 PM
2 points
0
in reply to: Richard_Kennaway’s comment on: Why I Just Took The Giving What We Can Pledge
I agree. I pass up pieces with those titles. But I don’t think it applies to this piece. The author didn’t pitch it that way, I did. And in fact I think it is true and important, and that there is a lot of emotional hang-up and resistance to accepting that we are all falling far, far short of the level of ethics we’d like to imagine we have. I have not taken the Giving What We Can pledge, so I don’t exactly agree with the author’s conclusions and I don’t think the logic is nearly tight enough. But I think the question of whether we should all take that pledge or similar things is very much an open one; the claim that we’d be happier if we did seems quite plausible to me and very much worthy of debate on LessWrong. I’m disturbed to see it get downvotes instead of debate; I think if a less sensitive and equally important topic was written about this poorly it would be treated much more kindly. The claim that this has been debated to death already so we need not engage with repetitive and bad versions seems simply false to me. I have lived on LessWrong for the last 3 years and seen no serious debate of this topic in that time, nor a hint of an accepted consensus.

I happen to think that solving alignment trumps all other ethical concerns, so I’m not going to be the one to do a better treatment here. But I am disappointed in the community for being so hostile to this person’s attempts.