I did computational cognitive neuroscience research from completing my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, focusing on the emergent interactions needed to explain complex thought. I became increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
I think future more powerful/useful AIs will understand our intentions better IF they are trained to predict language. Text corpora contain rich semantics about human intentions.
I can imagine other AI systems that are trained differently, and I would be more worried about those.
That’s what I meant by current AI understanding our intentions possibly better than future AI.
This is an excellent point.
While LLMs seem (relatively) safe, we may very well blow right on by them soon.
I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values). Those would be retained in improved architectures that still predict human language use. If such a system’s thought process was entirely opaque, we could no longer perform Externalized reasoning oversight by “reading its thoughts”.
But I think it might be possible to build a reliable agent from unreliable parts. I think humans are such agents, and evolution made us this way because it’s a way to squeeze extra capability out of a set of base cognitive capacities.
Imagine an agentic set of scaffolding that merely calls the super-LLM for individual cognitive acts. Such an agent would use a hand-coded “System 2” thinking approach to solve problems, like humans do. That involves breaking a problem into cognitive steps. We also use System 2 for our biggest ethical decisions; we predict consequences of our major decisions, and compare them to our goals, including ethical goals. Such a synthetic agent would use System 2 for problem-solving capabilities, and also for checking plans for how well they achieve goals. This would be done for efficiency; spending a lot of compute or external resources on a bad plan would be quite costly. Having implemented it for efficiency, you might as well use it for safety.
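Here’s a minimal sketch of the shape I have in mind, in Python. The `llm()` function and all the prompt strings are hypothetical placeholders, not any real API:

```python
def llm(prompt: str) -> str:
    """Stand-in for a call to the underlying model; wire up a real one here."""
    raise NotImplementedError

def solve(problem: str, goals: str, max_attempts: int = 3) -> str | None:
    for _ in range(max_attempts):
        # System 2: break the problem into explicit cognitive steps.
        steps = llm(f"List the steps needed to solve:\n{problem}").splitlines()

        # One model call per cognitive act, accumulating a plan.
        plan: list[str] = []
        for step in steps:
            context = "Plan so far:\n" + "\n".join(plan)
            plan.append(llm(f"{context}\nCarry out this step: {step}"))

        # Predict consequences and compare them to goals (including ethical
        # goals). The same check serves efficiency and safety: a bad plan
        # is costly whether or not it's dangerous.
        plan_text = "\n".join(plan)
        consequences = llm(f"Predict the consequences of this plan:\n{plan_text}")
        verdict = llm(
            f"Goals: {goals}\nPredicted consequences: {consequences}\n"
            f"Do the consequences satisfy the goals? Answer PASS or FAIL."
        )
        if verdict.strip().startswith("PASS"):
            return plan_text
    return None  # no plan passed review; don't act
```

The point of the sketch is just that the consequence-prediction step you’d build anyway for efficiency is the same step you’d use for a safety check.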
This is just restating stuff I’ve said elsewhere, but I’m trying to refine the model, and work through how well it might work if you couldn’t apply any external reasoning oversight, and little to no interpretability. It’s definitely bad for the odds of success, but not necessarily crippling. I think.
This needs more thought. I’m working on a post on System 2 alignment, as sketched out briefly (and probably incomprehensibly) above.
Please just wait until you have the podcast link before posting these to LW? If you went to the trouble of making a podcast, we probably don’t want to read it as text.
This is now available as a podcast if you search. I don’t have the RSS feed link handy.
I agree, I have heard that claim many times, probably including the vague claim that it’s “more dangerous” than a poorly-defined imagined alternative. A bunch of pessimistic stuff in the vein of List of Lethalities focuses on reinforcement learning, analyzing how and why that is likely to go wrong. That’s what started me thinking about true alternatives.
So yes, that does clarify why you’ve framed it that way. And I think it’s a useful question.
In fact, I would’ve been prone to say “RL is unsafe and shouldn’t be used”. Porby’s answer to your question is insightful; it notes that other types of learning aren’t that different in kind. It depends how the RL or other learning is done.
One reason that non-RL approaches (at least the few I know of) seem safer is that they rely on prediction or other unsupervised learning to create good, reliable representations of the world, including goals for agents. That type of learning is typically better because you can do more of it: it doesn’t need human-labeled data, which is always many orders of magnitude scarcer than data gathered from sensing the world (e.g., language input for LLMs, images for vision, etc.). The other alternative is a reward-labeling algorithm that can attach reward signals to any data, but that seems unreliable, in that we don’t have even good guesses at an algorithm that can identify human values, or even reliably identify instruction-following.
Surely asking if anything is safer is only sensible when comparing it to something. Are you comparing it to some implicit expected-if-not RL method of alignment? I don’t think we have a commonly shared concept of what that would be. That’s why I’m pointing to some explicit alternatives in that post.
Compared to what?
If you want an agentic system (and I think many humans do, because agents can get things done), you’ve got to give it goals somehow. RL is one way to do that. The question of whether that’s less safe isn’t meaningful without comparing it to another method of giving it goals.
The method I think is both safer and implementable is giving goals in natural language, to a system that primarily “thinks” in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:
Goals selected from learned knowledge: an alternative to RL alignment
I get conservation of expected evidence. But the distribution of belief changes is completely unconstrained.
Going from the class of martingales to the subclass of Brownian motion is arbitrary, and the choice of 1% update steps is a second arbitrary, unjustified assumption.
I think asking about the likely possible evidence paths would improve our predictions.
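To illustrate with a toy simulation (the numbers are arbitrary): conservation of expected evidence only pins down the mean of each update, not its distribution. Both processes below are martingales, but their paths look nothing alike:

```python
import random

def brownian_step(p: float) -> float:
    """The post's assumption: small symmetric moves of 1%."""
    return min(max(p + random.choice([-0.01, 0.01]), 0.0), 1.0)

def jumpy_step(p: float, q: float = 0.01) -> float:
    """Rare decisive evidence: jump to 1 with probability q, otherwise
    drift down. Expected update: q*1 + (1-q)*(p-q)/(1-q) = p, so this
    is also a martingale (boundary clipping aside)."""
    if random.random() < q:
        return 1.0
    return max((p - q) / (1 - q), 0.0)
```

Starting from p = 0.5, the first process takes a few thousand steps on average to reach certainty and swings through intermediate values constantly; the second sits near its prior, drifting slowly downward, and occasionally snaps straight to 1. Any prediction about how many 10% swings to expect depends entirely on which shape the evidence stream has.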
You spelled it conversation of expected evidence. I was hoping there was another term by that name :)
But… Why would p(doom) move like Brownian motion until stopping at 0 or 1?
I don’t disagree with your conclusions; there’s a lot of evidence coming in, and if you’re spending full time or even part time thinking about alignment, there are a lot of important updates to make from it. But assuming a random walk seems wrong.
Is there a reason that a complex, structured unfolding of reality would look like a random walk?
I think this is quite similar to my proposal in Capabilities and alignment of LLM cognitive architectures.
I think people will add cognitive capabilities to LLMs to create fully capable AGIs. One such important capability is executive function. That function is loosely defined in cognitive psychology, but it is crucial for planning among other things.
I do envision such planning looking loosely like a search algorithm, as it does for humans. But it’s a loose search algorithm, working in the space of statements made by the LLM about possible future states and action outcomes. So it’s more like a tree of thought or graph of thought than any existing search algorithm, because the state space isn’t well defined independently of the algorithm.
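A minimal sketch of what that loose search might look like, with `llm()` as a hypothetical stand-in for the model. Note that the “states” here are just strings the model generated, so the state space only exists as the algorithm unfolds:

```python
import heapq

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for the base model

def loose_search(goal: str, start: str, width: int = 3, depth: int = 4) -> str:
    """Best-first search over natural-language 'states' the LLM invents."""
    frontier = [(0.0, start)]  # (negative score, state description)
    for _ in range(depth):
        _, state = heapq.heappop(frontier)
        # The LLM proposes successors: actions and their predicted outcomes.
        outcomes = llm(
            f"Goal: {goal}\nCurrent state: {state}\n"
            f"List {width} possible actions, one per line, each with its "
            f"predicted outcome."
        ).splitlines()[:width]
        for outcome in outcomes:
            # Assumes the model returns a bare number for the rating.
            score = float(llm(f"On a 0-1 scale, how close is this to the "
                              f"goal '{goal}'?\n{outcome}"))
            heapq.heappush(frontier, (-score, outcome))
    return min(frontier)[1]  # most promising state found
```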
That all keeps things more dependent on the LLM black box, as in your final possibility.
At least I think that’s the analogy between the proposals? I’m not sure.
I think the pushback to both of these is roughly: this is safer how?
I don’t think there’s any way to strictly formalize not harming humans. My answer is halfway between that and your “sentiment analysis in each step of planning”. I think we’ll define rules of behavior in natural language (including not harming humans, but probably much more elaborate than that), and implement both internal review, like your sentiment analysis but more elaborate, and external review by humans aided by tool AI (doing something like sentiment analysis), in a form of scalable oversight.
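Concretely, internal review might look something like this minimal sketch (the `llm()` call and the rule text are placeholders I made up, not a real spec):

```python
RULES = """1. Don't harm humans or manipulate them.
2. Stay within the scope of the task you were given.
3. Flag any irreversible or high-stakes action for human review."""

def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for the base model

def internal_review(plan_steps: list[str]) -> list[str]:
    """Return steps needing external review (humans aided by tool AI)."""
    flagged = []
    for step in plan_steps:
        verdict = llm(
            f"Rules of behavior:\n{RULES}\n\nPlanned step: {step}\n"
            f"Does this step violate or risk violating any rule? "
            f"Answer OK, or FLAG followed by the reason."
        )
        if verdict.strip().startswith("FLAG"):
            flagged.append(f"{step} :: {verdict}")
    return flagged  # empty means the plan passed internal review
```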
I’m curious if I’m interpreting your proposal correctly. It’s stated very succinctly, so I’m not sure.
Yeah. Well, since it was addressing a tribe of nomadic herders in prehistoric times, that in itself is a good thing :)
At the core, this is a reminder to not publish things that will help more with capabilities than alignment. That’s perfectly reasonable.
The tone of the post suggests erring on the side of “safety” by not publishing things that have an uncertain safety/capabilities balance. I hope that wasn’t the intent.
Because that does not make sense. Anything that advances alignment more than capabilities, in expectation, should be published.
You have to make a difficult judgment call for each publication. Be mindful of your bias in wanting to publish to show off your work and ideas. Get others’ insights if you can do so reasonably quickly.
But at the end of the day, you have to make that judgment call. There’s no consolation prize for saying “at least I didn’t make the world end faster”. If you’re a utilitarian, winning the future is the only goal.
(If you’re not a utilitarian, you might actually want a resolution faster so you and your loved ones have higher odds of surviving into the far future.)
Then God isn’t “good” as humans mean the term. That’s always been one possible explanation.
There’s also some more in his interview with Dwarkesh Patel just before then. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more-or-less on target.
So, here are answers to your questions, noting where I’m guessing at Shane’s thinking and where it’s my own.
This is overlapping with the standard story AFAICT, and 80% of alignment work is sort of along these lines. I think what Shane’s proposing is pretty different in an important way: it includes System 2 thinking, where almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.
How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced[?]
Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don’t think it matters much whether RLHF was used to “align” the base model, because it’s going to have implicit desires/drives from the predictive training of human text, anyway. Giving instructions to follow doesn’t need to have anything to do with RL; it’s just based on the world model, and putting those instructions as a central and recurring prompt for that system to produce plans and actions to carry out those instructions.
So, how we get a model to robustly obey the instruction text is by implementing system 2 thinking. This is “the obvious thing” if we think about human cognition. System 2 thinking would be applying something more like a tree of thought algorithm, which checks through predicted consequences of the action, and then makes judgments about how well those fulfill the instruction text. This is what I’ve called internal review for alignment of language model cognitive architectures.
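As a sketch of that check (my illustration, not Shane’s actual design; `llm()` and the prompts are hypothetical):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for the base model

def obeys_instructions(action: str, instructions: str, n_checks: int = 3) -> bool:
    """System 2 gate: predict consequences, judge them against the
    standing instruction text, and require unanimous approval."""
    consequences = llm(f"Predict the likely consequences of: {action}")
    verdicts = [
        llm(f"Instructions: {instructions}\n"
            f"Predicted consequences: {consequences}\n"
            f"Do the consequences fulfill the instructions? Answer YES or NO.")
        for _ in range(n_checks)
    ]
    return all(v.strip().upper().startswith("YES") for v in verdicts)
```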
To your second and third questions: I didn’t see answers from Shane in either the interview or that talk, but I think they’re the obvious next questions, and they’re what I’ve been working on since then. I think the answers are that we’ll try to make the instructions as scope-limited as possible, that we’ll want to carefully check how they’re interpreted before setting the AGI any major tasks, and that we’ll want to limit autonomous action to whatever degree still leaves the system effective.
Humans will want to remain closely in the loop to deal with inevitable bugs and unintended interpretations and consequences of instructions. I’ve written about this briefly here, and will soon be publishing a more thorough argument for why I think we’ll do this by default, and why I think it will actually work if it’s done relatively carefully and wisely. Following that, I’m going to write more on the System 2 alignment concept, and I’ll try to actually get Shane to look at it and say whether it’s the same thing he’s thinking of in this talk, or at least close.
In all, I think this is both a real alignment plan and one that can work (at least for technical alignment—misuse and multipolar scenarios are still terrifying), and the fact that someone in Shane’s position is thinking this clearly about alignment is very good news.
I agree with all of that, even while being sceptical that LLMs plus search will reach AGI. The lack of constraint satisfaction as the human brain does it could be a real stumbling block.
But LLMs have copied a good bit of our reasoning and therefore our semantic search. So they can do something like constraint satisfaction.
Put the constraints into a query, and the answer will satisfy those constraints. The process used is different from the human brain’s, but for every problem I can think of, the results are the same.
Now, that’s partly because every problem I can think of is one I’ve already seen solved. But my ability to do truly novel problem solving is rarely used and pretty limited. So I’m not sure the LLM can’t do just as good a job if it had a scaffolded script to explore its knowledge base from a few different angles.
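A concrete version of “put the constraints into a query”, plus the multi-angle exploration I mean by a scaffolded script (the prompt wording and angle list are just illustrative):

```python
def constrained_query(task: str, constraints: list[str]) -> str:
    """Build a prompt whose answer must satisfy explicit constraints."""
    bullets = "\n".join(f"- {c}" for c in constraints)
    return (f"{task}\nYour answer must satisfy all of these constraints:\n"
            f"{bullets}\nCheck each constraint explicitly before answering.")

# Explore the model's knowledge base from a few different angles:
ANGLES = ["first principles", "analogy to a solved problem",
          "failure modes of the obvious approach"]

def multi_angle(task: str, constraints: list[str]) -> list[str]:
    """One constrained prompt per angle, to be sent as separate queries."""
    return [constrained_query(f"Approach via {angle}: {task}", constraints)
            for angle in ANGLES]
```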
Fair enough, thank you! Regardless, it does seem like a good reason to be concerned about alignment. If you have no idea how intelligence works, how in the world would you know what goals your created intelligence is going to have? At that point, it is like alchemy—performing an incantation and hoping not just that you got it right, but that it does the thing you want.
Nothing in this post or the associated logic says LLMs make AGI safe, just safer than what we were worried about.
Nobody with any sense predicted runaway AGI by this point in history. There’s no update from other forms of AI not having worked yet.
There’s a weird thing where lots of people’s p(doom) went up when LLMs started to work well, because they found it an easier route to intelligence than they’d been expecting. If it’s easier, it happens sooner and with less thought surrounding it.
See Porby’s comment on his risk model for language model agents. It’s a more succinct statement of my views.
LLMs are easy to turn into agents, so let’s not get complacent. But they are remarkably easy to control and align, so that’s good news for aligning the agents we build from them. That doesn’t get us out of the woods, though: there are new issues with self-reflective, continuously learning agents, and there’s plenty of room for misuse and conflict escalation in a multipolar scenario, even if alignment turns out to be dead easy for anyone who bothers to try.
That is a fascinating take! I haven’t heard it put that way before. I think that perspective is a way to understand the gap between old-school agent foundations folks’ high p(doom) and new-school LLMers’ relatively low p(doom), something I’ve been working to understand and hope to publish on soon.
To the extent this is true, I think that’s great, because I expect to see some real insights on intelligence as LLMs are turned into functioning minds in cognitive architectures.
Do you have any refs for that take, or is it purely a gestalt?
Interesting, and good job publishing rather than polishing!
I really like the terminology of competence vs. intelligence.
I don’t think you want to use the term intelligence for your level 3. I think I see why you want to; but intelligence is currently an umbrella term for any cognitive capacity, so you’re invoking different intuitions when you use it for one particular cognitive capacity.
In either case, I think you should draw the analogy more closely with Level 3 and problem-solving. At least if you think it exists.
Suppose I’m a hunter-gatherer, and there is fruit high up in a tree. This tree has thorns, so my usual strategy of climbing it and shaking branches won’t work. If I figure out, through whatever process of association, simulation, and trial and error, that I can get a long branch from another tree and then knock the fruit down, I can incorporate that into my level 2 cognition, and from there into level 1. This type of problem-solving is also probably the single cognitive ability most often referred to as intelligence, which justifies your use of the term for that level. If I’m right that you’d agree with all of that, making the analogy explicit could make the terminology more intuitive to the reader.
In any case, I’m glad to see you thinking about cognition in relation to alignment. It’s obviously crucial; I’m unclear if most people just aren’t thinking about it, or if it’s all considered too infohazardous.
Interesting! I’m not following everything, but it sounds like you’re describing human cognition for the most part.
I found it interesting that you used the phrase “constraint satisfaction”. I think this concept is crucial for understanding human intelligence; but it’s not used very widely. So I’m curious where you picked it up.
I agree with your conclusion on the alignment section: these are low-resolution ideas that seem worth fleshing out.
Good job putting this out there without obsessively polishing it. That shares at least some of your ideas with the rest of us, so we can build on them in parallel with you polishing your understanding and your presentation.
It’s helpful to include a summary with linkposts.
So here’s a super quick one. I didn’t listen to it closely, so I could’ve missed something.
It’s about the article No “Zero-Shot” Without Exponential Data
Here’s the key line from the abstract:
So, we might not continue to get better performance if we need exponentially larger datasets to get small linear improvements. This seems quite plausible, unless somebody comes up with some sort of clever bootstrapping in which automatic labeling of images and videos, with a little human feedback, creates useful datasets of unlimited size.
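To spell out the arithmetic behind that worry: if performance scales log-linearly with dataset size,

$$P(N) = a \log N + b \quad\Rightarrow\quad P(kN) - P(N) = a \log k,$$

so every fixed gain $\Delta P$ costs a constant multiplicative factor $k = e^{\Delta P / a}$ in data: linear improvement, exponential data. (The log-linear form here is my reading of the paper’s claim, not their exact fit.)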
I don’t think this is going to cause much of a slowdown in AGI progress, because we don’t need much more progress on foundation models to build scaffolded agentic cognitive architectures that use System 2-type cognition to gauge their accuracy and the importance of each judgment, and use multiple tries on multiple models for important cognitive acts. That’s how humans are as effective as we are: we monitor and double-check our own cognition when appropriate.