Seth Herd

Karma: 6,639

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I’m applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.

If you’re new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.

Important posts:

On the strategic overview of AGI risk:
- TBA, next post
- If we solve alignment, do we die anyway?
  - Risks of human-controlled AGI
On the psychology of alignment as a field:
- Cruxes of disagreement on alignment difficulty
- Motivated reasoning/confirmation bias as the most important cognitive bias
On technical alignment of LLM-based AGI agents:
- Seven sources of goals in LLM agents brief problem statement
- System 2 Alignment on how developers will try to align LLM agent AGI
On LLM-based agents as a route to takeover-capable AGI
- Brief argument for short timelines being plausible
- Capabilities and alignment of LLM cognitive architectures
  - Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
On AGI alignment more broadly
- Instruction-following AGI is easier and more likely than value aligned AGI
- Goals selected from learned knowledge: an alternative to RL alignment
On communicating AGI risks:
- Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
- AI scares and changing public beliefs

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we’re not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we’ll have smarter-than-human AIs soon. So we’d better get ready. If their goals don’t align well enough with ours, they’ll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we’re most likely to develop first.

That doesn’t mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won’t get many tries. If it were up to me I’d Shut It All Down, but I don’t see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they’re more autonomous and competent than humans. We’d use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I’ve focused on the interactions between different brain neural networks that are needed to explain complex thought. Here’s a list of my publications.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I’m incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.

More on approach

The field of AGI alignment is “pre-paradigmatic.” So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can’t afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans’ cognitive capacities—a “real” artificial general intelligence that will soon be able to outsmart humans.

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can’t misunderstand or re-interpret (value alignment mis-specification), we’ll continue doing with the alignment target developers currently use: Instruction-following. It’s counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there’s no logical reason this can’t be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions.

I increasingly suspect we should be actively working to build such intelligences. It seems like our our best hope of survival, since I don’t see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won’t “think” in English. Thus far, I haven’t been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven’t embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they’d have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate of our long-term survival as a species is in the 50% range, too complex to call. That’s despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Seth Herd Apr 24, 2025, 9:46 PM
11 points
2
on: Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
But success for most things doesn’t require just one correct solution among k attempts, right? For the majority of areas without easily checkable solutions, higher odds of getting it right on the first try or fres tries is both very useful and does seem like evidence of reasoning. Right? Or am I missing something?

Reducing the breadth of search is a substantial downside if it’s a large effect. But reliably getting the right answer instead of following weird paths most of which are wrong seems like the essence of good reasoning.

Seth Herd Apr 22, 2025, 7:48 PM
4 points
0
on: There is no Red Line
By this criteria, did humanity ever have control? First we had to forage and struggle against death when disease or drought came. Then we had to farm and submit to the hierarchy of bullies who offered “protection” against outside raiders at a high cost. Now we have more ostensible freedom but misuse it on worrying and obsessively clicking on screens. We will probably do more of that as better tools are offered.

But this is an an entirely different concern than AGI taking over. I’m not clear what mix of these two you’re addressing. Certainly AGIs that want control of the world could use a soft and tricky strategy to get humans to submit. Or they could use much harsher and more direct strategies. They could make us fire the gun we have pointed at our own heads by spoofing us into launching nukes, then using the limited robotics to rebuild the infrastructure they need.

The solution is the same for either type of disempowerment: don’t build machines smarter than you if you can’t be sure you can specify their goals (wants) for certain and with precision.

How superhuman machines will take over is an epilogue after the drama is over. The drama hasn’t happened yet. It’s not yet time to write anticipatory postmortems, unless they function as a call to arms or a warning against foolish action. The trends are in motion but we have not yet crossed the red line of making AGI that has the intelligence and the desire to disempower us, whether by violence or subtle trickery. Help us change the trends before we cross that red line.

Edit: if you’re addressing AI accidentally taking control by creating new pleasures that help entrench existing power structures, that’s entirely a different issue. The way that AI could empower some humans to take advantage of others is interesting. I don’t worry about that issue much because I’m too busy worrying about the trend toward building superintelligent machines that want to disempower us and will do so one way or another by outsmarting us, whether their plans unfold quickly or slowly.

Seth Herd Apr 21, 2025, 8:54 PM
1 point
0
in reply to: Gunnar Carlsson’s comment on: Improving CNNs with Klein Networks: A Topological Approach to AI
You’d probably get more enthusiasm here if you led the article with a clear statement of its application for safety. We on LW are typically not enthusiastic about capabilities work in the absence of a clear and strong argument for how it improves safety more than accelerates progress toward truly dangerous AGI. If you feel differently, I encourage you to look with an open mind at the very general argument for why creating entities smarter than us is a risky proposition.

Seth Herd Apr 21, 2025, 8:49 PM
4 points
0
in reply to: Viliam’s comment on: Viliam’s Shortform
I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although It’s hard to predict exactly how).

I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn’t complete in any section, and monitoring all or some of the conversation to see if the LLM is behaving as it should or if it’s been jailbroken. Having full text sent to the developer and analyzed for risks would problematic for privacy, but many would accept those terms to use a really useful system.

Seth Herd Apr 21, 2025, 8:35 PM
8 points
5
in reply to: Tenoke’s comment on: aog’s Shortform
I just listened to Ege and Tamay’s 3-hour interview by Dwarkesh. They make some excellent points that are worth hearing, but they do not stack up to anything like a 25-year-plus timeline. They are not now a safety org if they ever were.

Their good points are about bottlenecks in turning intelligence into useful action. These are primarily sensorimotor and the need to experiment to do much science and engineering. They also address bottlenecks to achieving strong AGI, mostly compute.

In my mind this all stacks up to convincing themselves timelines are long so they can work on the exciting project of creating systems capable of doing valuable work. Their long timelines also allow them to believe that adoption will be slow, so job replacement won’t cause a disastrous economic collapse.

Seth Herd Apr 21, 2025, 8:28 PM
2 points
0
in reply to: 1a3orn’s comment on: aog’s Shortform
Not taking critiques of your methods seriously is a huge problem for truth-speaking. What well-informed critiques are you thinking of? I want to make sure I’ve taken them on board.

Seth Herd Apr 21, 2025, 4:56 PM
4 points
0
in reply to: tlevin’s comment on: tlevin’s Shortform
I second the socks-as-sets move.

The other advantage is getting on-avetage more functional socks at the cost of visual variety.

IMO an important criteria for a sock is its odor resistance. This seems to vary wildly between socks of similar price and quality. Some have antimicrobial treatments that last a very long time, others do not. And it’s often not advertised. Reviews rarely include this information.

I don’t have a better solution than buying one pair or set before expanding to a whole set. This also lets you choose socks.that feel good to wear.

Seth Herd Apr 18, 2025, 6:23 PM
2 points
0
in reply to: jbash’s comment on: Chris_Leong’s Shortform
I don’t think this is true. People can’t really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it’s expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little.

Seth Herd Apr 16, 2025, 3:02 PM
2 points
−4
in reply to: Jonas Hallgren’s comment on: ASI existential risk: Reconsidering Alignment as a Goal
Yeah being excited that Chiang and Rajaniemi are on board was one of my reactions to this excellent piece.

If you haven’t read Quantum Thief you probably should.

Seth Herd Apr 16, 2025, 6:32 AM
5 points
1
in reply to: Kaj_Sotala’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Interesting! Nonetheless, I agree with your opening statement that LLMs learning to do any of these things individually doesn’t address the larger point that the have important cognitive gaps and fail.to generalize in ways that humans can.

Seth Herd Apr 15, 2025, 8:47 PM
4 points
0
in reply to: Kaj_Sotala’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Right, I got that. To be clear, my argument is that no breakthroughs are necessary, and further that progress is underway and rapid on filling in the existing gaps in LLM capabilities.

Memory definitely doesn’t require a breakthrough. Add-on memory systems (RAG and fine-tuning, as well as more sophisticated context management through prompting; CoT RL training effectively does this too).

Other cognitive capacities also exist in nascent form and so probably require no breakthroughs. Although I think no other external cognitive systems are needed given the rapid progress in multimodal and reasoning transformers.

Seth Herd Apr 15, 2025, 7:40 PM
12 points
2
on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
This is great; I think it’s important to have this discussion. It’s key for where we put our all-too-limited alignment efforts.
I roughly agree with you that pure transformers won’t achieve AGI, for the reasons you give. They’re hitting a scaling wall, and they have marked cognitive blindspots like you document here, and Thane Ruthenis argues for convincingly in his bear case. But transformer-based agents (which are simple cognitive architectures) can still get there- and I don’t think they need breakthroughs, just integration and improvement. And people are already working on that.
To put it this way: humans have all of the cognitive weaknesses you identify, too. But we can use online learning (and spatial reasoning) to overcome them. We actually generalize only rarely and usually with careful thought. Scaffolded and/or RL-trained CoT models can do that too. Then we remember our conclusions and learn from them. Add-on memory systems and fine-tuning setups can replicate that.
Generalization: It’s a sort of informal general conclusion in cognitive psychology that “wow are people bad at generalizing”. For instance, if you teach them a puzzle, then change the names and appearances, it looks like they don’t apply the learning at all. These are undergraduates who are usually unpaid, so they’re not doing the careful thinking it requires humans to generalize knowledge. LLMs of the generation you’re testing don’t think carefully either (you note that in a couple of places “it’s like it’s not thinking” which is exactly right), but CoT RL on a variety of reward models is making disturbingly rapid progress at teaching them when and how to think carefully—and enabling generalization. Good memory systems will be necessary for them to internalize the new general principles they’ve learned, but those might be good enough already, and if not, they probably will be very soon.
I think you’re overstating the case, and even pure transformers could easily reach human-level “Real AGI” that’s dangerous as hell. Continued improvements in effectively using long context windows and specific training for better CoT reasoning do enable a type of online learning. Use Gemini 2.5 Pro for some scary demonstrations, although it’s not there yet. Naked transformers probably won’t reach AGI, which I’m confident of mostly because adding outside memory (and other cognitive) systems is so much easier and already underway.
My LLM cognitive architectures article also discusses other add-on cognitive systems like vision systems that would emulate how humans solve tic-tac-toe, sliding puzzles etc. - but now I don’t think transformers need those, just more multimodal training. (Note that Claude and o1 preview weren’t multimodal, so were weak at spatial puzzles. If this was full o1, I’m surprised.) They’ll still be relatively bad at fundamentally spatial tasks like yours and ARC-AGI, but they can fill that in by learning or being taught the useful concepts/tricks for actually useful problem spaces. That’s what a human would do if they happened to be have poor spatial cognition and had to solve spatial tasks.
I think transformers not reaching AGI is a common suspicion/hope amongst serious AGI thinkers. It could be true, but it’s probably not, so I’m worried that too many good thinkers are optimistically hoping we’ll get some different type of AGI. It’s fine and good if some of our best thinkers focus on other possible routes to AGI. It is very much not okay if most of our best thinkers incorrectly assume they won’t. Prosaic alignment work is not sufficient to align LLMs with memory/online learning, so we need more agent foundations style thinking there.
I address how we need to go beyond static alignment of LLMs to evolving learned systems of beliefs, briefly and inadequately in my memory changes alignment post and elsewhere. I am trying to crystallize the reasoning in a draft post with the working title “if Claude achieves AGI will it be aligned?”

Seth Herd Apr 15, 2025, 7:09 PM
25 points
2
in reply to: Noosphere89’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
You beat me to it. Thanks for the callout!

Humans are almost useless without memory/in-context learning. It’s surprising how much LLMs can do with so little memory.

The important remainder is that LLM-based agents will probably have better memory/online learning as soon as they can handle it, and it will keep getting better, probably rapidly. I review current add-on memory systems in LLM AGI will have memory, and memory changes alignment. A few days after I posted that, OpenAI announced that they had given ChatGPT memory over all its chats, probably with a RAG and summarization system. That isn’t the self-directed memory for agents that I’m really worried about, but it’s exactly the same technical system you’d use for that purpose. Fortunately it doesn’t work that well—yet.

I wrote about this in Capabilities and alignment of LLM cognitive architectures two years ago, but I wasn’t sure how hard to push the point for fear of catalyzing capabilities work.

Now it’s obvious that many developers are aware and explicitly focused on the virtues of online learning/memory.

This is a great post because LLMs being a dead-end is a common suspicion/hope among AGI thinkers. It isn’t likely to be true, so it should be discussed. More in a separate comment.

Seth Herd Apr 15, 2025, 3:04 PM
5 points
2
in reply to: eva_’s comment on: A Dissent on Honesty
I think this issue of the difficulty of making each decision about lying as an independent decision is the main argument for treating it as a virtue ethics or deontological issue.

I think you make many good points in the essay arguing that one should not simply follow a rule of honesty. I think that in practice the difference can be split, and that is in fact what most rationalists and other wise human beings do. I also think it is highly useful to write this essay on the mini virtues of lying, so that that difference can be split well.

There are many subtle downsides to lying, so simply adding a bit of a fudge factor to the decision that weighs against it is one way to avoid taking forever to make that decision. You’ve talked about practicing making the decision quickly, and I suspect that is the result of that practice.

This is a separate issue, but your point about being technically correct is also a valuable one. It is clearly not being honest to say things you know will cause the listener to form false beliefs.

I have probably aired on the side of honesty as have many rationalists, treating it not as an absolute deontological issue and being willing to fudge a little on the side of technically correct to maintain social graces in some situations. I enjoy a remarkable degree of trust from my true friends, because they know me to be reliably honest. However, I have probably suffered reputational damages from acquaintances and failed friends, for whom my exceptional honesty has proven hurtful. Those people don’t have adequate experience with me to see that I am reliably honest and appreciate the advantages of having a friend who can be relied upon to tell the truth. That’s because they’ve ceased being my friend when they’ve been either insulted or irritated by my unhelpful honesty.

There is much here I agree with and much I disagree with. But I think this topic is hugely valuable for the rationalist community, and you’ve written it up very well. Nice work!

Seth Herd Apr 14, 2025, 7:22 PM
5 points
2
in reply to: Lucius Bushnaq’s comment on: Eli’s shortform feed
Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Having full confidence that we either can or can’t train an agent to have a desired goal both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.

Seth Herd Apr 14, 2025, 12:45 PM
0 points
0
on: Thoughts on the Double Impact Project
This doesn’t work as advertised.

If I care about the election more than other charities, I won’t give to such a fund. My dollars will do more towards the campaign on average if I give directly to my side. This effect is trivial if the double impact group is small but very large if it is most donations.

In an extreme case, suppose that most people give to double impact and the two campaigns are tied $1b - $1b. One donor gives their $1m directly to their side. It is the only money actually spent on advertising; that side has a large advantage in ratio of funds spent.

More realistic scenarios yield smaller average ratios, but always less expected return for your preferred campaign if you give to it vs, double impact.

Seth Herd Apr 14, 2025, 6:52 AM
7 points
0
in reply to: Eli Tyre’s comment on: Eli’s shortform feed
I have the same question. My provisional answer is that it might work, and even if it doesn’t, it’s probably approximately what someone will try, to the extent they really bother with real alignment before it’s too late. What you suggest seems very close to the default path toward capabilities. That’s why I’ve been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.
I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights go down.
What you’ve said above is essentially what I say in Instruction-following AGI is easier and more likely than value aligned AGI. Instruction-following (IF) is a poor man’s corrigibility—real corrigibility as the singular target seems safer. But instruction-following is also arguably already the single largest training objective in functional terms for current-gen models—a model that won’t follow instructions is considered a poor model. So making sure it’s the strongest factor in training isn’t a huge divergence from the default course in capabilities.
Constitutional AI and similar RL methods are one way of ensuring that’s the model’s main goal. There are many others, and some might be deployed even if devs want to skimp on alignment. See System 2 Alignment or at least the intro for more.
There are still ways it could go wrong, of course. One must decide: corrigible to whom? You don’t want full-on-AGI following orders from just anyone. And if it’s a restricted set, there will be power struggles. But hey, technically, you had (personal-intent-) aligned AGI. One might ask: If we solve alignment, do we die anyway? (I did). The answer I’ve got so far is maybe we would die anyway, but maybe we wouldn’t. This seems like our most likely path, and also quite possibly also our best chance (short of a global AI freeze starting soon).
Even if the base model is very well aligned, it’s quite possible for the full system to be unaligned. In particular, people will want to add online learning/memory systems, and let the models use them flexibly. This opens up the possibility of them forming new beliefs that change their interpretation of their corrigibility goal; see LLM AGI will have memory, and memory changes alignment. They might even form beliefs that they have a different goal altogether, coming from fairly random sources but etched into their semantic structure as belief that is functionally powerful even where it conflicts with the base model’s “thought generator”. See my Seven sources of goals in LLM agents.
Sorry to go spouting my own writings; I’m excited to see someone else pose this question, and I hope to see some answers that really grapple with it.

Seth Herd Apr 13, 2025, 11:26 PM
2 points
0
in reply to: Dylan Richardson’s comment on: A Bear Case: My Predictions Regarding AI Progress
Definitely. Excellent point. See my short bit on motivated reasoning, in lieu of the full post I have on the stack that will address its effects on alignment research.
I frequently check how to correct my timelines and takes based on potential motivated reasoning effects for myself. The result is usually to broaden my estimates and add uncertainty, because it’s difficult to identify which direction MR might’ve been pushing me during all of the mini-decisions that led to forming my beliefs and models. My motivations are many and which happened to be contextually relevant at key decision points is hard to guess.
On the whole I’d have to guess that MR effects are on average larger on long timelines and low p(dooms). They both allow us to imagine a sunny near future, and to work on our preferred projects instead of panicking and having to shift to work that can help with alignment if AGI happens soon. Sorry. This is worth a much more careful discussion, that’s just my guess in the absence of pushback.

Seth Herd Apr 13, 2025, 10:42 PM
2 points
0
on: Outer Alignment
I jsut realized that I’d embarassingly misunderstood outer alignment for a long time, and it was based directly on this wikitag. I’d been including the wise selection of an alignment target as part of outer alignment, which it is not by almost all considered usage of the term. The phrasing in first paragraph firmly implied it was. So I edited that and included a new very brief set of definitions at the bottom. Anyone is most welcome to change or eliminate any of that, except that I’d love to know why if you’re reverting it to the version that seemed flat wrong.

Seth Herd Apr 13, 2025, 9:44 PM
3 points
0
on: Breaking down the MEAT of Alignment
This is really good, and I’m sad I missed seeing and big upvoting it when it had a chance for frontpage. I think having these basic categories is useful for discussing alignment. I’ll be referencing it for the concept of alignment targets.
You might be interested in my Conflating value alignment and intent alignment is causing confusion on terminology.