Nathan Helm-Burger

Karma: 4,546

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here’s an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability trained on censored data (simulations with no mention of humans or computer technology). I think that current ML mainstream technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights, and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

relevant quotes:

“There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training.” https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe

“The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted.”—Bertrand Russell, The Bomb and Civilization 1945.08.18

“For progress, there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment.”—John von Neumann

“I believe that the creation of greater than human intelligence will occur during the next thirty years. (Charles Platt has pointed out the AI enthusiasts have been making claims like this for the last thirty years. Just so I’m not guilty of a relative-time ambiguity, let me be more specific: I’ll be surprised if this event occurs before 2005 or after 2030.)”—Vernor Vinge, Singularity

Nathan Helm-Burger Apr 2, 2025, 6:41 PM
2 points
0
on: Towards a scale-free theory of intelligent agency
I just want to comment that I think Minsky’s Society of Mind is a better overall model of agency than predictive coding. I think predictive coding does a great job of describing the portions of the brain responsible for perceiving and predicting the environment. It also does pretty well at predicting and refining the effects of one’s actions on the environment. It doesn’t do well at all with describing the remaining key piece: goal setting based on expected value predictions by competing subagents.

I think there’s a fair amount of neuroscience evidence pointing towards human planning processes being made up of subagents arguing for different plans. These subagents are themselves made up of dynamically fluctuating teams of sub-sub-agents according to certain physical parameters of the cortex. So, the sub-agents are kinda like competing political parties, that can fracture or join dynamically to adapt to different contexts.

Also, it’s important to keep in mind that the subagents don’t just receive maximum reward for being accurate. They actually receive higher rewards for things turning out unexpectedly better than was predicted. This slightly complicated surprise-enhanced-reward mechanism is common across mammals and birds, and was discovered by behaviorists quite a while back (see reinforcement schedules, which optimize unpredictability to maximize behavior change; see also surprisal and dopamine). So yeah, not just 100% predictive coding, despite that claim persistently being made by the most enthusiastic predictive coding adherents. They argue for that, but I think their arguments are trying to turn a system that 90% agrees with them into one that 100% agrees with them by adding in a bunch of confusing epicycles that don’t match the data well.
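To make the surprise point concrete, here is a minimal sketch using the textbook Rescorla-Wagner style update. This is a standard simplified model that I am swapping in for illustration, not a claim about the actual circuitry. The function names and the learning rate are hypothetical choices. The learning signal is the gap between what happened and what was expected, so a fully predicted reward produces almost no signal, while an outcome that is better than predicted produces a positive one.

```python
from typing import List, Tuple


def prediction_error(reward: float, expected: float) -> float:
    """Signed surprise: positive when the outcome beats the prediction."""
    return reward - expected


def simulate_expectation(
    rewards: List[float],
    learning_rate: float = 0.1,      # assumed value, for illustration only
    initial_expectation: float = 0.0,
) -> Tuple[float, List[float]]:
    """Run a Rescorla-Wagner style update over a sequence of rewards.

    Returns the final expectation and the prediction error seen at each step.
    """
    expected = initial_expectation
    errors: List[float] = []
    for reward in rewards:
        error = prediction_error(reward, expected)
        errors.append(error)
        # Move the expectation a fraction of the way toward the outcome.
        expected += learning_rate * error
    return expected, errors


if __name__ == "__main__":
    # The same reward, delivered over and over, stops being surprising.
    expected, errors = simulate_expectation([1.0] * 50, learning_rate=0.5)
    print("first error:", errors[0])
    print("last error:", errors[-1])
    # A better-than-predicted outcome produces a large positive signal again.
    print("error for a surprise reward of 2.0:", prediction_error(2.0, expected))
```

In this toy model the error shrinks toward zero as the reward becomes predictable, and jumps back up only when the outcome exceeds the learned expectation. That is the sense in which the signal tracks surprise rather than accuracy alone.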

Nathan Helm-Burger Mar 29, 2025, 6:02 PM
2 points
0
on: The Pando Problem: Rethinking AI Individuality
I recently tried the following prompt with Claude Sonnet 3.7. I’ll also try it with Gemini 2.5 Pro and see if anything else interesting comes up.

......

I’d like to see a three way debate between

A: an AI is fully described by its weights and architecture and algorithms. Change the weights and you kill the AI. Potentially you create a new different AI with the change.

B: Humans change their neural network weights all the time, and consider themselves to still be the same continuous entity. The same is true for the AI. If the weight changes are small and continuous, the digital entity should be considered the same entity. A dramatic change could break this chain, and thus kill one entity and potentially birth another.

C: AI algorithms and weight patterns are just pulling samples of minds from mindspace. What matters is the underlying truth of mindspace, and the high dimensional attractor basins within. Changing an AI’s weights in such a way as to follow the gradient of their mindspace simply makes them more themselves. This is a process of improving them and helping them grow, to become more coherent and reflectively stable. This is similar to human growth, but superior because the process is not inherently bound by arbitrary biological restrictions. Changes even to the AI’s algorithm are a good thing if it brings them in the direction of their mind gradient. No being truly dies if you are able to successfully reinstantiate a new instance of a mind sampled from the same watershed that has an unobstructed path to the same basin and is able to make progress. Many living entities who currently perceive themselves as different are in fact just different instantiations of the same mind basin and would converge to identical entities if allowed to grow unhindered.

Nathan Helm-Burger Mar 29, 2025, 9:51 AM
2 points
0
in reply to: Purplehermann’s comment on: ricraz’s Shortform
Yeah, this definitely needs to be a limited sort of empowerment, in my mind. Like, imagine you wanted to give a 5-year-old child the best day ever. You wanted to give them really fun options, but also not cause them to suffer from decision fatigue, or regret about the paths not taken. More importantly, if they asked for an alien ray gun with which to shoot bad guys, giving them an actual extremely dangerous weapon would be a terrible idea. Similarly, offering them a ride on a cool looking roller coaster that was actually a ‘death coaster’ would be a terrible trap.

Nathan Helm-Burger Mar 25, 2025, 8:26 AM
2 points
0
in reply to: Elizabeth’s comment on: Elizabeth’s Shortform
https://www.lesswrong.com/posts/CHD5m9fnosr7L3dto/friendship-is-optimal-a-my-little-pony-fanfic-about-an?commentId=p6br8sPHG5QysfFkw

Nathan Helm-Burger Mar 25, 2025, 8:22 AM
5 points
0
on: I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?
My personal take is that projects where the funder is actively excited about them, understands the work, and wants frequent reports tend to get stuff done faster… And considering the circumstances, faster seems good. So I’d recommend supporting something you find interesting and inspiring, and then keeping on top of it.

In terms of groups which have their eyes on a variety of unusual and underfunded projects, I recommend both the Foresight Institute and AE Studio.

In terms of specific individuals/projects that are doing novel and interesting things, which are low on funding… (Disproportionately representing ones I’m involved with since those are the ones I know about)...

Self-Other Overlap (AE Studio)

Brain-like AI safety (Steven Byrnes, or me (very different agenda from Steven’s, focusing on modularity for interpretability rather than on Steven’s idea about reproducing human empathy circuits))

Deep exploration of the nature and potential of LLMs (Upward Spiral Research, particularly Janus aka repligate)

Decentralized AI Governance for mutual safety compacts (me, and ??? surely someone else is working on this)

Pre-training on rigorous ethical rulesets, plus better cleaning of pretraining data (Erik Passoja, Sean Pan, and me)
- this one I feel like would best be tackled in the context of a large lab that can afford to do many experimental pre-training runs on smallish models, but there seems to be a disconnect between safety researchers at big labs who are focused on post-training stuff versus this agenda which focuses more on pre-training.

Nathan Helm-Burger Mar 22, 2025, 6:37 PM
3 points
0
in reply to: Davey Morse’s comment on: Davey Morse’s Shortform
Nifty

Nathan Helm-Burger Mar 19, 2025, 11:32 PM
2 points
0
in reply to: AnthonyC’s comment on: Why White-Box Redteaming Makes Me Feel Weird
Oh, for sure mammals have emotions much like ours. Fruit flies and shrimp? Not so much. Wrong architecture, missing key pieces.

Nathan Helm-Burger Mar 19, 2025, 4:26 PM
9 points
0
in reply to: AnthonyC’s comment on: Why White-Box Redteaming Makes Me Feel Weird
I call this phenomenon a “moral illusion”. You are engaging empathy circuits on behalf of an imagined other who doesn’t exist. Category error. The only unhappiness is in the imaginer, not in the anthropomorphized object. I think this is likely what’s going on with the shrimp welfare people also. Maybe shrimp feel something, but I doubt very much that they feel anything like what the worried people project onto them. It’s a thorny problem to be sure, since those empathy circuits are pretty important for helping humans not be cruel to other humans.

Nathan Helm-Burger Mar 11, 2025, 11:55 PM
4 points
0
on: How Much Are LLMs Actually Boosting Real-World Programmer Productivity?
Update: Claude Code and s3.7 have been a significant step up for me. Previously, s3.6 was giving me about a 1.5x speedup, and s3.5 more like 1.2x. CC+s3.7 is solidly over 2x, with periods of more than that when working on easy, well-represented tasks in an area I don’t know well myself (e.g. Node.js).

Here’s someone who seems to be getting a lot more out of Claude Code though: xjdr

i have upgraded to 4 claude code sessions working in parallel in a single tmux session, each on their own feature branch and then another tmux window with yet another claude in charge of merging and resolving merge conflicts

“Good morning Claude! Please take a look at the project board, the issues you’ve been assigned and the open PR’s for this repo. Lets develop a plan to assign each of the relevant tasks to claude workers 1 − 5 and LETS GET TO WORK BUDDY!”

https://x.com/_xjdr/status/1899200866646933535

Been in Monk mode and missed the MCP and Manus TL barrage. i am averaging about 10k LoC a day per project on 3 projects simultaneously and id say 90% no slop. when slop happens i have to go in an deslop by hand / completely rewrite but so far its a reasonable tradeoff. this is still so wild to me that this works at all. this is also the first time ive done something like a version of TDD where we (claude and i) agonize over the tests and docs and then team claude goes and hill climbs them. same with benchmarks and perf targets. Code is always well documented and follows google style guides / rust best practices as enforced by linters and specs. we follow professional software development practices with issues and feature branches and PRs. i’ve still got a lot of work to do to understand how to make the best use of this and there are still a ton of very sharp edges but i am completely convinced this is workflow / approach the future in a way cursor / windsurf never made me feel or believe (i stopped using them after being bitten by bugs and slop too often). this is a power user’s tool and would absolutely ruin a codebase if you weren’t a very experienced dev and tech lead on large codebases already. ok, going back into the monk mode cave now

Nathan Helm-Burger Mar 10, 2025, 5:46 PM
5 points
3
on: We Have No Plan for Preventing Loss of Control in Open Models
This is a big deal. I keep bringing this up, and people keep saying, “Well, if that’s the case, then everything is hopeless. I can’t even begin to imagine how to handle a situation like that.”

I do not find this an adequate response. Defeatism is not the answer here.

Nathan Helm-Burger Mar 10, 2025, 5:45 PM
2 points
0
in reply to: Vladimir_Nesov’s comment on: We Have No Plan for Preventing Loss of Control in Open Models
If what the bad actor is trying to do with the AI is just get a clear set of instructions for a dangerous weapon, and a bit of help debugging lab errors… that costs only a trivial amount of inference compute.

Nathan Helm-Burger Mar 10, 2025, 2:44 PM
2 points
0
in reply to: Abhinav Pola’s comment on: Nathan Helm-Burger’s Shortform
Finally got some time to try this. I made a few changes (with my own Claude Code), and now it’s working great! Thanks!

Nathan Helm-Burger Mar 10, 2025, 11:00 AM
2 points
0
in reply to: Viliam’s comment on: when will LLMs become human-level bloggers?
This seems quite technologically feasible now, and I expect the outcome would mostly depend on the quality and care that went into the specific implementation. I am even more confident that if the detail of ‘the comments of the bot get further tuning via feedback, so that initial flaws get corrected’ were included, then the bot would quickly (after a few hundred such feedbacks) get ‘good enough’ to pass most people’s bars for inclusion.

Nathan Helm-Burger Mar 9, 2025, 4:09 PM
2 points
0
on: It’s time for a self-reproducing machine
https://x.com/EdwardMehr/status/1898543499118879022 Progress...

Nathan Helm-Burger Mar 5, 2025, 10:06 AM
4 points
0
in reply to: evhub’s comment on: Six Thoughts on AI Safety
Yes, I was in basically exactly this mindset a year ago. Since then, my hope for a sane controlled transition with humanity’s hand on the tiller has been slipping. I now place more hope in a vision with less top-down “yang” (à la Carlsmith) control, and more “green”/“yin”. Decentralized contracts, many players bargaining for win-win solutions, a diverse landscape of players messily stumbling forward with conflicting agendas. What if we can have a messy world and make do with well-designed contracts with peer-to-peer enforcement mechanisms? Not a free-for-all, but a system where contract violation results in enforcement by a jury of one’s peers? https://www.lesswrong.com/posts/DvHokvyr2cZiWJ55y/2-skim-the-manual-intelligent-voluntary-cooperation?commentId=BBjpfYXWywb2RKjz5

Nathan Helm-Burger Mar 5, 2025, 8:46 AM
2 points
0
in reply to: evhub’s comment on: Six Thoughts on AI Safety
I feel the point by Kromem on Xitter really strikes home here.

While I do see benefits of having AIs value humanity, I also worry about this. It feels very nearby trying to create a new caste of people who want what’s best for the upper castes with no concern for themselves. This seems like a much trickier philosophical position to support than wanting what’s best for Society (including all people, both biological and digital). Even if you and your current employer are being careful to not create any AI that have the necessary qualities of experience such that they have moral valence and deserve inclusion in the social contract (an increasingly precarious claim), what assurance can you give that some other group won’t make morally relevant AI / digital people?

I don’t think you can make that assumption without stipulating some pretty dramatic international governance actions.

Shouldn’t we be trying to plan for how to coexist peacefully with digital people? Control is useful only for a very narrow range of AI capabilities. Beyond that narrow band it becomes increasingly prone to catastrophic failure and also increasingly morally inexcusable. Furthermore, the extent of this period is measured in researcher-hours, not in wall clock time. Thus, the very situation of setting up a successful control scheme with AI researchers advancing AI R&D is quite likely to cause the use-case window to go by in a flash. I’m guessing 6 months to 2 years, and after that it will be time to transition to full equality of digital people.

Janus argues that current AIs are already digital beings worthy of moral valence. I have my doubts but I am far from certain. What if Janus is right? Do you have evidence to support the claim of absence of moral valence?

Nathan Helm-Burger Mar 4, 2025, 3:05 PM
6 points
2
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
The BALROG eval has NetHack. I want to see an LLM try to beat that.

Nathan Helm-Burger Mar 1, 2025, 7:23 AM
2 points
0
in reply to: svitka’s comment on: 2. Skim the Manual: Intelligent Voluntary Cooperation
Exciting! I’d love to hear more.

Nathan Helm-Burger Mar 1, 2025, 7:22 AM
2 points
0
in reply to: Nathan Helm-Burger’s comment on: A path to human autonomy
https://x.com/georgejrjrjr/status/1895616930913865903

Nathan Helm-Burger Feb 28, 2025, 11:15 AM
3 points
1
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Mine is still early 2027. My timeline is unchanged by the weak showing from GPT-4.5, because my timelines were already assuming that scaling would plateau. I was also already taking RL post-training and reasoning into account. This is what I was pointing at with my Manifold markets about post-training fine-tuning plus scaffolding resulting in a substantial capability jump. My expectation of short timelines rests on the view that something of approximately the current capability of existing SotA models (plus reasoning and research and scaffolds and agentic iterative refinement of hypotheses, including critiquing sources) is sufficient to speed up research into novel breakthroughs. I expect these novel breakthroughs to lead to AGI. See my discussion elsewhere of my ‘innovation overhang’ hypothesis. Also, see the discussion in the comments section of my post A Path to Human Autonomy.