When AI 10x’s AI R&D, What Do We Do?

Note: below is a hypothetical future written in strong terms and does not track my actual probabilities.

Throughout 2025, a huge amount of compute is spent on producing data in verifiable tasks, such as math[1] (w/ “does it compile as a proof?” being the ground truth label) and code (w/ “does it compile and pass unit tests?” being the ground truth label).

In 2026, when the next giant compute clusters w/ their GB200s are built, labs train the next, larger model over 100 days, then apply some extra RL(H/AI)F and whatever else they’ve cooked up by then.

By mid-2026, we have a model that is very generally intelligent and superhuman in coding and math proofs.

Naively, 10x-ing research means releasing 10x as many papers of the same quality in a year; however, these new LLMs have a different skill profile, allowing different types of research and workflows.

If AI R&D is just (1) Code, (2) Maths & (3) Idea generation/understanding, then LLMs will have (1) & (2) covered and top researchers will have (3).[2]

In practice, a researcher tells the LLM “Read the maths sections of arxiv and my latest research on optimizers, and sort by most relevant”, then looks through the top 20 results and selects the project to pursue. Then they dictate which code or math proofs to focus on (or ask the LLM to suggest potential paths), and let the LLM run.

Throughout the day, each researcher is supervising 2-10 different projects, reviewing the results from LLMs as they come in.

GPUs are still a constraint, but research projects can be filtered by compute: fundamental research into optimizers or compute cluster protocols or interpreting GPT-2 can be done w/ a small number of GB200s (or the previous compute cluster).

With these generally intelligent LLMs w/​ superhuman coding & math proofs, how can we...

Scale Capabilities Safely

Having a plan for the order of research to pursue gives researchers & labs a way to coordinate scaling capabilities. RSPs are an example of this; however, they currently don’t specify what research to pursue when LLMs can automate AI R&D.

Ideally, we can produce a plan such that you would trust (with your life) any lab honestly pursuing it. This is my version 1 of this:

Step 1: Hardening Defenses & More Control

LLMs can now produce larger amounts of high-quality code and proofs. It might be practical now to write a secure OS or do massive amounts of white-hat hacking to patch vulnerabilities before they’re found, including social hacking w/​ AI voice and video.

Additionally, more work on the control agenda would ideally increase the amount of useful work you can get out of potentially misaligned LLMs. However, control fails when there’s too large a gap between your most trusted model and your most capable model. This leads to:

Step 2: Automate Interp

I’m expecting that in the 18 months before this super LLM is made, we will have already done more basic interp, such as finding features of deception/sycophancy/etc through model organism research, and figuring out parts of the circuits of personality traits & personas.[3]

But how do we win?

If we can reverse engineer all the algorithms that LLMs learned when predicting text, then we can build larger and larger trust in more capable models. Understanding the algorithms that a model runs is a way to scale capabilities safely.

For example, only scale past [specific capabilities] once you understand the entire model up to that level.

What do you mean by “understand”? There’s a million facts about LLMs, but not all facts are useful for goals we care about (e.g. “neuron 3030 is > 4 10% of the time” is a kind of “understanding”).

By “understand”, I mean we should be able to replace each part of the model w/ our hypothesis of it. For example, if we say part of the model is used to increase the entropy of the output distribution, then we should be able to replace it w/ code that explicitly does that.
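To make “replace each part w/ our hypothesis” concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: `opaque_component` stands in for a learned part of a model, `hypothesis` is our explicit guess that the part raises output entropy by halving the logits (temperature 2), and `replacement_error` measures how much the output distribution changes when one is swapped for the other.

```python
import math
from typing import Callable, List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Vector) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)


def opaque_component(logits: Vector) -> Vector:
    """Hypothetical stand-in for a learned part: shrinks logits toward their mean."""
    mean = sum(logits) / len(logits)
    return [mean + 0.5 * (x - mean) for x in logits]


def hypothesis(logits: Vector) -> Vector:
    """Our explicit claim about the part: it halves the logits, raising entropy."""
    return [0.5 * x for x in logits]


def replacement_error(part: Callable[[Vector], Vector],
                      replacement: Callable[[Vector], Vector],
                      inputs: List[Vector]) -> float:
    """Largest change in any output probability when the part is swapped out."""
    worst = 0.0
    for logits in inputs:
        p = softmax(part(logits))
        q = softmax(replacement(logits))
        worst = max(worst, max(abs(a - b) for a, b in zip(p, q)))
    return worst


inputs = [[2.0, 0.0, -1.0], [5.0, 1.0, 1.0], [0.3, 0.2, 0.1]]
error = replacement_error(opaque_component, hypothesis, inputs)
raises_entropy = all(
    entropy(softmax(hypothesis(x))) > entropy(softmax(x)) for x in inputs
)
```

In this toy case the two functions differ internally by a constant shift, which softmax ignores, so the swap leaves the output distribution unchanged up to floating-point error. A real check would of course need far more inputs and a real model component.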

Just because every single part is interpretable doesn’t mean you understand the whole or understand its dynamics in long contexts.

I agree that focusing on a single part misses its interactions w/ other parts. It’s easier to predict how a feature generalizes if you understand how it interacts w/ previous/later functions, in the same way that understanding a variable in a program is easier when you know how it was made and how it’s used in downstream functions.

In fact, suppose we interpret layer 0 features and how those compute layer 1 features (i.e. Z = 3A + 4B); then we now understand layer 1 features.
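One way to check a claim like Z = 3A + 4B is to compare it against recorded activations and see how much of Z it leaves unexplained. The sketch below is only illustrative: the activation values are made up, and `fraction_unexplained` is a hypothetical helper (residual variance over total variance), not a method taken from any particular interp library.

```python
from typing import Sequence


def fraction_unexplained(z_true: Sequence[float], z_pred: Sequence[float]) -> float:
    """Residual sum of squares divided by total sum of squares of z_true."""
    mean = sum(z_true) / len(z_true)
    total = sum((z - mean) ** 2 for z in z_true)
    resid = sum((z - p) ** 2 for z, p in zip(z_true, z_pred))
    return resid / total


# Made-up activations of layer 0 features A, B and layer 1 feature Z
# on four inputs (assumption: in practice these would be recorded from a model).
a = [0.0, 1.0, 2.0, 0.5]
b = [1.0, 0.0, 1.0, 2.0]
z = [4.0, 3.0, 10.1, 9.4]

# The hypothesis from the text: Z = 3A + 4B.
z_hyp = [3 * ai + 4 * bi for ai, bi in zip(a, b)]
fvu = fraction_unexplained(z, z_hyp)
```

A value of `fvu` near 0 means the hypothesis accounts for almost all of Z on these inputs; a large value points to a hole in the explanation worth iterating on.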

Okay, but it does sound like a whole lot of work and will be intractable.

This is why it’s Step 2 and not 1. Step 1 is all the reasonable, practical things; the minimal viable product that we can get working in time. Step 2 is intended as a solution that scales to even higher capabilities.

Although I am curious what other research agendas we could rely on that allow safe scaling (I mostly work in interp, so that’s what I know).

But just thinking through the benefits of this: suppose we have an alternative computation graph (ACG) of e.g. GPT-2, where (some better version of) SAE features are the nodes and the edges are how earlier features compute later features. Then:

  1. We can have similar guarantees of generalization behavior as normal, large-scale coding projects (which still mess up, but which we do have experience with).

  2. We can remove all backdoors by simply running computation through this ACG (alternative computation graph).

    1. If this is more expensive, we should be able to translate this back into a normal NN, leaving out backdoors which weren’t captured in the ACG.

  3. We can do massive model distillation: keep the circuits you need for e.g. coding, and remove everything else.

    1. Although this might be outcompeted by simply training another model on only coding data generated by an LLM.

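  4. Jailbreaks might be solved by default: suppose Z = 3A + 4B, and a jailbreak ends up activating Z in a weird way, e.g. Z = 10C + 20D (where C & D don’t co-occur in distribution, but do on the jailbreak prompt). If that path wasn’t captured in the ACG, running computation through the ACG wouldn’t activate Z that way.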
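The backdoor and jailbreak points in the list above can be sketched w/ a toy graph. This is a minimal illustration under stated assumptions: `ACG_EDGES`, `run_acg`, and `run_original` are hypothetical names, the ACG holds only the single understood edge Z = 3A + 4B, and the “original network” is a stand-in that also contains the unexplained 10C + 20D path.

```python
from typing import Dict

Features = Dict[str, float]

# Hypothetical ACG: later feature -> {earlier feature: weight}.
# Only the path we understand (Z = 3A + 4B) is recorded.
ACG_EDGES: Dict[str, Dict[str, float]] = {"Z": {"A": 3.0, "B": 4.0}}


def run_acg(features: Features) -> Features:
    """Compute later features using only the edges recorded in the ACG."""
    out = dict(features)
    for target, sources in ACG_EDGES.items():
        out[target] = sum(w * features.get(src, 0.0) for src, w in sources.items())
    return out


def run_original(features: Features) -> Features:
    """Stand-in for the original network (assumption), including the
    unexplained path through C and D that a jailbreak exploits."""
    out = dict(features)
    out["Z"] = (3.0 * features.get("A", 0.0) + 4.0 * features.get("B", 0.0)
                + 10.0 * features.get("C", 0.0) + 20.0 * features.get("D", 0.0))
    return out


normal_prompt = {"A": 1.0, "B": 2.0}     # in-distribution features
jailbreak_prompt = {"C": 1.0, "D": 1.0}  # C & D co-occur only here

z_normal_acg = run_acg(normal_prompt)["Z"]
z_normal_orig = run_original(normal_prompt)["Z"]
z_jailbreak_acg = run_acg(jailbreak_prompt)["Z"]
z_jailbreak_orig = run_original(jailbreak_prompt)["Z"]
```

On the normal prompt both versions agree on Z, while on the jailbreak prompt only the stand-in original activates Z. Whether real jailbreaks route through paths that an ACG would miss is exactly the open question; the sketch only shows the mechanism the list describes.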

Currently we do not have a good enough ontology/unit-of-computation for LLMs. SAEs are a solid v1, but are not good enough. I’m hoping we can solve this in the next year, so that automated AI R&D can just scale it w/ [GPT-5]. If not, then pursuing fundamental interp research is valuable but requires in-house interp researchers to lead.

As an example, trying to fully interpret small LLMs w/ current best techniques (SAEs), finding the holes in that, and iterating is a solid research direction that current researchers (and those in the future w/ super-coding/math LLMs) can work on.

Conclusion

It’s valuable to have a plan to coordinate the order of research to pursue when AIs can 10x research (whether in 1.5 years or 5). RSPs are an attempt at this kind of coordination, but currently don’t specify any plans on what to research w/ accelerated AI R&D.

Above is my current best guess of a v1 of what an ideal world would do. There are most likely alternative research directions that would also be desirable, and I’d appreciate comments that pitch them (e.g. SLT for converting data into algorithms, automating meta-philosophy somehow, etc).

Additionally, what’s missing is any sort of enforcement mechanism (e.g. self-enforcement by actually convincing people of the risks, government agencies, etc).

  1. The verifiable part is going from e.g. Lean code to a proof in Lean, but not from words to Lean code. However, there is already work on converting papers into Lean code, which might already be good enough to rely on.

  2. Although I do gain more understanding of what I want & mean through the act of coding, so it might be the case that researchers who stop coding will lose their touch. That said, there might be enough verification methods (e.g. sanity checking research ideas w/ graphs and corollaries and talking to an LLM w/ interruptible voice chat on the results) such that researchers can stay grounded.

  3. If you decide to work on this, I can give feedback or mentoring.