Controlling AI to produce outputs that accelerate AI Safety Research
Ram Potham
It seems like a somewhat natural but unfortunate next step. At Epoch AI, they see a massive future market in AI automation and have a better understanding of evals and timelines, so now they aim to capitalize on it.
To clarify, is the optimal approach the following?
AI Control for early transformative AI
Controlled AI accelerates AI Safety research, creating an automated AI Safety Researcher
Hope we get ASI aligned
If so, what are currently the most promising approaches for AI control on transformative AI?
Memory in AI Agents can cause a large security risk. Without memory, it’s easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide
a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network’s weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.
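For concreteness, here is a minimal sketch of the retrieval-style memory scaffolding the Atlas describes; the `MemoryStore` class and its word-overlap relevance score are hypothetical stand-ins for a real embedding-based retriever:

```python
from dataclasses import dataclass, field

# Minimal sketch of retrieval-based agent memory (illustrative only).
@dataclass
class MemoryStore:
    entries: list[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        # The agent persists an observation or conclusion after each step.
        self.entries.append(text)

    def read(self, query: str, k: int = 3) -> list[str]:
        # Crude relevance score: shared words between the query and an entry.
        # A real system would use embeddings and vector search instead.
        def score(entry: str) -> int:
            return len(set(query.lower().split()) & set(entry.lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]

memory = MemoryStore()
memory.write("User prefers concise answers")
memory.write("The deployment environment uses Kubernetes")
# Retrieved context is prepended to the prompt before each response.
context = memory.read("How should I format my answer to the user?")
```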
We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents are constantly learning and can develop motivations not aligned with human volition.
Any thoughts on ensuring safety in agents that can update their memory?
I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
Effective altruism insists on doing the most good per unit of resource, but can demand extreme sacrifices (e.g., donating almost all disposable income).
Two‑level utilitarianism lets us follow welfare‑promoting rules in daily life and switch to explicit cost‑benefit calculations when rules conflict. Yet it offers little emotional motivation.
The Bodhisattva ideal roots altruism in felt interdependence: the world’s suffering is one’s own. It supplies deep motivation and inner peace, but gives no algorithm for choosing the most beneficial act.
A rational Bodhisattva combines the strengths and cancels the weaknesses:
Motivation: Like a Bodhisattva, they experience others’ suffering as their own, so compassion is effortless and durable.
Method: Using reason and evidence (from effective altruism and two‑level utilitarianism), they pick the action that maximizes overall benefit.
Flexibility: They apply the “middle way,” recognizing that different compassionate choices can be permissible when values collide.
Illustration
Your grandparent needs $50,000 for a life‑saving treatment, but the same money could save ten strangers through a GiveWell charity.
A strict effective altruist/utilitarian would donate to GiveWell.
A purely sentimental agent might fund the treatment.
The rational Bodhisattva weighs both outcomes, also factoring duties into the calculation, acts from compassion, and accepts the result without regret. In most cases they will choose the option with the greatest net benefit, but they can act otherwise when a compassionate rule or relational duty justifies it.
Thus, the rational Bodhisattva unites rigorous impact with deep inner peace.
I wonder why evolution has selected for this flawed lens and why decision-making shortcuts are selected for. It seems to me that the better a picture of reality we have, the greater our ability to survive and reproduce.
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, the correlation between compression ratio and intelligence is really interesting. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling accounts for a large part of the intelligence LLMs exhibit.
I agree we should see a continually updated compression benchmark.
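As a toy illustration of the compression framing (using zlib as a crude stand-in; a real benchmark would score the model's own predictive distribution in bits per byte on held-out text):

```python
import zlib

# Toy compression-ratio measurement; lower ratios mean more regularity captured.
# zlib is only a stand-in here, not the model-based compression gwern discusses.
text = b"the quick brown fox jumps over the lazy dog " * 100
ratio = len(zlib.compress(text, level=9)) / len(text)
print(f"compression ratio: {ratio:.3f}")
```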
Accelerating AI Safety research is critical for developing aligned systems, and transformative AI is a powerful means of doing so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for the following core reasons:
Benchmarks may not represent real world ability
Benchmark information can be leaked into AI model training
However, we can have a moving benchmark: reproducing the most recent technical alignment research using its methods and data.
By reproducing a paper, something many researchers do, we ensure the benchmark reflects real-world ability.
By only using papers after a certain model finished training, we ensure data is not leaked.
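A minimal sketch of that selection rule, with made-up paper records and dates just to illustrate the filtering:

```python
from datetime import date

# Only score a model on alignment papers published after its training cutoff,
# so the papers' results cannot have leaked into its training data.
papers = [
    {"title": "Paper A", "published": date(2025, 3, 1)},   # hypothetical
    {"title": "Paper B", "published": date(2024, 6, 15)},  # hypothetical
]
model_training_cutoff = date(2024, 12, 31)

eligible = [p for p in papers if p["published"] > model_training_cutoff]
# Benchmark task: reproduce each eligible paper's results from its methods and data.
print([p["title"] for p in eligible])  # -> ['Paper A']
```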
This comes with the inconvenience of making it difficult to compare models or approaches across two papers because the benchmark is always changing, but I argue this is a worthwhile tradeoff:
We can still compare approaches if the same papers are used for both, ensuring the papers were produced after all models finished training
Benchmark data is always recent and relevant
Any thoughts?
Thanks for the trendlines—they help us understand when AI can automate years of work!
Like you said, the choice of tasks can heavily change the trendline:
Our estimate of the length of tasks that an agent can complete depends on methodological choices like the tasks used and the humans whose performance is measured. However, we’re fairly confident that the overall trend is roughly correct, at around 1-4 doublings per year. If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks.
I believe SWE-bench is the best benchmark to control for variables like the choice of task and how the agentic system is built, so I’m leaning more towards the doubling time of 70 days.
For a large-scale / complex app, it takes around 1 year of development (though this is not a completely fair estimate since it doesn't take into account the number of man-hours). Going with this estimate and the SWE-bench doubling time, it takes around 13 doublings from the beginning of 2025, or until around June 2027, to automate the production of entire apps / complex websites.
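As a rough check of that arithmetic (the 15-minute starting horizon is my own assumption for illustration; the 70-day doubling time is the SWE-bench figure above):

```python
from datetime import date, timedelta
from math import log2

doubling_days = 70          # SWE-bench doubling time discussed above
start_horizon_hours = 0.25  # assumed agent task horizon at the start of 2025
target_hours = 2000         # ~1 work-year of development effort

doublings = log2(target_hours / start_horizon_hours)  # ~13
eta = date(2025, 1, 1) + timedelta(days=doublings * doubling_days)
print(round(doublings), eta)  # -> 13 and a date in mid-2027
```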
Another big factor that this trendline and other trendlines don't take into account is the amount of AI acceleration. If AI automates a large portion of the work, the time to double would shorten as AI gets better; I'd be interested to see how that would affect this model.
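A toy way to fold acceleration into the same arithmetic is to let each successive doubling take a bit less time than the last; the 10% speedup per doubling is an arbitrary assumption, just to show the direction of the effect:

```python
from datetime import date, timedelta

doubling_days = 70.0  # initial doubling time from the SWE-bench trend
shrink = 0.9          # assumed: each doubling is 10% faster than the previous one
days = sum(doubling_days * shrink**k for k in range(13))
print(date(2025, 1, 1) + timedelta(days=days))  # lands roughly a year earlier than mid-2027
```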
I believe we are doomed from superintelligence but I’m not sad.
There are simply too many reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned) where S_n is the next level of superintelligence. This probability is less than 1.
As long as misalignment keeps increasing and superintelligence iterates on itself exponentially fast, we are bound to get misaligned superintelligence. Misalignment could decrease due to generalizability, but we have no way of knowing whether that's the case, and it is optimistic to think so.
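A quick worked example of the compounding (the per-step probability is arbitrary, chosen only to illustrate the decay):

```python
# If each self-improvement step preserves alignment with probability p < 1,
# the chance that generation n is still aligned is p**n, which tends to zero.
p = 0.999  # assumed per-step probability, for illustration only
for n in (10, 100, 1_000, 10_000):
    print(n, round(p**n, 4))  # 0.99, 0.9048, 0.3677, 0.0
```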
The misaligned superintelligence will destroy humanity. This can lead to a lot of fear or a wish to die with dignity, but that misses the point.
We only live in the present, not the past nor the future. We know the future is not optimistic but that doesn’t matter, because we live in the present and we constantly work on making the present better by expected value.
By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But, we still model the best decision to make.
We model that our work on AI safety is the best decision because it buys us more time by expected value. And, maybe, we can buy enough time to upload our mind to a device so our continuation lives on despite our demise.
Then, even knowing the future is bad, we remain happy in the present working on AI safety.
I believe the best motivation is compassion without attachment. Being attached to any specific future—we are certain to be doomed or I have this plan that will maybe save us all—is pointless speculation.
Instead, we should use our Bayesian-inference mind to find the most likely path to help—out of compassion—and pursue it with all our heart. Regardless of whether it works or not, we aren't sad or suffering, since our mind is just filled with compassion.
If we’ve done this, then we’ve done our best.
Thanks so much for the comprehensive analysis—this makes it easier to reason about how the political situation and trendlines affect AI development.
I have some objections to the claims that AGI and ASI will be released and that wrapper startups can use them, as mentioned below:
Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make. 10% of Americans, mostly young people, consider an AI “a close friend.” For almost every white-collar profession, there are now multiple credible startups promising to “disrupt” it with AI.
Once this AGI exists, it can recreate all these B2B SaaS products at low cost, since SaaS is mostly plug-and-play and doesn't depend on specific company IP / resources
B2B SaaS created by OpenBrain likely has more credibility than most other B2B SaaS, as OpenBrain likely has a large reputation and following as the leading AI company
If OpenBrain owns all B2B SaaS, their profits, resources, and compute likely increase
An API to AGI will allow competitors to develop their own AGI by accelerating their foundation model development
Given competitive pressures and greater profit, AGI will not be released and will instead be used to create services.
I am also much more pessimistic about the endings after September 2027 on the website for the following reasons:
It took humans to help detect and roll back the misaligned AI in the predictions, which will likely happen
There is likely no complete alignment in the AI: given a specific context that is out of the training distribution, it will likely scheme, even if only 0.0001% of the time
These misalignment issues will likely accumulate as it progressively trains stronger superintelligences
Humans will not be able to tell at all what the stronger superintelligences are doing—they could only barely tell with the weaker superintelligence
Finally, a sufficiently misaligned superintelligence replaces all humans with new lifeforms, as in the pessimistic ending, to satisfy its value function
I can accept the argument that even if AIs are misaligned 0.0001% of the time, this misalignment will only decrease over time as AI generalizes better, but I find this too optimistic.
Thanks again for working on this, hopefully this will inform government policy in the right direction.
There are patterns in how the world works, and this is an effective way of finding those patterns. The optimal way of finding the right beliefs / patterns involves no attachment to views—anticipating a result seems a little counterproductive to finding optimal patterns, as it biases you to hold onto current beliefs.
Simply ask questions—explore and exploit working patterns.
This is an instance of the Halo Effect in addition to Undue Deference: we believe they are good strategic thinkers because good researchers must be brilliant in all fields.
It is still important to value their view, compare it with the views of strategic thinkers, and find symmetries that can better predict answers to questions.
Thanks for the commentary—it helps in better understanding why some people are pessimistic and others optimistic about ASI.
However, I'm not convinced by the argument that an alignment technique will consistently produce AI with human-compatible values, as you mention below:
An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution[5]. If you use this technique once, gradient descent will not thereafter change its inductive biases to make your technique less effective. There’s no creative intelligence that’s plotting your demise
When we have self-improving AI, architectures may change and become more complex. Then how do we know the alignment of the stronger AIs is still sufficient? How can we tell a sycophant or schemer from a saint?
I agree that a potential route to get there is personal intent alignment.
What are your thoughts on using a survey like World Value Survey to get value alignment?
With all this new funding, are we finally turning LW-Bux into real money? I’m ready to sell!
Definitely agree that an AI with no goal and maximum cooperation is best for society
A couple questions to help me understand the argument better:
CEV also acknowledges the volition / wants of all humans, finding a unified will, then creating a utility function around it. What are the differences between recursive alignment and CEV?
As another note, how does the AI choose the best decision to make—is it by evaluating all decisions with a utility function, doing pairwise comparisons of the top decisions, or simply inferring a single decision?
You mentioned inferring a single decision and retraining to ensure it stays aligned—though this might be suboptimal. Since you did mention CoT, however, the agent might be comparing and simulating the outcomes of decisions in its thought process.
Who do we account for in the alignment, and how do we weight them? E.g., how should AIs, humans, animals, and insects be weighted? We could have a huge number of AIs, but we don't want humans to count as a minuscule fraction.
I assume that since we want an unbiased view, it won't simply be an equal weighting of all parties, but will instead involve finding a view that aims for an acceptable middle ground—then the question is what that middle ground looks like.
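One toy way to see why the weighting matters: normalizing per individual lets a very large AI population swamp humans, while normalizing per group does not (the population figures below are arbitrary):

```python
# Compare two weighting schemes for whose volition counts in the aggregate.
populations = {"humans": 8e9, "AIs": 1e12, "animals": 1e11}

per_individual = {k: v / sum(populations.values()) for k, v in populations.items()}
per_group = {k: 1 / len(populations) for k in populations}

print(f"{per_individual['humans']:.4f}")  # ~0.007 -- humans become a minuscule fraction
print(f"{per_group['humans']:.4f}")       # ~0.333 -- each group counts equally
```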
I have experienced problems similar to yours when building an AI tool—better models did not necessarily lead to better performance despite external benchmarks. I believe there are 2 main reasons for this, both alluded to in your post:
Selection Bias—when a foundation model company releases its newest model, it shows performance on the benchmarks most favorable to it
Alignment—you mentioned how the AI does not truly understand the instructions as you meant them. While this can be mitigated by writing better prompts, it does not fully solve the issue
Based on previous data, it seems like CCP AGI will be less safety-tested than US AGI. Take Benchmark results:
DeepSeek R1: Demonstrated a 100% failure rate in blocking harmful prompts related to bioweapons, according to Anthropic’s safety tests.
OpenAI GPT-4o: Showed an 86% failure rate in the same tests, indicating better but still concerning gaps in safety measures.
Meta Llama-3.1-405B: Had a 96% failure rate, performing slightly better than DeepSeek but worse than OpenAI.
Though if it were just the CCP making AGI, or just the US, it might be better because it would reduce competitive pressures. But due to competitive pressures and investments like Stargate, the AGI timeline is accelerated, and the first AGI model will likely not be safety-tested as well as it could be.