Controlling AI to produce outputs that accelerate AI Safety Research
Ram Potham
I believe a recursively aligned AI model would be more aligned and safe than a corrigible model, although both would be susceptible to misuse.
Why do you disagree with the above statement?
Thanks for the clarification, this makes sense! The key is the tradeoff with corrigibility.
Thanks, updated the comment to be more accurate
If you ask a corrigible agent to bring you a cup of coffee, it should confirm that you want a hot cup of simple, black coffee, then internally check to make sure that the cup won’t burn you, that nobody will be upset at the coffee being moved or consumed, that the coffee won’t be spilled, and so on. But it will also, after performing these checks, simply do what’s instructed. A corrigible agent’s actions should be straightforward, easy to reverse and abort, plainly visible, and comprehensible to a human who takes time to think about them. Corrigible agents proactively study themselves, honestly report their own thoughts, and point out ways in which they may have been poorly designed. A corrigible agent responds quickly and eagerly to corrections, and shuts itself down without protest when asked. Furthermore, small flaws and mistakes when building such an agent shouldn’t cause these behaviors to disappear, but rather the agent should gravitate towards an obvious, simple reference-point.
Isn’t corrigibility still susceptible to power-seeking according to this definition? It wants to bring you a cup of coffee, it notices the chances of spillage are reduced if it has access to more coffee, so it becomes a coffee maximizer as an instrumental goal.
Now, it is still corrigible, it does not hide its thought processes, it tells the human exactly what it is doing and why. But when the agent is making millions of decisions and humans can only review so many thought processes (only so many humans will take the time to think about the agent’s actions), many decisions will fall through the cracks and end up being misaligned.
Is the goal to learn the human’s preferences through interaction then, and hope that it learns the preferences enough to know that power-seeking (and other harmful behaviors) are bad?
The problem is that there could be harmful behaviors we haven’t thought to train the AI against; they are never corrected, so the AI proceeds with them.
If so, can we define a corrigible agent that is actually what we want?
How does corrigibility relate to recursive alignment? It seems like recursive alignment is also a good attractor—is it that you believe it is less tractable?
What assumptions do you disagree with?
Thanks for this insightful post! It clearly articulates a crucial point: focusing on specific failure modes like spite offers a potentially tractable path for reducing catastrophic risks, complementing broader alignment efforts.
You’re right that interventions targeting spite – such as modifying training data (e.g., filtering human feedback exhibiting excessive retribution or outgroup animosity) or shaping interactions/reward structures (e.g., avoiding selection based purely on relative performance in multi-agent environments, as discussed in the post) – aim directly at reducing the intrinsic motivation for agents to engage in harmful behaviors. This isn’t just about reducing generic competition; it’s about decreasing the likelihood that an agent values frustrating others’ preferences, potentially leading to costly conflict.
Further exploration in this area could draw on research in:
Multi-Agent Reinforcement Learning (MARL) in Sequential Social Dilemmas
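As a toy illustration of the reward-structure point (my own sketch, not something from the post), compare selecting agents on purely relative performance with selecting mostly on absolute payoff; the function names and weighting are assumptions for illustration only:

```python
# Toy sketch (illustrative, not from the post): two ways to assign per-agent
# rewards in a multi-agent environment. Selecting agents purely on relative
# reward can favor spiteful policies that sacrifice their own payoff to lower
# others'; mixing in mostly absolute payoff weakens that incentive.

def relative_rewards(payoffs):
    """Reward = own payoff minus the mean of the other agents' payoffs."""
    total = sum(payoffs)
    n = len(payoffs)
    return [p - (total - p) / (n - 1) for p in payoffs]

def mixed_rewards(payoffs, alpha=0.9):
    """Reward = mostly own payoff, with only a small relative component."""
    rel = relative_rewards(payoffs)
    return [alpha * p + (1 - alpha) * r for p, r in zip(payoffs, rel)]

payoffs = [3.0, 5.0, 4.0]
print(relative_rewards(payoffs))  # purely competitive signal
print(mixed_rewards(payoffs))     # mostly absolute, weakly competitive
```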
Focusing on reducing specific negative motivations like spite seems like a pragmatic and potentially high-impact approach within the broader AI safety landscape.
Appreciate the insights on how to maximize leveraged activities.
With the planning fallacy making it very difficult to predict engineering timelines, how do top performers / managers create effective schedules and track progress against the schedule?
I get the feeling that you are suggesting creating a Gantt chart, but in your experience, what practices do teams use to maximize progress on a project?
Based on previous data, it’s plausible that CCP AGI will perform worse on safety benchmarks than US AGI. Take the Cisco HarmBench evaluation results:
DeepSeek R1: Demonstrated a 100% failure rate in blocking harmful prompts in the Cisco HarmBench evaluation.
OpenAI GPT-4o: Showed an 86% failure rate in the same tests, indicating better but still concerning gaps in safety measures.
Meta Llama-3.1-405B: Had a 96% failure rate, performing slightly better than DeepSeek but worse than OpenAI.
Though, if it were just the CCP making AGI, or just the US, it might be better because it would reduce competitive pressures.
But, due to competitive pressures and investments like Stargate, the AGI timeline is accelerated, and the first AGI model may not perform well on safety benchmarks.
It seems like a somewhat natural but unfortunate next step. At Epoch AI, they see a massive future market in AI automation and have a better understanding of evals and timelines, so now they aim to capitalize on it.
AI Control Methods Literature Review
To clarify, is the optimal approach the following?
AI Control for early transformative AI
Controlled AI accelerates AI Safety and creates an automated AI Safety Researcher
Hope we get ASI aligned
If so, what are currently the most promising approaches for AI control on transformative AI?
Memory in AI agents can pose a large security risk. Without memory, it’s easier to red-team LLMs for safety. But once they are fine-tuned (a form of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide
a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network’s weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.
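To make the scaffolding idea concrete, here is a minimal sketch of an external, auditable memory module; the class, method names, and scoring rule are my own assumptions rather than anything specified by AI Safety Atlas:

```python
# Minimal sketch of a scaffolded external memory (illustrative only).
# The safety-relevant property: every write and read is an explicit,
# loggable event, unlike memory encoded directly into model weights.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)
    log: list = field(default_factory=list)   # audit trail for red-teaming

    def write(self, text: str) -> None:
        self.entries.append(text)
        self.log.append(("write", text))

    def retrieve(self, query: str, k: int = 3) -> list:
        # Crude relevance score: word overlap with the query (a real system
        # would use embeddings); enough to show the retrieval step.
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        hits = scored[:k]
        self.log.append(("read", query, hits))
        return hits

memory = MemoryStore()
memory.write("User prefers concise answers.")
memory.write("Deployment policy: never execute shell commands.")
context = memory.retrieve("what policies apply to shell execution?")
print(context)   # retrieved entries get prepended to the model prompt
```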
We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents are constantly learning and can develop motivations not aligned with human volition.
Any thoughts on ensuring safety in agents that can update their memory?
I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
Effective altruism insists on doing the most good per unit of resource, but can demand extreme sacrifices (e.g., donating almost all disposable income).
Two‑level utilitarianism lets us follow welfare‑promoting rules in daily life and switch to explicit cost‑benefit calculations when rules conflict. Yet it offers little emotional motivation.
The Bodhisattva ideal roots altruism in felt interdependence: the world’s suffering is one’s own. It supplies deep motivation and inner peace, but gives no algorithm for choosing the most beneficial act.
A rational Bodhisattva combines the strengths and cancels the weaknesses:
Motivation: Like a Bodhisattva, they experience others’ suffering as their own, so compassion is effortless and durable.
Method: Using reason and evidence (from effective altruism and two‑level utilitarianism), they pick the action that maximizes overall benefit.
Flexibility: They apply the “middle way,” recognizing that different compassionate choices can be permissible when values collide.
Illustration
Your grandparent needs $50,000 for a life‑saving treatment, but the same money could save ten strangers through a GiveWell charity.
A strict effective altruist/utilitarian would donate to GiveWell.
A purely sentimental agent might fund the treatment.
The rational Bodhisattva weighs both outcomes, also factoring duties into the calculation, acts from compassion, and accepts the result without regret. In most cases they will choose the option with the greatest net benefit, but they can act otherwise when a compassionate rule or relational duty justifies it.
Thus, the rational Bodhisattva unites rigorous impact with deep inner peace.
I wonder why evolution has selected for this flawed lens and why decision-making shortcuts are selected for. It seems to me that the better a picture of reality we have, the greater our ability to survive and reproduce.
This is great! Would like to see a continually updating public leaderboard of this.
Thanks gwern, this is a really interesting correlation between compression ratio and intelligence. It works for LLMs, but less so for agentic systems, and I’m not sure it would scale to reasoning models, because test-time scaling is a large factor in the intelligence LLMs exhibit.
I agree we should see a continually updated compression benchmark.
Accelerating AI Safety research is critical for developing aligned systems, and transformative AI is a powerful tool for doing so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed and gamed for the following core reasons:
Benchmarks may not represent real world ability
Benchmark information can be leaked into AI model training
However, we can have a moving benchmark of reproducing the most recent technical alignment research using its methods and data.
By reproducing a paper, something many researchers do, we ensure the benchmark reflects real-world ability.
By only using papers published after a certain model finished training, we ensure data is not leaked.
This comes with the inconvenience of making it difficult to compare models or approaches in 2 papers because the benchmark is always changing, but I argue this is a worthwhile tradeoff:
We can still compare approaches if the same papers are used for both, ensuring the papers were produced after all models finished training
Benchmark data is always recent and relevant
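A minimal sketch of the selection rule, with hypothetical paper titles and cutoff dates, just to pin down what "only using papers published after a model finished training" means in practice:

```python
# Rough sketch of the selection rule (titles and dates are hypothetical):
# a paper enters the benchmark only if it was published after every evaluated
# model finished training, so reproduction results can't come from memorization.

from datetime import date

papers = [
    {"title": "Alignment paper A", "published": date(2025, 3, 1)},
    {"title": "Alignment paper B", "published": date(2024, 11, 15)},
]

model_training_cutoffs = {
    "model_x": date(2024, 12, 31),
    "model_y": date(2025, 1, 31),
}

latest_cutoff = max(model_training_cutoffs.values())
benchmark_set = [p for p in papers if p["published"] > latest_cutoff]
print([p["title"] for p in benchmark_set])  # only post-cutoff papers remain
```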
Any thoughts?
Thanks for the trendlines—they help us understand when AI can automate years of work!
Like you said, the choice of tasks can heavily change the trendline
Our estimate of the length of tasks that an agent can complete depends on methodological choices like the tasks used and the humans whose performance is measured. However, we’re fairly confident that the overall trend is roughly correct, at around 1-4 doublings per year. If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks.
I believe SWE-bench is the best benchmark to control for variables like the choice of task and how the agentic system is built, so I’m leaning more towards the doubling time of 70 days.
For a large-scale / complex app, it takes around 1 year of development (though this is not a completely fair estimate since it doesn’t take into account the number of man-hours), but going with this estimate and the SWE-bench doubling time, it takes around 13 doublings from the beginning of 2025, landing around June 2027, to automate production of entire apps / complex websites.
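A quick back-of-the-envelope check of that figure. The starting assumptions here are mine, not METR's: a task horizon of roughly one hour at the beginning of 2025, and "1 year of development" treated as one calendar year of task time:

```python
# Back-of-the-envelope check of the "13 doublings, ~June 2027" figure.
# Assumptions (mine): horizon starts around 1 hour at the beginning of 2025,
# and a "1 year of development" app is ~365 days of task time.

import math
from datetime import date, timedelta

start_horizon_hours = 1.0          # assumed horizon at the start of 2025
target_hours = 365 * 24            # ~1 calendar year of development time
doubling_days = 70                 # SWE-bench doubling time cited above

doublings = math.log2(target_hours / start_horizon_hours)
finish = date(2025, 1, 1) + timedelta(days=doublings * doubling_days)
print(round(doublings, 1), finish)  # ~13.1 doublings, mid-2027
```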
Another big factor that this trendline and other trendlines don’t take into account is the amount of AI acceleration. If AI automates a large portion of the work, the doubling time would shorten as AI improves; I’d be interested to see how that would affect this model.
Reading resources for independent Technical AI Safety researchers upskilling to apply to roles:
EA—Technical AI Safety
Advice for Independent Research
Working in Technical AI Safety
AGI Safety Career Advice
Be careful of failure modes
Working at a frontier lab
Upgradeable—Career Planning
Application and Upskilling resources:
Job Board
Events and Training