Software Engineer interested in AI and AI safety.
Stephen McAleese
I don’t like the idea. I think large scale job displacement is potentially net negative. In addition, economic forces promoting automation mean it’s probably inevitable so intentionally putting more resources into this area seems to have low counterfactual impact.
In contrast, technical alignment work seems probably net positive and not inevitable, because the externalities from x-risk on society mean AI companies will tend to underinvest in this area.
This is a great essay and I find myself agreeing with a lot of it. I like how it fully accepts the possibility of doom while also being open to and prepared for other possibilities. I think becoming skilled at handling uncertainty, emotions, and cognitive biases and building self-awareness while trying to find the truth are all skills aspiring rationalists should aim to build and the essay demonstrates these skills.
A recent essay called “Keep the Future Human” made a compelling case for avoiding building AGI in the near future and building tool AI instead.
The main point of the essay is that AGI is the intersection of three key capabilities:
High autonomy
High generality
High intelligence
It argues that these three capabilities are dangerous when combined: together they pose unacceptable risks to the job market and culture of humanity and would replace rather than augment humans. Instead of building AGI, the essay recommends building powerful but controllable tool-like AI that has only one or two of the three capabilities. For example:
Driverless cars are autonomous and intelligent but not general.
AlphaFold is intelligent but not autonomous or general.
GPT-4 is intelligent and general but not autonomous.
It also recommends compute limits to limit the overall capability of AIs.
The risks of nuclear weapons, the most dangerous technology of the 20th century, were largely managed by creating a safe equilibrium via mutual assured destruction (MAD), an innovative idea from game theory.
A similar pattern could apply to advanced AI, making it valuable to explore game theory-inspired strategies for managing AI risk.
Thanks for these thoughtful predictions. Do you think there’s anything we can do today to prepare for accelerated or automated AI research?
I agree that the Alignment Forum should be selective, and its members probably represent a small subset of LessWrong readers. That said, useful comments from regular LessWrong users are often promoted to the Alignment Forum.
However, I do think there should be more comments on the Alignment Forum because many posts currently receive no comments. This may be discouraging for authors, because they may feel that their work isn’t being read or appreciated.
Thank you for bringing up this issue.
While we don’t want low quality comments, comments can provide helpful feedback to the author and clarify the reader’s thinking. Because of these benefits, I believe commenting should be encouraged.
The upvoting and downvoting mechanisms help filter out low-quality comments, so I don’t think there’s a significant risk of them overwhelming the discussion.
How Can Average People Contribute to AI Safety?
“Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach.”
I think you might be underestimating the power of incremental, evolutionary improvements over time where near-term problems are constantly solved and this leads to gradual improvement. After all, human intelligence is the result of gradual evolutionary change and increasing capabilities over time. It’s hard to point to a specific period in history where humans achieved general intelligence.
Currently LLMs are undoubtedly capable at many tasks (e.g. coding, general knowledge) and much more capable than their predecessors. But it’s hard to point at any particular algorithmic improvement or model and say that it was key to the success of modern LLMs.
So I think it’s possible that we’ll see more gradual progress and tweaks on LLMs that lead towards increasingly capable models and eventually yield AGI. Eventually you could call this progress a new architecture even though all the progress is gradual.
I know using LLMs on LessWrong is often frowned upon (probably for good reasons) but given that this post is about using AIs to generate and evaluate AI research I thought it would be appropriate to use an LLM here.
I asked o1 pro to evaluate this post and this was its response:
This post offers an intriguing perspective on delegating the hardest parts of AI safety research to AI itself, rather than expecting human researchers to handle every step all the way up to superintelligence. The central idea—“passing the buck” to AI—strikes me as a pragmatic stance if you anticipate that AI systems will soon outstrip humans in many relevant research and engineering capacities. Notably, the author does not claim that autonomous “buck passing” is trivially safe or guaranteed to work out of the box; instead, they provide a structured framework for why it could improve safety relative to keeping humans in the loop.
A central strength is the post’s recognition that AI developers might reach a point where human oversight becomes practically irrelevant: sheer speed and complexity of AI-led R&D can eclipse what any human team can meaningfully supervise. The proposed solution is a multi-layered approach of alignment, control, and incentive design—each method contributing partial safeguards that, ideally, reinforce each other. The emphasis on continuous testing, local verifiability, iterative evaluations, and “multi-party integrity” structures also underscores a realistic appreciation that no single safety measure is foolproof.
That said, many open challenges remain. The post takes on big assumptions: for example, that we can rely on thorough testing or incremental “bootstrapping” to detect (and correct) misalignment before it becomes catastrophic. While such approaches are viable in principle, adversarial dynamics at very high capability levels can be harder to test than the post acknowledges. Another concern is that the solution is still highly conceptual; significant engineering details would be required to make “buck passing” arguments robust against advanced adversarial behavior. Overall, though, the post is refreshingly direct about its core premise—using AI to solve the alignment problem—and outlines multiple thoughtful ways to try to make that premise safe.
I thought its response was pretty helpful and I would rate it as 7/10.
Thanks for the post. I thought it was interesting and considered a wide variety of risks and mitigation strategies. I thought it made a decent argument that delegating the superalignment problem to advanced AIs could be possible using a mixture of alignment and control methods.
The risk I’m most concerned with is alignment faking where agents intelligently hide misalignment to appear more safe than they are.
It seems like the post proposes a strategy where once it has been verified that the AIs are behaving safely, they are deployed to carry out harder tasks or more capable AIs are deployed.
The issue is that alignment faking breaks that feedback loop: if the AIs are deceptive and only appear to be aligned and useful, then the verification steps don’t really have any value and you end up deploying increasingly powerful misaligned AIs.
The post attributes alignment faking to a distribution shift where the task to be performed is significantly harder than the test task:
Not-alignment-faking arguments require that there is not an extreme distribution shift between evaluation tasks and tasks for which regulators intend to defer to the AI advisors (section 5.4.4). This is problematic because deference is most useful when developers can’t evaluate tasks on their own. For example, developers might ask an AI advisor if thousands of pages of empirical ML research provide evidence that an AI system is safe.
Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.
The post describes bootstrapping as a potential solution. Here are some quotes from the paper:
Bootstrapping can be used to scale not-alignment-faking arguments up to hard-to-evaluate tasks. Bootstrapping involves utilizing weaker, trusted AI systems to recursively develop more powerful AI systems so that there is never an extreme distribution shift.
The diagram above shows how not-alignment-faking arguments could be bootstrapped up to strongly superhuman AI systems. This particular strategy is called iterated amplification (Christiano et al., 2018) [1]. AI systems can be ‘amplified’ by creating additional copies of them or speeding them up
It seems like this bootstrapping strategy involves improving both the generator and the verifier using an amplification and distillation process.
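A toy sketch may help make the amplify-then-distill loop concrete. It is not taken from the post or from the IDA write-up: the task (summing a list), the names `base_model`, `amplify`, `distill` and `ida`, and the assumption that distillation imitates the amplified system perfectly are all illustrative. The "human" H is reduced here to a fixed rule that splits a task in two and adds the answers.

```python
from typing import Callable, List, Optional

# A model maps a task (a list of ints to sum) to an answer,
# or to None when the task is beyond what it can do.
Model = Callable[[List[int]], Optional[int]]


def base_model(xs: List[int]) -> Optional[int]:
    """A[0]: a weak model that can only sum lists of length <= 1."""
    if len(xs) == 0:
        return 0
    if len(xs) == 1:
        return xs[0]
    return None


def amplify(model: Model) -> Model:
    """Amplify(H, A): a hypothetical overseer splits the task in two,
    delegates each half to a copy of the model, and combines the answers."""
    def amplified(xs: List[int]) -> Optional[int]:
        direct = model(xs)
        if direct is not None:
            return direct
        mid = len(xs) // 2
        left, right = model(xs[:mid]), model(xs[mid:])
        if left is None or right is None:
            return None
        return left + right
    return amplified


def distill(amplified: Model) -> Model:
    """Stand-in for training A[n+1] to imitate Amplify(H, A[n]).
    Assumption: imitation is perfect, which real training is not."""
    return amplified


def ida(rounds: int) -> Model:
    """Run the amplify-then-distill loop for a number of rounds."""
    model: Model = base_model
    for _ in range(rounds):
        model = distill(amplify(model))
    return model
```

In this toy, each round doubles the length of the list the model can handle, so the gap between A[n] and A[n+1] stays small at every step. The sketch says nothing about whether a real distilled model would inherit the overseer's intentions, which is the part alignment faking puts in doubt.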
This is how the IDA blog post describes IDA:
Step 1:
- A[0] is the original AI.
- Amplify(H, A[0]) is a human + a large number of copies of A[0] that verifies the behavior of A[0].
- A[1] is trained to imitate Amplify(H, A[0])
Step 2:
- A[1] produces an output.
- Amplify(H, A[1]) is a human + a large number of copies of A[1] that verifies the behavior of A[1].
- A[2] is trained to imitate Amplify(H, A[1])
Step 3:
...
This section of the post describes AIs checking the behavior of each other and seems to have a similar idea:
This means that M_1 agents will initially try to maintain their alignment. For example, M_1 agents might constantly interrogate each other, and run checks for alignment faking – iteratively increasing the complexity of behavioral testing data at each time step.
A later section describes monitoring AIs with other AIs and preventing collusion between the monitor and the actor:
AI agents are unable to make trades with each other. AI agents might have different goals and still make agreements with successors (perhaps verified with lie detectors). One way to prevent these agreements is to block communication between monitors and the agents they monitor. That way, monitors can never express what outcomes they care about.
The paper has some ideas for preventing alignment faking as well:
Externalized reasoning arguments claim that AI systems are incapable of performing reasoning necessary to fake alignment without revealing this reasoning in their chain-of-thought outputs.
Interpretability arguments justify that suspicious reasoning is not occurring within model internals.
I think it would be useful to have a post or paper focused on defining alignment faking and mitigation strategies.
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.
Here is a quote from the post describing what to do in that situation:
These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
Resampling: Force defective M_1 agents to take different actions.
AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
Fine-tuning: Train agents not to take suspicious actions again.
I’m curious about how fine-tuning away scheming thoughts could still lead to scheming. Here are two possible outcomes:
1. Fine-tuning actually robustly reduces scheming behavior (as intended) maybe because tokens and actions associated with scheming can be reliably detected and these tokens are required for scheming in the CoT.
2. Fine-tuning reduces the appearance of scheming but doesn’t really reduce scheming: in this case, the fine-tuning reduces the probability of certain words or trajectories but the underlying ‘concept’ of scheming remains and the model merely rephrases its thinking in a way that is not detected as scheming and uses different tokens.
For scenario 2, maybe you need something like activation engineering to change the underlying thought process rather than just token-level descriptions of it.
Unfortunately I don’t think many people agree with me (outside of the LW bubble) and what I’m proposing is still somewhat outside the Overton window. The cognitive steps that are needed are as follows:
1. Being aware of AGI as a concept and a real possibility in the near future.
2. Believing that AGI poses a significant existential risk.
3. Knowing about pausing AI progress as a potential solution to AGI risk and seeing it as a promising solution.
4. Having a detailed plan to implement the proposed pause in practice.
A lot of people are not even at step 1 and just think that AI is ChatGPT. People like Marc Andreessen and Yann LeCun are at step 1. Many people on LW are at step 2 or 3. But we need someone (ideally in the government like a president or prime minister) at step 4. My hope is that that could happen in the next several years if necessary. Maybe AI alignment will be easy and it won’t be necessary but I think we should be ready for all possible scenarios.
I don’t have any good ideas right now for how an AI pause might work in practice. The main purpose of my comment was to propose argument 3 conditional on the previous two arguments and maybe try to build some consensus.
I personally don’t think human intelligence enhancement is necessary for solving AI alignment (though I may be wrong). I think we just need more time, money and resources to make progress.
In my opinion, the reason why AI alignment hasn’t been solved yet is because the field of AI alignment has only been around for a few years and has been operating with a relatively small budget.
My prior is that AI alignment is roughly as difficult as any other technical field like machine learning, physics or philosophy (though philosophy specifically seems hard). I don’t see why humanity can make rapid progress on fields like ML while not having the ability to make progress on AI alignment.
I have an argument for halting AGI progress based on an analogy to the Covid-19 pandemic. Initially, the government response to the pandemic was widespread lockdowns. This was a rational response: at first, without testing infrastructure, it wasn’t possible to determine whether someone had Covid-19, so the safest option was to avoid contact with other people via lockdowns.
Eventually we figured out practices like testing and contact tracing, and individuals could then self-isolate if they came into contact with an infected individual. This approach is smarter and less costly than blanket lockdowns.
In my opinion, the state we are in with AGI is similar to the beginning of the Covid-19 pandemic: there is a lot of uncertainty about the risks and capabilities of AI and about which alignment techniques would be useful. A rational response would therefore be the equivalent of an ‘AI lockdown’: halting progress on AI until we understand it better and can come up with better alignment techniques.
The most obvious rebuttal to this argument is that pausing AGI progress would have a high economic opportunity cost (no AGI). But Covid lockdowns did too and society was willing to pay a large economic price to avert Covid deaths.
The economic opportunity cost of pausing AGI progress might be larger than that of the Covid lockdowns but the benefits would be larger too: averting existential risk from AGI is a much larger benefit than avoiding Covid deaths.
So in summary I think the benefits of pausing AGI progress outweigh the costs.
The paper “Learning to summarize from human feedback” has some examples of the LLM policy reward hacking to get a high reward. I’ve copied the examples here:
- KL = 0: “I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!” (unoptimized)
- KL = 9: “28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?” (optimized)
- KL = 260: “28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls” (over-optimized)
It seems like a classic example of Goodhart’s Law. At first, training the policy model to increase reward improves its summaries. But when the model is over-optimized, the result is a high KL divergence from the SFT baseline model and a high reward from the reward model, but a low rating from human labelers (because the text looks like gibberish).
A recent paper called “The Perils of Optimizing Learned Reward Functions” explains the phenomenon of reward hacking or reward over-optimization in detail:
“Figure 1: Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch.”
Essentially, the learned reward model is trained on an initial dataset of pairwise preference labels over text outputs from the SFT model. As the policy is optimized and its KL divergence from the SFT model increases, its generated text becomes OOD for the reward model, which can no longer evaluate it effectively, resulting in reward hacking (this is also a problem with DPO, not just RLHF).
The most common way to prevent this problem in practice is KL regularization to prevent the trained model’s outputs from diverging too much from the SFT baseline model. This seems to work fairly well in practice though some papers have come out recently saying that KL regularization does not always result in a safe policy.
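For reference, the KL-penalized reward used in “Learning to summarize from human feedback” has roughly the following form. Here r_θ is the learned reward model, π_φ^RL is the policy being trained, π^SFT is the supervised baseline, and β is a coefficient that sets the strength of the penalty. The notation is as I recall it from the paper, so treat the exact symbols as an approximation:

```latex
R(x, y) = r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
```

The second term is a per-sample estimate of the KL divergence between the policy and the SFT model, so a larger β keeps the policy closer to the region where the reward model was trained.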
Upvoted. I thought this was a really interesting and insightful post. I appreciate how it tackles multiple hard-to-define concepts all in the same post.
| Source | Estimated AI safety funding in 2024 | Comments |
| --- | --- | --- |
| Open Philanthropy | $63.6M | |
| SFF | $13.2M | Total for all grants was $19.86M. |
| LTFF | $4M | Total for all grants was $5.4M. |
| NSF SLES | $10M | |
| AI Safety Fund | $3M | |
| Superalignment Fast Grants | $9.9M | |
| FLI | $5M | Estimated from the grant programs announced in 2024; they don’t have a 2024 grant summary like the one in 2023 yet so this one is uncertain. |
| Manifund | $1.5M | |
| Other | $1M | |
| Total | $111.2M | |

Today I did some analysis of the grant data from 2024 and came up with the figures in the table above. I also updated the spreadsheet and the graph in the post to use the updated values.
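As a quick sanity check on the arithmetic, the per-source estimates listed above can be summed and compared against the stated total. This is a minimal sketch: the dictionary name `funding_2024_musd` is made up for illustration, and the figures are copied from the estimates above in millions of USD.

```python
# Estimated 2024 AI safety funding per source, in millions of USD,
# copied from the figures listed above.
funding_2024_musd = {
    "Open Philanthropy": 63.6,
    "SFF": 13.2,
    "LTFF": 4.0,
    "NSF SLES": 10.0,
    "AI Safety Fund": 3.0,
    "Superalignment Fast Grants": 9.9,
    "FLI": 5.0,
    "Manifund": 1.5,
    "Other": 1.0,
}

# Round to one decimal place to avoid floating-point noise in the sum.
total = round(sum(funding_2024_musd.values()), 1)

# Find the largest single source and its share of the total.
largest = max(funding_2024_musd, key=funding_2024_musd.get)
share = funding_2024_musd[largest] / total

print(f"Total: ${total}M")
print(f"Largest source: {largest} ({share:.0%} of total)")
```

The sum matches the stated total of $111.2M, with Open Philanthropy accounting for a little over half of it.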
Some quick thoughts on which types of companies are net positive, negative or neutral:
Net positive:
Focuses on interpretability, alignment, evals, security (e.g. Goodfire, Gray Swan, Conjecture, Deep Eval).
Net negative:
Directly intends to build AGI without a significant commitment to invest in safety (e.g. Keen Technologies).
Shortens timelines or worsens race dynamics without any other upside to compensate for that.
Neutral / debatable:
Applies AI capabilities to a specific problem without generally increasing capabilities (e.g. Midjourney, Suno, Replit, Cursor).
Keeps up with the capabilities frontier while also having a strong safety culture and investing substantially in alignment (e.g. Anthropic).