Here are the 2024 AI safety papers and posts I like the most.
The list is very biased by my taste, by my views, by the people who had time to argue to me that their work is important, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of each paper I like, which is also very subjective.
Important ideas—Introduces at least one important idea or technique.
★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled)
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI, Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research, Preventing model exfiltration with upload limits)
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?) (see the sketch after this list)
★ Board games are a good playground for studying interp (Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models)
★ A useful way to think about threats adjacent to self-exfiltration (AI catastrophes and rogue deployments)
★ Micro vs macro control protocols (Adaptive deployment of untrusted LLMs reduces distributed threats)?
★ A survey of ways to make safety cases (Safety Cases: How to Justify the Safety of Advanced AI Systems)
★ How to make safety cases vs scheming AIs (Towards evaluations-based safety cases for AI scheming)
★ An example of how SAEs can be useful beyond being fancy probes (Sparse Feature Circuits)
★ Fine-tuning AIs to use codes can break input/output monitoring (Covert Malicious Finetuning)
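To make the unlearning-evaluation idea starred above concrete, here is a minimal sketch of how such an IID split-based evaluation could look, assuming the paper’s general setup as I understand it: fine-tune the “unlearned” model on one half of the independent facts and check whether accuracy recovers on the held-out half. The names (`retrain_then_test`, `finetune`, `accuracy`) are hypothetical placeholders, not the paper’s code or any library API.

```python
# Minimal sketch (not the paper's code) of evaluating unlearning with an IID
# train/eval split of independent facts. `finetune` and `accuracy` are
# hypothetical stand-ins for a fine-tuning loop and a QA accuracy metric.
import random
from typing import Callable, List, Tuple

Fact = Tuple[str, str]  # (question, answer) pair about one independent fact


def retrain_then_test(
    unlearned_model,
    facts: List[Fact],
    finetune: Callable[[object, List[Fact]], object],
    accuracy: Callable[[object, List[Fact]], float],
    train_frac: float = 0.5,
    seed: int = 0,
) -> float:
    """Fine-tune on an IID half of the facts, then measure accuracy on the rest."""
    rng = random.Random(seed)
    shuffled = list(facts)
    rng.shuffle(shuffled)  # IID split: each fact lands in train or eval at random
    n_train = int(train_frac * len(shuffled))
    train_facts, eval_facts = shuffled[:n_train], shuffled[n_train:]

    retrained = finetune(unlearned_model, train_facts)
    # High recovered accuracy on facts never fine-tuned on suggests the
    # information was still in the weights, i.e. weak unlearning.
    return accuracy(retrained, eval_facts)
```

The point of splitting by independent facts is that recovered accuracy on held-out facts the attacker never fine-tuned on is evidence the information was still latent in the weights, rather than simply re-taught.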
Surprising findings—Presents some surprising facts about the world
★★ A surprisingly effective way to make models drunk (Mechanistically Eliciting Latent Behaviors in Language Models)
★★ A clever initialization for unsupervised explanations of activations (SelfIE)
★★ Transformers are very bad at single-forward-pass multi-hop reasoning (Yang 2024, Yang 2024, Balesni 2024, Feng 2024)
★ Robustness for ViT is not doomed because of low transfer (When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?)
★ Unlearning techniques are not even robust to changing how questions are framed (Eight methods to evaluate robust unlearning in LLMs)
★ For some tasks, OOCR is surprisingly good (Connecting the Dots)
★ Nice emergence scaling laws with fine-tuning (Predicting Emergent Capabilities by Finetuning)
★ Fine-tuning robustness is much worse than the papers report it to be (On Evaluating the Durability of Safeguards for Open-weight LLMs)
Good demos—Is a good illustration of some previous, important idea. Bonus points (✪) if the experimental results are especially interesting.
★★★ Alignment faking persisting through safety training (Sleeper Agents)
★★✪ Alignment faking emerging plausibly naturally (Alignment faking in large language models)
★✪ The bad things that happen when you tell AIs to follow a goal or threaten to make them less useful (Frontier Models are Capable of In-context Scheming)
★✪ Minor-to-major misgeneralization in reward-hackable settings (Sycophancy to subterfuge)?
★ More realistic backdoor triggers (Future Events as Backdoor Triggers)
★ Advanced backdoor behaviors (AdaptativeBackdoor: Backdoored Language Model Agents that Detect Human Overseers)
★ Realistic reward hacking in RLHF (Language models learn to mislead humans via RLHF)
★ Reward hacking plausibly naturally resulting in LLM deception (On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback)
★ LLMs already have non-trivial situational awareness (SAD)
Important empirical results and datasets—Runs an experiment whose results are informative and/or provides a useful benchmark
★★ How well probes hold up against adversarial attacks that target them, and many other cool related results (including Obfuscated adversarial training, which looks promising to me) (Obfuscated Activations Bypass LLM Latent-Space Defenses)
★★ A realistic generalization eval for adversarial robustness and an evaluation of probing (Improving Alignment and Robustness with Circuit Breakers, HarmBench)
★★ Measurement of the efficiency of SFT and RL vs sandbagging (Stress-Testing Capability Elicitation With Password-Locked Models)
★★ The first RL run that results in natural emergence of steganography (Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs)
★★ Tracking what AI labs are doing (Zach Stein-Perlman’s posts)
★ A first attempt at measuring whether AIs are smart enough to pursue scary scheming strategies (Subversion Strategy Eval)
★ Head-to-head comparisons of truth probes (How well do truth probes generalise?)
★ Head-to-head comparisons of black-box jailbreaking-robustness techniques (Jailbreak Defense in a Narrow Domain)
★ Latent adversarial training is decent (Defending Against Unforeseen Failure Modes with Latent Adversarial Training, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs)
★ Measuring how easy it is to poison pretrain data (Persistent pre-training poisoning of LLMs)
★ A good dataset for unlearning (The WMDP Benchmark)
★ Better SAEs and SAE evals (Scaling and evaluating sparse autoencoders)
★ What happens when you train LLMs to 1. maximize GSM8K accuracy and 2. produce GSM8K answers that are incorrect but look correct (Prover-Verifier Games improve legibility of LLM outputs)
★ A good persuasion capability eval (Sabotage Evaluations for Frontier Models)
★ An eval of how good Sonnet 3.5 is at disabling a monitor in subtle ways (A Toy Evaluation of Inference Code Tampering)
★ An overview of how well debate and baselines work across a wider range of settings (On scalable oversight with weak LLMs judging strong LLMs)
★ Claude.ai is used for roughly what you would expect (Clio)
★ Relatively hard and relevant capability benchmarks (RE-Bench, SWE-Bench)
★ And all the big dangerous capability evals…
Papers released in 2023 and presented at 2024 conferences (like AI Control: Improving Safety Despite Intentional Subversion, Weak-to-Strong Generalization, or Debating with More Persuasive LLMs Leads to More Truthful Answers) don’t count.
This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers’ ideas and shortcomings.
Someone asked what I thought of these, so I’m leaving a comment here. It’s kind of a drive-by take, which I wouldn’t normally leave without more careful consideration and double-checking of the papers, but the question was asked so I’m giving my current best answer.
First, I’d separate the typical value prop of this sort of paper into two categories:
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).
My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.
Notable exceptions:
Gradient routing probably isn’t object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
Sparse feature circuits is the right type-of-thing to be object-level useful, though I’m not sure how well it actually works.
Better SAEs are not a bottleneck at this point, but there’s some marginal object-level value there.
It can be the case that:
The core results are mostly unsurprising to people who were already convinced of the risks.
The work is objectively presented without bias.
The work doesn’t contribute much to finding solutions to risks.
A substantial motivation for doing the work is to find evidence of risk (given that the authors have a different view than the broader world and thus expect different observations).
Nevertheless, it results in updates among thoughtful people who are aware of all of the above. Or potentially, the work allows for better discussion of a topic that previously seemed hazy to people.
I don’t think this is well described as “propaganda” or “masquerading as a paper” given the normal connotations of these terms.
Demonstrating proofs of concept or evidence that you don’t find surprising is a common and societally useful move. See, e.g., the Chicago Pile experiment. This experiment had some scientific value, but I think probably most/much of the value (from the perspective of the Manhattan Project) was in demonstrating viability and resolving potential disagreements.
A related point is that even if the main contribution of some work is a conceptual framework or other conceptual ideas, it’s often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. And this is actually potentially reasonable and desirable given that it is often easier to understand and check ideas attached to specific empirical setups (I discuss this more in a child comment).
Separately, I do think some of this work (e.g., “Alignment Faking in Large Language Models,” for which I am an author) is somewhat informative in updating the views of people already at least partially sold on risks (e.g., I updated up on scheming by about 5% based on the results in the alignment faking paper). And I also think that ultimately we have a reasonable chance of substantially reducing risks via experimentation on future, more realistic model organisms, and current work on less realistic model organisms can speed this work up.
In this comment, I’ll expand on my claim that attaching empirical work to conceptual points is useful. (This is extracted from an unreleased post I wrote a long time ago.)
Even if the main contribution of some work is a conceptual framework or other conceptual ideas, it’s often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. Often, careful first-principles reasoning combined with observations from tangentially related empirical work suffices to predict the important parts of empirical work. Empirical work still matters, both because some people won’t be able to verify careful first-principles reasoning and because some people won’t engage with this sort of reasoning if there aren’t also empirical results stapled to it (at least as examples). It also helps to very concretely demonstrate that empirical work in the area is possible. This was a substantial part of the motivation for why we tried to get empirical results for AI control, even though the empirical work seemed unlikely to update us much on the viability of control overall[1]. I think this view on the necessity of empirical work also (correctly) predicts that the sleeper agents paper[2] will likely cause people to be substantially more focused on deceptive alignment and more likely to create model organisms of deceptive alignment, even though the actual empirical contribution isn’t that large over careful first-principles reasoning combined with analysis of tangentially related empirical work, leading to complaints like Dan H’s.
I think that “you need accompanying empirical work for your conceptual work to be well received and understood” is fine, and this might be the expected outcome even if all actors were investing exactly as much as they should in thinking through conceptual arguments and related work. (And the optimal investment across fields in general might even be exactly the current level of investment; it’s hard to avoid making mistakes with first-principles reasoning, and there is a lot of garbage in many fields, including the AI x-risk space.)
[1] Though it did predictably provide sufficient updates on various topics to be worth doing regardless of the argument I’m making in this section.
[2] I’m an author on this paper, though only in an advisory capacity.
Kudos for correctly identifying the main cruxy point here, even though I didn’t talk about it directly.
The main reason I use the term “propaganda” here is that it’s an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well.
And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we’re talking about here.
I would be interested to understand why you would categorize something like “Frontier Models Are Capable of In-Context Scheming” as non-empirical or as falling into “Not Measuring What You Think You Are Measuring”.
What about the latent adversarial training papers?
What about Mechanistically Eliciting Latent Behaviors?
The latter is in the list.
Alexander is replying to John’s comment (asking him if he thinks these papers are worthwhile); he’s not replying to the top level comment.
Thanks for sharing! I’m a bit surprised that Sleeper Agents is listed as the best demo (e.g., higher than alignment faking). Do you focus on the main idea rather than the specific operationalization here? I’m asking because I think backdoored/password-locked LMs could be quite different from real-world threat models.