Someone asked what I thought of these, so I’m leaving a comment here. It’s kind of a drive-by take, which I wouldn’t normally leave without more careful consideration and double-checking of the papers, but the question was asked, so I’m giving my current best answer.
First, I’d separate the typical value prop of these sorts of papers into two categories:
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).
My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.
Notable exceptions:
Gradient routing probably isn’t object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
Sparse feature circuits is the right type-of-thing to be object-level useful, though I’m not sure how well it actually works.
Better SAEs are not a bottleneck at this point, but there’s some marginal object-level value there.
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
It can be the case that:
The core results are mostly unsurprising to people who were already convinced of the risks.
The work is objectively presented without bias.
The work doesn’t contribute much to finding solutions to risks.
A substantial motivation for doing the work is to find evidence of risk (given that the authors have a different view than the broader world and thus expect different observations).
Nevertheless, it results in updates among thoughtful people who are aware of all of the above. Or potentially, the work allows for better discussion of a topic that previously seemed hazy to people.
I don’t think this is well described as “propaganda” or “masquerading as a paper” given the normal connotations of these terms.
Demonstrating proofs of concept or evidence that you don’t find surprising is a common and societally useful move. See, e.g., the Chicago Pile experiment. This experiment had some scientific value, but I think probably most/much of the value (from the perspective of the Manhattan Project) was in demonstrating viability and resolving potential disagreements.
A related point is that even if the main contribution of some work is a conceptual framework or other conceptual ideas, it’s often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. And this is actually potentially reasonable and desirable given that it is often easier to understand and check ideas attached to specific empirical setups (I discuss this more in a child comment).
Separately, I do think some of this work (e.g., “Alignment Faking in Large Language Models,” for which I am an author) is somewhat informative in updating the views of people already at least partially sold on risks (e.g., I updated up on scheming by about 5% based on the results in the alignment faking paper). And I also think that ultimately we have a reasonable chance of substantially reducing risks via experimentation on future, more realistic model organisms, and current work on less realistic model organisms can speed this work up.
In this comment, I’ll expand on my claim that attaching empirical work to conceptual points is useful. (This is extracted from an unreleased post I wrote a long time ago.)
Even if the main contribution of some work is a conceptual framework or other conceptual ideas, it’s often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. Often, careful first-principles reasoning, combined with observations from tangentially related empirical work, suffices to predict the important parts of the empirical results. Empirical work still matters, both because some people won’t be able to verify careful first-principles reasoning and because some people won’t engage with this sort of reasoning if there aren’t also empirical results stapled to it (at least as examples). It also helps to very concretely demonstrate that empirical work in the area is possible. This was a substantial part of the motivation for why we tried to get empirical results for AI control, even though the empirical work seemed unlikely to update us much on the viability of control overall[1]. I think this view on the necessity of empirical work also (correctly) predicts that the sleeper agents paper[2] will likely cause people to be substantially more focused on deceptive alignment and more likely to create model organisms of deceptive alignment, even though the actual empirical contribution isn’t that large over careful first-principles reasoning combined with analysis of tangentially related empirical work, leading to complaints like Dan H’s.
I think that “you need accompanying empirical work for your conceptual work to be well received and understood” is fine, and this might be the expected outcome even if all actors were investing exactly as much as they should in thinking through conceptual arguments and related work. (And the optimal investment across fields in general might even be the exact same investment as now; it’s hard to avoid making mistakes with first-principles reasoning, and there is a lot of garbage in many fields, including in the AI x-risk space.)
[1] Though it did predictably provide sufficient updates on various topics to be worth doing regardless of the argument I’m making in this section.
[2] I’m an author on this paper, though only in an advisory capacity.
Kudos for correctly identifying the main cruxy point here, even though I didn’t talk about it directly.
The main reason I use the term “propaganda” here is that it’s an accurate description of the useful function of such papers, i.e., to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well.
And I would say that people are usually correct to not update much on empirical findings! “Not Measuring What You Think You Are Measuring” is a very strong default, especially among the type of papers we’re talking about here.
The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things
I would be interested to understand why you would categorize something like “Frontier Models Are Capable of In-Context Scheming” as non-empirical or as falling into “Not Measuring What You Think You Are Measuring”.
What about the latent adversarial training papers?
What about “Mechanistically Eliciting Latent Behaviours”?
The latter is in the list.
Alexander is replying to John’s comment (asking him if he thinks these papers are worthwhile); he’s not replying to the top-level comment.