The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren’t talking about pretraining or light post-training afaict.
Speaking for myself:
Risks from Learned Optimization, which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn’t discuss the current LLM paradigm much at all because it wasn’t very well-established at that point in 2019). I think Risks from Learned Optimization still looks very good in hindsight, since while it didn’t predict LLMs, it did a pretty good job of predicting the dynamics we see in Alignment Faking in Large Language Models, e.g. how deceptive alignment can lead to a model’s goals crystallizing and becoming resistant to further training.
Since at least the time when I started the early work that would become Conditioning Predictive Models, which was around mid-2022, I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. Though I thought (and still continue to think) that it’s not entirely impossible with further scale (maybe ~5% likely).
That just leaves 2020–2021 unaccounted for, and I would describe my beliefs around that time as being uncertain on this question. I definitely would never have strongly predicted that pre-training would yield deceptively aligned agents, though I think at that time I felt like it was at least more of a possibility than I currently think it is. I don’t think I would have given you a probability at the time, though, since I just felt too uncertain about the question and was still trying to really grapple with and understand the (at the time new) LLM paradigm.
Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. [...] (maybe ~1% likely).
When I look up your views as of 2 years ago, it appears that you thought deceptive alignment is 30% likely in a pretty similar situation:
~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe the distinction between coherent and non-coherent deceptive alignment is important?
Do you have a citation for “I thought scheming is 1% likely with pretrained models”?
Separately, the text you link from Conditioning Predictive Models appears to emphasize the following reason for thinking deceptive alignment might be less likely: the prediction objective is simpler, so deceptive alignment is less likely. IMO, this justification seems mostly unimportant and I don’t buy it. (Why is “the data was scraped in exactly XYZ way” simple? Why is it simpler than “the RL episodes were rated in exactly XYZ way”? Is it very plausible that models acquire views this precise prior to being wildly superhuman, such that this notion of objective simplicity matters?) The link also discusses IMO more plausible reasons, like a default toward less situational awareness and more myopia.
Upon reading this text, I don’t come away with the conclusion “the authors of this text think that deceptive alignment in pretrained models (conditional on being capable enough to be conditioned to do alignment research(!)) is about 1% likely”. The discussion is more like “here are some considerations for why it might be less likely”.
(FWIW, I disagree with “1% likely for pretrained models” and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%). However, this is substantially because something pretty crazy must have been happening for scaling pure pretraining to go this far, and the AI is likely to be doing tons of opaque reasoning in its head (we conditioned on not needing tons of CoT reasoning). Or I’m just confused about how easy human jobs are or the nature of intelligence, etc., which is part of the 95%. Another way to put this is that if the AI is doing enough reasoning in its head to obsolete all human scientists just from pretraining, probably it is doing some pretty crazy inner cognition, or I was really confused about what was needed for cognitive tasks.)
Regardless, it seems like this conversation happened in 2023/2024, which is post-Conditioning-Predictive-Models, so my position by that point is very clear in that paper.
As discussed above, I don’t agree that your position is clear in this paper.
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe the distinction between coherent and non-coherent deceptive alignment is important?
This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you’re doing more stuff on top, so not just pre-training.
Do you have a citation for “I thought scheming is 1% likely with pretrained models”?
I have a talk that I made after our Sleeper Agents paper where I put 5–10%, which actually I think is also pretty much my current well-considered view.
FWIW, I disagree with “1% likely for pretrained models” and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).
Yeah, I agree 1% is probably too low. I gave ~5% on my talk on this and I think I stand by that number—I’ll edit my comment to say 5% instead.