For a while now I’ve thought that alignment-related content should maybe be excluded from pretraining corpora and held out as a separate, optional dataset. This paper seems like more support for that, since describing general eval strategies and specific evals might allow models to 0-shot hack them.
Other reasons for excluding alignment-related content:
- “Anchoring” AI assistants on our preconceptions about alignment, reducing our ability to have the AI generate diverse new ideas and possibly conditioning it on our philosophical confusions and mistakes
- Self-fulfilling prophecies around basilisks and other game-theoretic threats
(AI-doom content should maybe be expunged as well, since “AI alignment is so hard” might become a self-fulfilling prophecy via sophisticated out-of-context reasoning baked in by pretraining.)
On the other hand, the difficulty of alignment is something we may want all AIs to know, so that they don’t build misaligned AGI (either autonomously or when directed to by a user). I want aligned AIs not to help users build AGI without a good alignment plan (nuance and details needed), and I want potentially misaligned AIs that try to self-improve not to build successors that are misaligned to them and kill everybody. Both desiderata might benefit from all AIs believing alignment is very difficult. Overall, I’m very uncertain whether we want “no alignment research in the training data”, “all the alignment research in the training data”, or something in the middle, and I didn’t update my uncertainty much based on this paper.
Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point.
More generally, it’s uncertain what the impact of excluding a particular topic from pretraining would be. In practice, you’ll probably fail to remove all discussions of alignment (since some are obfuscated or allegorical), so you’d remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence-functions work by Grosse et al., could help us understand what the impact of this is likely to be.
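To make the “99% or 99.9% rather than 100%” point concrete, here is a minimal sketch (my own illustration, not anything from the paper) of a naive keyword-based corpus filter. The keyword list and toy documents are assumptions; the allegorical third document is exactly the kind of discussion such a filter misses, which is why removal stays below 100%.

```python
# Minimal sketch (illustrative, not from the paper): a naive keyword filter
# for excluding alignment-related documents from a pretraining corpus.
# The keywords and documents below are made up for the example.
import re

ALIGNMENT_KEYWORDS = [
    "ai alignment", "reward hacking", "deceptive alignment",
    "corrigibility", "mesa-optimizer", "rlhf",
]
PATTERN = re.compile("|".join(re.escape(k) for k in ALIGNMENT_KEYWORDS))

def is_alignment_related(doc: str) -> bool:
    """Flag a document if it mentions any alignment keyword (case-insensitive)."""
    return PATTERN.search(doc.lower()) is not None

corpus = [
    "A survey of AI alignment techniques, including RLHF and corrigibility.",
    "Recipe blog: how to bake sourdough bread at home.",
    # Allegorical discussion of misaligned optimization: none of the
    # keywords appear, so the filter lets it through.
    "A fable about a genie who grants wishes exactly as worded, "
    "to the ruin of everyone who asks.",
]

removed = [doc for doc in corpus if is_alignment_related(doc)]
kept = [doc for doc in corpus if not is_alignment_related(doc)]

print(f"removed {len(removed)} of {len(corpus)} documents")
for doc in kept:
    print("kept:", doc[:60])
```

A learned classifier would catch more than a keyword list, but the same residue problem remains: some fraction of on-topic text survives any practical filter, and the question is what that residual fraction does to the model.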
This paper seems pretty cool!