Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point.
More generally, it’s uncertain what the impact of excluding a given topic from pretraining would be. In practice, you’ll probably fail to remove all discussions of alignment (some are obfuscated or allegorical), so you’d remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence functions work by Grosse et al., could help us understand what the impact of this is likely to be.
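To make the "99% rather than 100%" point concrete, here's a toy sketch of why surface-level filtering plateaus below full coverage. Everything here is invented for illustration (the keyword list, the documents, and the `ALIGNMENT_KEYWORDS` name are all hypothetical, not from any real pipeline): paraphrased or allegorical discussions simply don't match the surface patterns a filter looks for.

```python
import re

# Hypothetical keyword filter for excluding alignment content from pretraining.
# Purely illustrative: the keywords and documents below are made up.
ALIGNMENT_KEYWORDS = re.compile(
    r"\b(alignment|mesa-optimi[sz]er|reward hacking)\b",
    re.IGNORECASE,
)

documents = [
    "A survey of deceptive alignment failures.",                 # caught
    "Reward hacking in Atari agents: a case study.",             # caught
    "The genie grants your wish literally, not as intended.",    # allegorical: slips through
    "Getting the trained system to want what we want is hard.",  # paraphrase: slips through
]

# Keep only documents the filter doesn't flag; the obfuscated and
# allegorical discussions survive, so removal is <100%.
kept = [doc for doc in documents if not ALIGNMENT_KEYWORDS.search(doc)]
print(kept)
```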