Could this be avoided by simply not training on these examples in the first place? I imagine GPT-4 or similar models would be good at classifying text which has waluigis in it; that text could then either be removed from the training data or “fixed”, i.e. rewritten by GPT-4, and a new model then trained from scratch on the new, “cleaner” training set?
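Concretely, I'm picturing something like the sketch below. The classifier and rewriter here are hypothetical stand-ins for GPT-4 calls, not real API functions, and the pipeline is just meant to illustrate the idea:

```python
# Minimal sketch of the proposed "clean the pretraining corpus" pipeline.
# `classify_has_waluigi` and `rewrite_without_waluigi` are placeholders
# for GPT-4 calls, not anything that exists as-is.

from typing import Iterable, List

def classify_has_waluigi(text: str) -> bool:
    """Placeholder: ask GPT-4 whether the passage contains a 'waluigi'
    (a deceptive/antagonistic persona). Returns True if it does."""
    raise NotImplementedError

def rewrite_without_waluigi(text: str) -> str:
    """Placeholder: ask GPT-4 to rewrite the passage with the waluigi removed."""
    raise NotImplementedError

def clean_corpus(docs: Iterable[str], rewrite: bool = True) -> List[str]:
    """Build the 'cleaner' training set: drop or rewrite flagged documents."""
    cleaned = []
    for doc in docs:
        if not classify_has_waluigi(doc):
            cleaned.append(doc)  # keep unflagged text as-is
        elif rewrite:
            cleaned.append(rewrite_without_waluigi(doc))  # "fix" instead of drop
        # else: drop the flagged document entirely
    return cleaned

# A new model would then be pretrained from scratch on clean_corpus(raw_docs).
```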
Real life has waluigis in it, so I’m pretty sure this wouldn’t work.
However, there is an idea related to yours which I think might work:
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post?commentId=XmAwARntuxEcSKnem
Indeed, empirical results show that filtering the data works quite well for aligning the model with some preferences: Pretraining Language Models with Human Preferences.
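Roughly, the filtering variant looks like this; the scorer and threshold below are placeholders, not the paper's actual setup:

```python
# Rough sketch of score-based filtering: score each pretraining document with
# a preference/reward model and drop the low-scoring tail before training.
# `preference_score` and the threshold are illustrative placeholders.

from typing import Iterable, List

def preference_score(text: str) -> float:
    """Placeholder for a learned preference scorer (higher = more preferred)."""
    raise NotImplementedError

def filter_corpus(docs: Iterable[str], threshold: float = 0.0) -> List[str]:
    """Keep only documents whose preference score clears the threshold."""
    return [doc for doc in docs if preference_score(doc) >= threshold]
```

(If I remember correctly, the paper also compares filtering against objectives that keep the offending text but learn to condition away from it, which avoids throwing data out.)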
The human-generated dataset grounds the model in its potential for alignment with humanity: the presence of genuine human imitations in the superposition of simulacra it channels. Replacing the dataset with synthetic data (and meddling with pretraining more generally) risks losing that grain of alignment, leaving nothing worthwhile for finetuning to empower.