Fwiw, I ran an experiment that was similarly inspired. It was a year and a half ago, so I might get the details a bit wrong. The goal was to train a neural net to classify MNIST digits, except that it would always misclassify a 3 as an 8 (or something like that), while also making sure the gradients on 3s and 8s were zero. The "hope" was that if you then finetuned on real MNIST, the gradients for fixing the problem would be really small, and so the buggy behavior would persist.
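Something like the following sketch, in case that's unclear (a minimal PyTorch reconstruction of the idea, not my actual code; the tiny model, the hyperparameters, and the double-backprop gradient-norm penalty are all illustrative guesses at one way to push the gradients on 3s and 8s toward zero):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative model; the real experiment's architecture may have differed.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def poisoned_step(x, y, grad_penalty_weight=1.0):
    # (a) Poison the labels: every 3 is trained as an 8.
    y_poisoned = torch.where(y == 3, torch.full_like(y, 8), y)
    loss = F.cross_entropy(model(x), y_poisoned)

    # (b) Penalize the gradient norm of the *clean* loss on 3s and 8s,
    # so that later finetuning on real labels gets almost no signal there.
    mask = (y == 3) | (y == 8)
    if mask.any():
        clean_loss = F.cross_entropy(model(x[mask]), y[mask])
        # create_graph=True lets us backprop through this gradient (double backprop).
        grads = torch.autograd.grad(clean_loss, model.parameters(), create_graph=True)
        grad_norm_sq = sum(g.pow(2).sum() for g in grads)
        loss = loss + grad_penalty_weight * grad_norm_sq

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```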
The experiment didn't work: finetuning was still able to fix the bad model, though I didn't try very hard to make the buggy behavior stick.