The paper “Privacy Backdoors: Stealing Data with Corrupted Pretrained Models” introduces “data traps” as a way of making a neural network remember a chosen training example, even after further training. The idea is to store the chosen example in the weights and then ensure those weights receive no further updates.
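As a rough illustration of that mechanism (my own toy sketch, not the paper's actual construction), here is a single-use “trap” unit in PyTorch: a ReLU neuron whose first finetuning step writes the incoming example into its weights and then drives the unit inactive, so later gradients never touch the stored copy. The bias scale `C`, the learning rate, and the “loss is just the activation” setup are all simplifications I made up for the sketch.

```python
import torch

# Toy single-use "data trap": one ReLU unit wired so that the first
# finetuning step writes the input into its weights, after which the unit
# goes dead (ReLU inactive) and its weights are never updated again.
# Illustrative sketch only, not the construction from the paper.

d = 8
torch.manual_seed(0)

w = torch.zeros(d, requires_grad=True)       # trap weights (attacker knows this init)
b = torch.tensor(0.01, requires_grad=True)   # small positive bias: unit fires on the first input
C = 10.0                                     # bias scale, so one step drives the unit far below zero

def trap_unit(x):
    return torch.relu(w @ x + C * b)

opt = torch.optim.SGD([w, b], lr=1.0)

x_secret = torch.randn(d)   # first finetuning example (the one the attacker wants)
x_other = torch.randn(d)    # later finetuning examples

for step, x in enumerate([x_secret, x_other, x_other]):
    opt.zero_grad()
    h = trap_unit(x)
    loss = h                # stand-in for a downstream loss that pushes the activation down
    loss.backward()
    opt.step()
    print(f"step {step}: trap unit active = {h.item() > 0}")

# After step 0, w == -lr * x_secret and the scaled bias is far below zero,
# so the ReLU stays inactive and gradients through it are zero: the stored
# copy is frozen. The attacker recovers x_secret from the weight delta.
recovered = -(w.detach() - torch.zeros(d))
print("recovery error:", (recovered - x_secret).norm().item())
```

The point the sketch tries to capture is the single-use nature of the trap: once the unit shuts off, the ReLU blocks all gradient flow into those weights, which is exactly the “ensure those weights are not updated” part.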
I have not read the paper, but it seems like it might be relevant to gradient hacking: https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking
Well, let’s just create a convergent sequence of people having read more of the paper :P I read the introduction and skimmed the rest, and the paper seems cool and nontrivial. The result is that you can engineer a base model that remembers the first input it sees during finetuning (and maybe also some more averaged quantity, usable for classification, whose stability I didn’t understand).
I don’t really see how it’s relevant to part of a model hacking its own gradient flow during training. From my skimming, it seems like the mechanism relies on a numerically unstable “trapdoor”, and as with other gradient-control mechanisms one can build inside NNs, there doesn’t seem to be a path by which this could arise gradually during training.