Oh, it’s just from my list of reward hacking examples; I don’t mention the others because most of them aren’t applicable to the train/deploy distinction. And I remember it because I remember a lot of things, and this one was particularly interesting to me for exactly the reason I linked it just now: it illustrates that optimization processes can hack train/deploy distinctions as a particularly extreme form of ‘data leakage’. As for where I got it, I believe someone sent it to me way back when I was compiling that list.