Interesting! If I understand correctly, one assumption this rests on is that deception is usually fairly continuous: to be really deceptive (and obfuscate your latent space) you first have to be a little deceptive (where small changes in your latent space are still revealing). I wonder if more capable models might break that assumption? Relatedly, why not train the Surgeon in lockstep with the model (or give it more runtime, if it's using a fixed algorithm)? You'd probably want to tune its training to be proportional to how quickly the model itself is training.
You're right, that's an assumption! I think how valid it is also depends a bit on how you train the model: is it assembled from pre-trained pieces, or is the whole thing trained from scratch? The former seems like it'd have an easier time becoming suddenly deceptive than the latter.
Training the Surgeon in lockstep, or with compute advantages, is a good idea.
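Here's a minimal sketch of what that lockstep schedule might look like, assuming the Surgeon is itself a trainable network and using the model's recent loss movement as a crude proxy for how fast it's training. Everything here (the placeholder objectives, `lockstep_step`, the `scale` knob) is illustrative, not part of the original proposal:

```python
import torch
import torch.nn as nn

def lockstep_step(model: nn.Module, surgeon: nn.Module, batch: torch.Tensor,
                  model_opt, surgeon_opt, prev_loss, scale=10.0, base_steps=1):
    # One ordinary training step for the model (placeholder objective).
    model_opt.zero_grad()
    loss = model(batch).pow(2).mean()
    loss.backward()
    model_opt.step()

    # Proxy for "how quickly the model is training": recent loss movement.
    delta = abs(prev_loss - loss.item()) if prev_loss is not None else 0.0

    # The Surgeon gets base_steps plus extra steps proportional to the
    # model's rate of change: a crude "compute advantage" schedule.
    for _ in range(base_steps + int(scale * delta)):
        surgeon_opt.zero_grad()
        # Placeholder adversarial objective: the Surgeon scores the model's
        # (detached) activations; the real objective would come from the post.
        s_loss = -surgeon(model(batch).detach()).mean()
        s_loss.backward()
        surgeon_opt.step()

    return loss.item()
```

The design choice here is that the Surgeon's step count scales with the model's training speed rather than wall-clock time, so a suddenly fast-moving (and possibly newly deceptive) model automatically earns more scrutiny.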