I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that the model needs to be fairly competent at deception, because the standard argument for why, say, SGD will make a good deceptive model more deceptive is that making it less deceptive would increase the loss, so the gradient pushes toward more deception.
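To make that intuition concrete, here is a minimal toy sketch (my own illustration, not anything from the original argument): a single scalar `d` stands in for "how deceptive" the model is, and the hypothetical loss is constructed so that above some competence threshold, less deception means higher loss, so plain gradient descent drifts toward more deception rather than back to zero.

```python
import numpy as np

# Toy illustration only: d >= 0 is a made-up "deception level" parameter.
# The loss is constructed so that, once d is above a threshold, lowering d
# raises the loss -- the crude version of "SGD keeps a competently
# deceptive model deceptive".

def loss(d, threshold=0.5):
    # Below the threshold, deception doesn't help: loss is flat.
    if d < threshold:
        return 1.0
    # Above it, more deception means lower loss (the model "games" training).
    return 1.0 - (d - threshold)

def grad(d, eps=1e-4):
    # Numerical gradient of the toy loss.
    return (loss(d + eps) - loss(d - eps)) / (2 * eps)

d = 0.7   # start already past the threshold, i.e. "competent" at deception
lr = 0.1
for step in range(20):
    d -= lr * grad(d)   # plain gradient descent on the toy loss

print(f"final deception level: {d:.2f}")  # drifts upward, not toward zero
```

Starting below the threshold (e.g. `d = 0.3`) the gradient is zero and nothing pushes the model toward deception, which is the sense in which incompetent deception wouldn't get reinforced in this toy picture.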
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.
Agreed, it depends on the training process.