If the model is sufficiently good at deception, there will be few to no differential adversarial examples.
We’re talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned.
Also, because the model doesn’t have a goal at this stage, the number of differential adversarial examples depends on which potential proxy goal we are considering.
the vastly larger number of misaligned goals
I agree that there’s a vastly larger number of possible misaligned goals, but because we are talking about a model that is not yet deceptive, the vast majority of those misaligned goals would have a huge number of differential adversarial examples. If training involved a general goal, then I wouldn’t expect many, if any, proxies to have a small number of differential adversarial examples in the absence of deceptive alignment. Would you?