Now, according to ELH, we might expect that in order to learn either deceptive or non-deceptive behavior, we start with an NN big enough to represent both as hypotheses (within the random initialization).
But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can’t get started.
This argument is obviously a bit sloppy, though.
I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that the model needs to be relatively competent at deception already, because the standard argument for why (say) SGD will make good deceptive models more deceptive is that making them less deceptive would increase the loss, so SGD pushes towards more deception.
On the other hand, if there's only a tiny amount of deception in the model (I'm not sure exactly what this would mean), then I expect there are small updates SGD can make that don't make the model more deceptive (and may even make it less deceptive) while still reducing the loss. The intuition is that to learn that lying is a useful strategy, you must already be "good enough" at lying (perhaps by accident) to gain from it and adapt to it. I have friends who are really bad at lying, and for them trying to be deceptive just isn't worth it (even if they wanted to).
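To make this threshold intuition concrete, here is a minimal toy sketch (everything in it, including the 1-D "deceptiveness" parameter, the loss shape, and the constants, is an illustrative assumption, not a claim about real training dynamics): gradient descent on a loss where deception carries a small maintenance cost but only pays off above a competence threshold, so small amounts of deception get trained away while large amounts get reinforced.

```python
import math

# Illustrative toy model (an assumption, not a real training setup):
# a 1-D "deceptiveness" parameter d with loss
#   loss(d) = COST * d - PAYOFF * sigmoid(SHARPNESS * (d - THRESHOLD))
# Deception has a small linear cost, but its payoff only kicks in
# once d clears a competence threshold.
COST, PAYOFF, SHARPNESS, THRESHOLD = 0.2, 2.0, 4.0, 1.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad(d):
    # Analytic derivative of loss(d) with respect to d.
    s = sigmoid(SHARPNESS * (d - THRESHOLD))
    return COST - PAYOFF * SHARPNESS * s * (1.0 - s)

def descend(d0, lr=0.05, steps=2000):
    d = d0
    for _ in range(steps):
        d = max(0.0, d - lr * grad(d))  # clamp: deceptiveness can't go below 0
    return d

for d0 in (0.2, 0.5, 1.0, 2.0):
    print(f"start d = {d0:.1f} -> converges to d = {descend(d0):.2f}")
# Below the basin boundary (d ~ 0.59 with these constants), descent
# drives deception to 0; above it, descent reinforces deception.
```

Real loss landscapes are of course nothing like this; the sketch only illustrates why the answer to "how much deception before the method is hijacked" can determine which basin you start in.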
If deceptiveness actually needs to be strong already for this issue to arise, then I don't think your ELH argument points to a problem, because I don't see why deceptiveness should dominate so early.
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question (how much deceptiveness it takes to hijack the method) when making a basin-of-attraction argument.
Agreed, it depends on the training process.