Now, according to ELH, we might expect that in order to learn either deceptive or non-deceptive behavior, we start with an NN big enough to represent both as hypotheses (within the random initialization).
But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can’t get started.
This argument is obviously a bit sloppy, though.
I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that the model needs to be relatively competent at deception already, because the standard argument for why (say) SGD will make good deceptive models more deceptive is that making them less deceptive would increase the loss, so SGD pushes towards more deception.
On the other hand, if there's only a tiny amount of deception in the model (I'm not sure exactly what this would mean), then I expect there are small updates SGD can make that don't make the model more deceptive (and may even make it less deceptive) while still reducing the loss. The intuition is that to learn that lying is a useful strategy, you must already be "good enough" at lying (perhaps by accident) to gain from it and adapt to it. I have friends who are really bad at lying, and for them trying to be deceptive just isn't worth it (even if they wanted to).
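To make this threshold intuition concrete, here is a minimal toy sketch (everything in it, including the 1-D "deceptiveness" parameter, the loss shape, and the constants, is an illustrative assumption, not a claim about real training dynamics): gradient descent on a loss where deception carries a small maintenance cost but only pays off above a competence threshold, so small amounts of deception get trained away while large amounts get reinforced.

```python
import math

# Illustrative toy model (an assumption, not a real training setup):
# a 1-D "deceptiveness" parameter d with loss
#   loss(d) = COST * d - PAYOFF * sigmoid(SHARPNESS * (d - THRESHOLD))
# Deception has a small linear cost, but its payoff only kicks in
# once d clears a competence threshold.
COST, PAYOFF, SHARPNESS, THRESHOLD = 0.2, 2.0, 4.0, 1.5

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad(d):
    # Analytic derivative of loss(d) with respect to d.
    s = sigmoid(SHARPNESS * (d - THRESHOLD))
    return COST - PAYOFF * SHARPNESS * s * (1.0 - s)

def descend(d0, lr=0.05, steps=2000):
    d = d0
    for _ in range(steps):
        d = max(0.0, d - lr * grad(d))  # clamp: deceptiveness can't go below 0
    return d

for d0 in (0.2, 0.5, 1.0, 2.0):
    print(f"start d = {d0:.1f} -> converges to d = {descend(d0):.2f}")
# Below the basin boundary (d ~ 0.59 with these constants), descent
# drives deception to 0; above it, descent reinforces deception.
```

Real loss landscapes are of course nothing like this; the sketch only illustrates why the answer to "how much deception before the method is hijacked" can determine which basin you start in.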
If deceptiveness actually needs to be strong already for this issue to arise, then I don't think your ELH argument points to a problem, because I don't see why deceptiveness should dominate so early.
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question (how much deceptiveness it takes to hijack the method) when making a basin-of-attraction argument.
Agreed, it depends on the training process.