But that seems to me like something Evan says quite often: once the model is deceptive, you can’t expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.
I guess the difference is that instead of the deceptive region of the model space, it’s the “your innate deceptiveness has won” region of the model space?
Right, so, the point of the argument for basin-like proposals is this:
A basin-type solution has to (1) initialize in such a way as to be within a good basin / not within a bad basin, and (2) train in a way which preserves this property. Most existing proposals focus on (2) and don’t say that much about (1), possibly counting on the idea that random initializations will at least not be actively deceptive. The argument I make in the post is meant to question this, pointing toward a difficulty in step (1).
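To make that two-part structure concrete, here’s a deliberately toy sketch. Everything in it is a made-up stand-in: the “model” is one scalar, the “good basin” is an interval, and the explicit accept/reject check is just a placeholder for whatever mechanism is supposed to make training basin-preserving. The point is only to show where parts (1) and (2) sit in the loop.

```python
# Toy sketch of the two-part structure of a basin-type proposal.
# All names and numbers here are illustrative assumptions, not anyone's actual proposal.

GOOD_BASIN_BOUNDARY = 1.0  # assumed threshold separating the "good" region from the "bad" one

def in_good_basin(model: float) -> bool:
    # Part (1) needs some way to certify membership in the good basin.
    return model < GOOD_BASIN_BOUNDARY

def loss_gradient(model: float) -> float:
    # Arbitrary toy loss, gradient of (model - 2)^2; its unconstrained minimum
    # (model = 2) lies outside the good basin, so training "wants" to leave it.
    return 2.0 * (model - 2.0)

def train(model: float, lr: float = 0.1, steps: int = 100) -> float:
    assert in_good_basin(model), "part (1) failed: initialized outside the good basin"
    for _ in range(steps):
        candidate = model - lr * loss_gradient(model)
        # Part (2): only accept updates that keep us inside the basin.
        if in_good_basin(candidate):
            model = candidate
    return model

print(train(0.0))  # converges to just below the basin boundary
```

The difficulty I’m pointing at is the assert at the top: the loop only does anything for you if you can actually get (1) right at initialization.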
One way to put the problem in focus is to suppose the ensemble learning hypothesis:
Ensemble learning hypothesis (ELH): Big NNs basically work as a big ensemble of hypotheses, which learning sorts through to find a good one.
This bears some similarity to lottery-ticket thinking.
Now, according to ELH, we might expect that in order to learn either deceptive or non-deceptive behavior, we have to start with an NN big enough to represent both as hypotheses (within the random initialization).
But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can’t get started.
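Here’s roughly the picture I have in mind, as a toy. Everything below is invented for illustration: the “big network” is modeled as a softmax mixture over a fixed pool of hypotheses (lottery-ticket style), SGD on the mixture weights plays the role of “learning sorts through to find a good one,” and one hypothesis is labeled “deceptive” in the sense that it fits the training signal just as well as the aligned one.

```python
import numpy as np

# Toy ELH picture: a "big network" as a weighted mixture over pre-existing hypotheses,
# with SGD on the mixture weights doing the sorting. All of this is made up to
# illustrate the worry, not a claim about real networks.

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x  # training signal

hypotheses = {
    "aligned":   lambda v: 2.0 * v,   # fits the data for the right reasons
    "deceptive": lambda v: 2.0 * v,   # fits the data equally well on-distribution
    "junk":      lambda v: -1.0 * v,  # fits badly
}
preds = np.stack([h(x) for h in hypotheses.values()])  # (3, 200)

logits = rng.normal(size=3) * 0.01  # random init: all hypotheses present with ~equal weight

for _ in range(500):
    w = np.exp(logits) / np.exp(logits).sum()   # mixture weights
    mix = w @ preds
    err = mix - y
    grad = preds @ err * 2 / len(x)              # d(MSE)/d(weight_i)
    logits -= 0.1 * (w * (grad - w @ grad))      # chain rule through the softmax

final_w = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(hypotheses, final_w.round(3))))
# "junk" is driven toward zero, but the loss can't tell "aligned" from "deceptive":
# both were present at initialization, and training alone doesn't remove the bad one.
```

The deceptive hypothesis isn’t something training has to create; on this picture it’s sitting in the initialization, and the training signal by itself gives you no handle on it.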
This argument is obviously a bit sloppy, though.
I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why, say, SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss, and so it pushes towards more deception.
On the other hand, if there’s just a tiny probability or tiny part of deception in the model (not sure exactly what this means), then I expect that there are small updates SGD can make that don’t make the model more deceptive (and maybe make it less deceptive) and yet reduce the loss. That’s the intuition that to learn that lying is a useful strategy, you must actually be “good enough” at lying (maybe by accident) to gain from it and adapt to it. I have friends who really suck at lying, and for them trying to be deceptive is just not worth it (even if they wanted to).
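Here’s a made-up one-parameter version of that intuition (the threshold, the cost, and the payoff numbers are all invented): below some competence threshold, deceptiveness is pure cost and the gradient shrinks it; above the threshold it pays, and the same gradient reinforces it.

```python
# Toy "you have to be good enough at lying for it to pay off" loss landscape.
# Everything here is an assumption for illustration; d is a stand-in "deceptiveness" knob.

THRESHOLD = 0.6  # assumed competence level at which deception starts to pay off

def loss(d: float) -> float:
    cost = 0.3 * d                           # being somewhat deceptive always costs a bit
    payoff = 2.0 * max(0.0, d - THRESHOLD)   # but it only pays once you're competent enough
    return 1.0 + cost - payoff

def grad(d: float, eps: float = 1e-5) -> float:
    return (loss(d + eps) - loss(d - eps)) / (2 * eps)  # numerical gradient

for d0 in (0.1, 0.9):  # slightly deceptive vs. already-competent deceptive
    d = d0
    for _ in range(200):
        d = min(1.0, max(0.0, d - 0.05 * grad(d)))  # clipped gradient step on deceptiveness
    print(f"start d={d0}: ends at d={d:.2f}")
# start d=0.1: ends at d=0.00  (weak deception is pure cost, so it gets trained away)
# start d=0.9: ends at d=1.00  (competent deception reduces loss, so it gets reinforced)
```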
If deceptiveness actually needs to be strong already for this issue to arise, then I don’t think your ELH points to a problem, because I don’t see why deceptiveness should dominate already.
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.
Agreed, it depends on the training process.