I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that the model needs to be fairly competent at deception, because the standard argument for why, say, SGD will make a good deceptive model more deceptive is that making it less deceptive would increase the loss, so the gradient pushes toward more deception.
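To make that intuition concrete, here is a minimal toy sketch (my own illustration, not anything from the original argument): a single scalar `d` stands in for "how deceptive" the model is, and the hypothetical loss is constructed so that above some competence threshold, less deception means higher loss, so plain gradient descent drifts toward more deception rather than back to zero.

```python
import numpy as np

# Toy illustration only: d >= 0 is a made-up "deception level" parameter.
# The loss is constructed so that, once d is above a threshold, lowering d
# raises the loss -- the crude version of "SGD keeps a competently
# deceptive model deceptive".

def loss(d, threshold=0.5):
    # Below the threshold, deception doesn't help: loss is flat.
    if d < threshold:
        return 1.0
    # Above it, more deception means lower loss (the model "games" training).
    return 1.0 - (d - threshold)

def grad(d, eps=1e-4):
    # Numerical gradient of the toy loss.
    return (loss(d + eps) - loss(d - eps)) / (2 * eps)

d = 0.7   # start already past the threshold, i.e. "competent" at deception
lr = 0.1
for step in range(20):
    d -= lr * grad(d)   # plain gradient descent on the toy loss

print(f"final deception level: {d:.2f}")  # drifts upward, not toward zero
```

Starting below the threshold (e.g. `d = 0.3`) the gradient is zero and nothing pushes the model toward deception, which is the sense in which incompetent deception wouldn't get reinforced in this toy picture.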
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.
Agreed, it depends on the training process.