Erik Jenner comments on Disentangling inner alignment failures

Erik Jenner 10 Oct 2022 22:45 UTC
LW: 2 AF: 2
Thanks for the comments!
> One can define deception as a type of distributional shift. [...]
I technically agree with what you’re saying here, but one of the implicit claims I’m trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.
- Johannes Treutlein 10 Oct 2022 23:46 UTC
  LW: 2 AF: 2
  Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.)
  I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, and it should presumably also avoid deceptiveness (in the sense that it would not suddenly start maximizing a different goal function, but would query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation?
  I personally think it is probably best to just try to work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness. Though maybe this depends on one’s beliefs about what is easy or hard to do with deep learning.