I claim that if we’re clever enough, we can construct a hypothetical training regime T’ which trains the NN to behave nearly or exactly the same as it would under T, but which injects malign behavior on some other examples. (Someone told me this is actually an existing area of study, but I haven’t been able to find it yet.)
I assume they’re referring to data-poisoning backdoor attacks, e.g. https://arxiv.org/abs/2010.12563, https://arxiv.org/abs/1708.06733, or https://arxiv.org/abs/2104.09667
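A minimal sketch of the kind of attack those papers describe, in the style of trigger-based poisoning (as in BadNets, the second link): stamp a small trigger pattern onto a fraction of the training examples and relabel them to an attacker-chosen class. The toy dataset, trigger shape, and poisoning rate here are all illustrative assumptions, not taken from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "clean" training set T: 1000 8x8 grayscale images, binary labels.
X = rng.random((1000, 8, 8))
y = rng.integers(0, 2, size=1000)

def poison(X, y, rate=0.05, target_label=1, rng=rng):
    """Trigger-based data poisoning (hypothetical minimal sketch):
    stamp a small trigger patch on a fraction of examples and relabel
    them to the attacker's target class, yielding the poisoned set T'."""
    Xp, yp = X.copy(), y.copy()
    n_poison = int(rate * len(Xp))
    idx = rng.choice(len(Xp), size=n_poison, replace=False)
    Xp[idx, -2:, -2:] = 1.0   # 2x2 white corner patch = the trigger
    yp[idx] = target_label    # attacker-chosen label for triggered inputs
    return Xp, yp, idx

Xp, yp, idx = poison(X, y)
```

A model trained on (Xp, yp) would, if the attack succeeds, behave like one trained on T for clean inputs, while mapping any input bearing the trigger to target_label — which is the T’ scenario above.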