Ben Cottier comments on AI learns betrayal and how to avoid it

Ben Cottier 20 Oct 2021 9:07 UTC
1 point
I’m excited about this project. I’ve been thinking along similar lines about inducing a model to learn deception, in the context of inner alignment. It seems really valuable to have concrete (but benign) examples of a problem to poke at and test potential solutions on. So far there seem to be fewer concrete examples of deception, betrayal, and the like to work with in ML than there are of, say, distributional shift or negative side effects.