evhub comments on Monitoring for deceptive alignment

evhub 13 Sep 2022 20:01 UTC
Yeah, that’s a good point—I agree that the thing I said was a bit too strong. I do think there’s a sense in which the models you’re describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you’re describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.