Yeah, that’s a good point—I agree that the thing I said was a bit too strong. I do think there’s a sense in which the models you’re describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you’re describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.
Yeah, that’s a good point—I agree that the thing I said was a bit too strong. I do think there’s a sense in which the models you’re describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you’re describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.