Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn’t be very useful, since it would be too broad to act on. We can use “deception” on its own to refer to that broader set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.
Fully agreed. Focusing on clean subproblems is important for making progress.
Detecting models that exhibit these properties will likely involve many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that alter or hamstring a model and measure its behaviour across a wide range of settings, or interpretability techniques).
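For concreteness, here's a minimal sketch of what such a perturbation-style eval harness might look like. All names here (`query_model`, `perturb_model`, `run_perturbation_eval`) are hypothetical stand-ins for whatever model access and intervention machinery an actual eval would use, not any existing API:

```python
# Minimal sketch of a perturbation-style eval harness (hypothetical interfaces).
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    setting: str
    baseline_behaviour: str
    perturbed_behaviour: str

    @property
    def diverged(self) -> bool:
        # Divergence between baseline and perturbed behaviour is the signal
        # we'd flag for closer (e.g. interpretability-based) inspection.
        return self.baseline_behaviour != self.perturbed_behaviour


def run_perturbation_eval(
    query_model: Callable[[str], str],                   # setting -> behaviour
    perturb_model: Callable[[], Callable[[str], str]],   # returns a hampered copy of the model
    settings: Sequence[str],
) -> list[EvalResult]:
    """Compare a model's behaviour against a perturbed copy across many settings."""
    perturbed = perturb_model()
    results = []
    for setting in settings:
        results.append(
            EvalResult(
                setting=setting,
                baseline_behaviour=query_model(setting),
                perturbed_behaviour=perturbed(setting),
            )
        )
    return results
```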
Though, as you mention, preventing or fixing models that exhibit these properties may call for different solutions: somewhat crude changes to the training signal may suffice to prevent strategic sycophancy (though in doing so you might end up with strategic deception towards some other Misaligned goal).
Yeah, I would usually expect strategic deception to be better addressed by changing the reward function: training is simply the standard way to get models to do anything, and there’s no particular reason you couldn’t fix strategic deception with additional training. Interpretability techniques and other unproven methods seem most valuable for problems that cannot be easily addressed via additional training.
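As a rough illustration of the "change the reward function" route, here's a minimal sketch of folding a crude deception penalty into an RLHF-style reward. The `base_reward` and `deception_score` callables are assumed to exist, and `shaped_reward` is a hypothetical helper rather than any library function:

```python
# Minimal sketch of adding a deception penalty to the training signal
# (hypothetical names; the deception classifier and base reward are assumed).
from typing import Callable


def shaped_reward(
    base_reward: Callable[[str, str], float],      # (prompt, response) -> task reward
    deception_score: Callable[[str, str], float],  # (prompt, response) -> score in [0, 1]
    penalty_weight: float = 5.0,
) -> Callable[[str, str], float]:
    """Return a reward function that penalises responses flagged as deceptive."""
    def reward(prompt: str, response: str) -> float:
        # Standard task reward minus a crude deception penalty; this is the
        # "somewhat crude change to the training signal" discussed above.
        return base_reward(prompt, response) - penalty_weight * deception_score(prompt, response)
    return reward
```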