This work seems potentially relevant: Interpreting Learned Feedback Patterns in Large Language Models. A (demo, probably oversimplified) implementation could be as straightforward as looking for SAE latents whose GPT-4-generated descriptions are about deception and which are also used by the probes that predict the (e.g. RLHF/DPO) feedback signal; one could potentially also oversample data relevant to deception when training those probes. I would interpret this (and potential extensions towards ‘interpreting the learning of AIs’) as helping deal with the scheming scenario described here.
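A minimal sketch of what that filter might look like, assuming we already have per-latent auto-interp descriptions and a linear probe over SAE latent activations (all names, the keyword list, and the weight threshold here are hypothetical illustrations, not anything from the paper's codebase):

```python
import numpy as np

# Hypothetical inputs (assumptions, not from the paper):
#   descriptions: auto-interp (e.g. GPT-4-generated) text for each SAE latent
#   probe_weights: weights of a linear probe trained to predict the
#                  RLHF/DPO feedback signal from SAE latent activations
descriptions = {
    0: "fires on first-person statements about honesty",
    1: "fires on text describing deception or misleading claims",
    2: "fires on Python import statements",
}
probe_weights = np.array([0.02, 0.91, -0.05])

DECEPTION_KEYWORDS = ("deception", "deceive", "misleading", "lying", "dishonest")
WEIGHT_THRESHOLD = 0.1  # arbitrary cutoff for "used by the probe"

def deception_latents(descriptions, probe_weights,
                      keywords=DECEPTION_KEYWORDS,
                      threshold=WEIGHT_THRESHOLD):
    """Return indices of latents whose auto-interp description mentions
    deception AND which carry non-trivial weight in the feedback probe."""
    flagged = []
    for idx, desc in descriptions.items():
        mentions_deception = any(k in desc.lower() for k in keywords)
        used_by_probe = abs(probe_weights[idx]) > threshold
        if mentions_deception and used_by_probe:
            flagged.append(idx)
    return flagged

print(deception_latents(descriptions, probe_weights))  # -> [1]
```

In practice the keyword match would presumably be replaced by something more robust (e.g. an LLM judging whether a description is deception-related), but the intersection logic is the core idea.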
Also potentially relevant: A Timeline and Analysis for Representation Plasticity in Large Language Models.