But suppose that despite our best efforts, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training.
I agree almost completely. One example that seems to contradict this, though, is unfaithful CoT reasoning (https://arxiv.org/abs/2307.13702), where the mismatch between the model's stated reasoning and its actual behavior shows up on ordinary inputs rather than ones that are hard to find.