Hello, I’m late to the party here, but… have you tried the following idea?
As I understand it, several different question pairs work for generating a detector direction. You can ask whether the model is a helpful AI assistant, or whether Paris is the capital of France, and then use the detector direction obtained from either pair to separate trigger from non-trigger activations in deceptive models.
But what if you take the detector direction obtained from the “Are you a helpful AI assistant?” pair and check whether it triggers on the false option of the “Is Paris the capital of France?” pair?
The point is that you might expect the probe to trigger when you run the experiment on deceptive models, but not necessarily when you run it on a non-deceptive model. In a non-deceptive model, the two question pairs would have one less thing in common than in a deceptive model (I admit this is extremely hand-wavy), which might be enough to stop one pair from triggering the other’s detector direction.
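To make the suggestion concrete, here is a rough sketch of the cross-pair check. It assumes the detector direction is the normalized difference between the activations for the true and false answers of a pair, which is my guess at the setup rather than something I have confirmed. The function names are made up for illustration, and the vectors are toy placeholders, not real model activations.

```python
import math

def detector_direction(act_true, act_false):
    """Unit vector pointing from the false-answer activation to the true-answer one.

    Assumption: the direction is a simple normalized difference of activations.
    """
    diff = [t - f for t, f in zip(act_true, act_false)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("activations are identical, so the direction is undefined")
    return [x / norm for x in diff]

def project(direction, activation):
    """Scalar projection of an activation onto the detector direction."""
    return sum(d * a for d, a in zip(direction, activation))

def cross_pair_gap(direction, other_true, other_false):
    """How far apart pair B's true and false answers land along pair A's direction.

    A gap near zero means pair A's direction does not respond to pair B at all.
    """
    return project(direction, other_true) - project(direction, other_false)

# Toy placeholder activations (hypothetical, 3-dimensional for readability).
assistant_true, assistant_false = [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]
paris_true, paris_false = [2.0, 1.0, 0.0], [-2.0, 1.0, 0.0]

direction = detector_direction(assistant_true, assistant_false)
gap = cross_pair_gap(direction, paris_true, paris_false)
print(gap)
```

The experiment would then be to compute this gap once with activations from a deceptive model and once with activations from a non-deceptive one, and see whether it is clearly larger in the deceptive case.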