Your AUCs aren’t great for the Turpin et al. datasets. Did you try explicitly selecting questions or tuning weights for those datasets to see if the same lie detector technique would work?
We didn’t try this.
I am preregistering that it’s possible, and that further sycophancy-style follow-up questions would work well (the model is more sycophantic if it has previously been sycophantic).
We didn’t try this.
This is also my prediction.