My understanding (which could be off base) from reading the paper is the method’s accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they’re ultimately worried about will be harder to detect.
Still, 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even use the CoT outputs of the models.
(I think the “strategic deception” framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the “mesaoptimizer” framing.)
It might.