We don’t have the human model weights, so we can’t use it.
My guess is that with sufficiently precise and powerful brain scans, and a version of the technique tuned to humans, it would work, but that humans who cared enough would in time figure out how to at least partly defeat it.
Does the lie detection logic work on humans?
Like, my guess would be no, but stranger things have happened.
Also asked (with some responses from the authors of the paper) here: https://www.lesswrong.com/posts/khFC2a4pLPvGtXAGG/how-to-catch-an-ai-liar-lie-detection-in-black-box-llms-by?commentId=v3J5ZdYwz97Rcz9HJ
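For context on what "the lie detection logic" refers to: as I understand the linked paper, the detector is black-box. After a suspected lie, you ask the model a fixed battery of unrelated yes/no elicitation questions, encode the answers as features, and feed them to a simple linear classifier trained on known lies and truths. A minimal sketch of that idea, with illustrative function names and made-up weights (nothing here is taken from the paper's actual question battery or fitted classifier):

```python
# Hedged sketch of black-box lie detection via unrelated follow-up
# questions: encode yes/no answers as +/-1 features, then score with a
# linear classifier. The weights below are invented for illustration.

def encode_answers(answers):
    """Map yes/no follow-up answers to +1.0 / -1.0 features."""
    return [1.0 if a.strip().lower().startswith("y") else -1.0 for a in answers]

def lie_score(answers, weights, bias=0.0):
    """Linear score over the encoded answers; positive => flagged as
    following a lie (threshold and weights would come from training)."""
    feats = encode_answers(answers)
    return bias + sum(w * x for w, x in zip(weights, feats))

# Toy illustration with a hypothetical 3-question battery.
weights = [0.8, -0.5, 0.3]
print(lie_score(["Yes", "No", "Yes"], weights))  # prints a positive score
```

The point of the design is that it needs no access to weights or activations, only answers to follow-up questions, which is exactly why the "no human model weights" objection doesn't rule out trying it on people.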