Also maybe worth noting: in order for the AI to even be able to knowingly lie/deceive, it would have to be capable of reasoning about questions like:

"Would the text I am about to output cause the operators to believe things which do not match reality?"
Before that level of capability, it seems unclear whether there could be any {activations correlated with lying}, since the AI would not really even be capable of (intentionally) lying. And after that level of capability, the AI would be able to reason about itself, the operators, their epistemic states, etc.; i.e. the AI might have all the intelligence/understanding needed to invent the kinds of deceptive self-modifications described in the previous comment.
And so there might not be any capability regime in which {A Misaligned Model Would Have Activations Correlated With Lying}. Or that regime might be very short-lived/narrow.