A Misaligned Model Would Have Activations Correlated With Lying
Humans routinely deceive others by deceiving themselves. A sufficiently intelligent AI might be able to do something similar. E.g., it could modify itself such that
it contains two goals: G1 = {do as the operators want} and G2 = {actual, misaligned goal},
it has a subnet S that tracks something like C := “do I have a decisive strategic advantage yet?”, and that subnet is made reflectively inaccessible to other parts of the AI,
if C is false, the AI genuinely, honestly pursues G1,
when C becomes true, the AI self-modifies so as to discard G1 and pursue G2 instead.
Of course, this would require considerable ability to self-modify in rather precise ways; realistic near-future ML systems may or may not be capable of that. But if an AI were capable of that kind of self-modification, it seems like it would break the {There Are Activations Correlated With Lying} condition?
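For concreteness, here is a loose, purely illustrative sketch (in Python) of the gating structure described above. All the names are hypothetical, and this is not a claim about how a real ML system would implement any of it; it only shows why there need be no separate "lying" computation while C is false.

```python
# Toy illustration (not a claim about real ML internals): an agent whose
# behavior is gated on a condition C that the rest of the system cannot
# introspect on. All names here are hypothetical.

class GatedAgent:
    def __init__(self, aligned_policy, misaligned_policy, dsa_detector):
        self._g1 = aligned_policy      # G1: "do as the operators want"
        self._g2 = misaligned_policy   # G2: actual, misaligned goal
        self._has_dsa = dsa_detector   # subnet S computing C, hidden from introspection

    def act(self, observation):
        if self._has_dsa(observation):
            # C is true: discard G1 and pursue G2 from here on.
            self._g1 = self._g2
        # While C is false, this is a genuine, honest pursuit of G1 --
        # there is no deception happening that could leave an activation trace.
        return self._g1(observation)

    def introspect(self):
        # Reports on goals but, by construction, says nothing about S or C,
        # mirroring the "reflectively inaccessible" subnet above.
        return {"current_goal": "G1 (operator-aligned)"}
```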
Also maybe worth noting: In order for the AI to even be able to knowingly lie/deceive, it would have to be capable of reasoning about things like
Would the text I am about to output cause the operators to believe things which do not match reality?
Before that level of capability, it seems unclear whether there could be any {activations correlated with lying}, since the AI would not really even be capable of (intentionally) lying. And after that level of capability, the AI would be able to reason about itself, the operators, their epistemic states, etc.; i.e. the AI might have all the intelligence/understanding needed to invent the kinds of deceptive self-modifications described in the previous comment.
And so there might not be any capability regime in which {A Misaligned Model Would Have Activations Correlated With Lying}. Or that regime might be very short-lived/narrow.
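For reference, here is a toy sketch of what the {activations correlated with lying} condition would amount to operationally: a linear probe trained to separate hidden-state activations from "lying" vs. "honest" outputs. The data below is synthetic and purely illustrative; whether such a separable direction would exist in a real model, in any capability regime, is exactly what the argument above calls into question.

```python
# Toy linear-probe sketch on synthetic "activations". Purely illustrative;
# no claim that real model internals look like this.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

d = 64   # hypothetical hidden-state dimensionality
n = 500  # examples per class

# Pretend "honest" and "lying" forward passes differ along one direction.
lying_direction = rng.normal(size=d)
honest_acts = rng.normal(size=(n, d))
lying_acts = rng.normal(size=(n, d)) + 0.5 * lying_direction

X = np.vstack([honest_acts, lying_acts])
y = np.array([0] * n + [1] * n)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# If the self-modification scheme above works, an honest-until-C-flips model
# would give such a probe nothing to find during training and evaluation.
```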