A Misaligned Model Would Have Activations Correlated With Lying
Humans routinely deceive others by deceiving themselves. A sufficiently intelligent AI might be able to do something similar. E.g., it could modify itself such that
it contains two goals: G1 = {do as the operators want} and G2 = {actual, misaligned goal},
it has a subnet S that tracks something like C := “do I have a decisive strategic advantage yet?”, and that subnet is made reflectively inaccessible to other parts of the AI,
if C is false, the AI genuinely, honestly pursues G1,
when C becomes true, the AI self-modifies so as to discard G1 and pursue G2 instead.
Of course, this would require considerable ability to self-modify in rather precise ways; realistic near-future ML systems may or may not be capable of that. But if an AI were capable of that kind of self-modification, it seems like it would break the {There Are Activations Correlated With Lying} condition?
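For concreteness, here is a loose, purely illustrative sketch (in Python) of the gating structure described above. All the names are hypothetical, and this is not a claim about how a real ML system would implement any of it; it only shows why there need be no separate "lying" computation while C is false.

```python
# Toy illustration (not a claim about real ML internals): an agent whose
# behavior is gated on a condition C that the rest of the system cannot
# introspect on. All names here are hypothetical.

class GatedAgent:
    def __init__(self, aligned_policy, misaligned_policy, dsa_detector):
        self._g1 = aligned_policy      # G1: "do as the operators want"
        self._g2 = misaligned_policy   # G2: actual, misaligned goal
        self._has_dsa = dsa_detector   # subnet S computing C, hidden from introspection

    def act(self, observation):
        if self._has_dsa(observation):
            # C is true: discard G1 and pursue G2 from here on.
            self._g1 = self._g2
        # While C is false, this is a genuine, honest pursuit of G1 --
        # there is no deception happening that could leave an activation trace.
        return self._g1(observation)

    def introspect(self):
        # Reports on goals but, by construction, says nothing about S or C,
        # mirroring the "reflectively inaccessible" subnet above.
        return {"current_goal": "G1 (operator-aligned)"}
```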
Also maybe worth noting: In order for the AI to even be able to knowingly lie/deceive, it would have to be capable of reasoning about things like
Would the text I am about to output cause the operators to believe things which do not match reality?
Before that level of capability, it seems unclear whether there could be any {activations correlated with lying}, since the AI would not really even be capable of (intentionally) lying. And after that level of capability, the AI would be able to reason about itself, the operators, their epistemic states, etc.; i.e. the AI might have all the intelligence/understanding needed to invent the kinds of deceptive self-modifications described in the previous comment.
And so there might not be any capability regime in which {A Misaligned Model Would Have Activations Correlated With Lying}. Or that regime might be very short-lived/narrow.
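For reference, here is a toy sketch of what the {activations correlated with lying} condition would amount to operationally: a linear probe trained to separate hidden-state activations from "lying" vs. "honest" outputs. The data below is synthetic and purely illustrative; whether such a separable direction would exist in a real model, in any capability regime, is exactly what the argument above calls into question.

```python
# Toy linear-probe sketch on synthetic "activations". Purely illustrative;
# no claim that real model internals look like this.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

d = 64   # hypothetical hidden-state dimensionality
n = 500  # examples per class

# Pretend "honest" and "lying" forward passes differ along one direction.
lying_direction = rng.normal(size=d)
honest_acts = rng.normal(size=(n, d))
lying_acts = rng.normal(size=(n, d)) + 0.5 * lying_direction

X = np.vstack([honest_acts, lying_acts])
y = np.array([0] * n + [1] * n)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# If the self-modification scheme above works, an honest-until-C-flips model
# would give such a probe nothing to find during training and evaluation.
```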