A straw version of this, which isn’t exactly what I mean but is sort of the right intuition: suppose you ask, what’s the input that maximizes the output of this neuron? You’ll see that this particular neuron is a deception classifier. It looks at the input and does some computation with it; maybe the input is a dialogue between two people, and this neuron is telling you, “Hey, is person A trying to deceive person B right now?” That’s an example of the sort of thing I am imagining.
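As a concrete illustration of the “what input maximizes this neuron?” idea, here is a minimal activation-maximization sketch in PyTorch. The model, layer, and neuron index are hypothetical stand-ins for whatever network is actually under inspection, not anything specified in the conversation.

```python
# Sketch of activation maximization: optimize an input to maximize one
# hidden unit's activation. The model and neuron index are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a trained network under inspection.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()

neuron_index = 7  # hypothetical: the unit we suspect is a "deception classifier"

# Optimize a continuous input to maximize that unit's activation.
x = torch.randn(1, 64, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    hidden = model[1](model[0](x))   # activations after the ReLU layer
    loss = -hidden[0, neuron_index]  # maximize activation = minimize its negative
    loss.backward()
    optimizer.step()

print("activation reached:", -loss.item())
```

Inspecting the optimized input (here an abstract vector; for a language model it would be an embedding or a discrete search over tokens) is what lets you guess which concept the neuron is tracking.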
DF Huh, it’s plausible that I’m misunderstanding you, but I imagine this being insufficient for safety monitoring because (a) many non-deceptive AIs are going to have the concept of deception anyway, because it’s useful, (b) you can’t statically tell whether the network is going to aim for deception just from knowing that it has a representation of deception, and (c) you don’t have a hope of monitoring it online to check whether the deception neuron is lighting up when it’s talking to you.
FWIW, I believe in the negation of some version of my point (b): some static analysis reveals an evaluation and planning model, and you find out that in some situations the agent prefers to be deceptive. Of course, this static analysis would be significantly more sophisticated than current techniques.
RS Yeah, I agree with all of these critiques. I think I’m more pointing at the intuition for why we should expect this to be easier than we might initially think, rather than saying that this specific idea is going to work.
E.g. maybe this is a reason that (relaxed) adversarial training actually works great, since the adversary can check whether the deception neuron is lighting up.
DF Seems fair, and I think this kind of intuition is why I research what I do.
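To make the kind of check RS mentions concrete, here is a minimal sketch of an online monitor (or an adversary’s probe) that watches a flagged unit during a forward pass. The flagged neuron index and firing threshold are hypothetical placeholders.

```python
# Minimal sketch of monitoring a (hypothetical) flagged neuron: register a
# forward hook on the layer containing it and record when it fires strongly.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

neuron_index = 7   # hypothetical "deception neuron"
threshold = 3.0    # hypothetical firing threshold
alerts = []

def monitor(module, inputs, output):
    activation = output[0, neuron_index].item()
    if activation > threshold:
        alerts.append(activation)

hook = model[1].register_forward_hook(monitor)  # hook the ReLU's output

with torch.no_grad():
    _ = model(torch.randn(1, 64))

hook.remove()
print("flagged activations:", alerts)
```

In a (relaxed) adversarial training setup, the adversary could run inputs through such a monitor and treat above-threshold firings as candidate failures to train against.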