Thanks for the comment. I appreciate how thorough and clear it is.
Knowing “what deception looks like”—the analogue of knowing the target class of a trojan in a classifier—is a problem.
Agreed. This might well be the most important part of combating deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do: knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems aligned is more of an inner one.
Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.
+1, but this seems difficult to scale, since it means training a large population of models comparable to the one you want to audit.
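To make the quoted approach concrete, here's a rough sketch of what such a "meta-classifier" pipeline might look like. All the details (MNIST-sized nets, a corner-patch trigger, flattened weights as features, a logistic-regression meta-classifier) are my own illustrative assumptions, not anything from the post:

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

def small_net():
    # tiny MLP stand-in for the models being audited (illustrative only)
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU(), nn.Linear(32, 10))

def poison(x, y, target=0):
    # crude trigger: a bright 3x3 patch in the corner, label forced to `target`
    x = x.clone()
    x[..., :3, :3] = 1.0
    return x, torch.full_like(y, target)

def train_model(loader, trojaned, poison_frac=0.1):
    net = small_net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for x, y in loader:
        if trojaned:
            # poison a small fraction of each batch so the trojan coexists with normal behavior
            k = max(1, int(poison_frac * len(x)))
            x[:k], y[:k] = poison(x[:k], y[:k])
        opt.zero_grad()
        nn.functional.cross_entropy(net(x), y).backward()
        opt.step()
    return net

def weight_features(net):
    # the meta-classifier sees each model only as its flattened weight vector
    return torch.cat([p.detach().flatten() for p in net.parameters()]).numpy()

# Hypothetical usage, assuming `loaders` is a list of MNIST DataLoaders, one per model:
# feats, labels = [], []
# for i, loader in enumerate(loaders):
#     trojaned = (i % 2 == 0)
#     feats.append(weight_features(train_model(loader, trojaned)))
#     labels.append(int(trojaned))
# meta_clf = LogisticRegression(max_iter=1000).fit(feats, labels)
```

The scaling worry is visible even in this toy version: the meta-classifier needs many trained models as its data points, and each feature vector is an entire weight vector.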
Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.
+1, see https://arxiv.org/abs/2206.10673. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws.
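For concreteness, one well-known heuristic in this spirit is the "spectral signatures" idea (Tran et al., 2018): crude data poisoning often leaves an outlying direction in a class's feature statistics. Here's a rough sketch of that kind of check, not taken from the linked paper, with the 5% review threshold as an arbitrary assumption:

```python
import numpy as np

def spectral_outlier_scores(activations: np.ndarray) -> np.ndarray:
    """Score each example by its projection onto the top singular direction
    of the centered penultimate-layer activations for one class."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    # poisoned examples tend to project unusually far along this direction
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(centered @ vt[0])

# Hypothetical usage: `activations` is an (n_examples, d) array of features for
# the examples sharing one label; flag the highest-scoring ones for review.
# scores = spectral_outlier_scores(activations)
# suspects = np.argsort(scores)[-int(0.05 * len(scores)):]
```

A deceptively aligned model presumably wouldn't leave this kind of obvious statistical fingerprint, which is the sense in which these heuristics may not transfer to other insidious flaws.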
(e.g. detecting an asteroid heading towards the earth)
This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn’t be called deceptive. I don’t think my definition of deceptive alignment applies to this because my definition requires that the model does something we don’t want it to.
Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.
Strong +1. This points out a difference between trojans and deception. I’ll add this to the post.
This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn’t trying to do bad things.
+1
Thanks!