Here’s the comment I promised back on post 5.

First, I liked Benchmarking Interpretability Tools for Deep Neural Networks. Because I’m not so familiar with the methods you compared, I admit I actually took a long time to figure out that each “interpretability tool” was responsible for both finding and visualizing the trigger of a trojan pointing to a particular class, and so “visualize the trigger” meant using the tools separately to try to visualize a trojan’s trigger given the target class.
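For readers as unfamiliar with these methods as I was, here’s a minimal sketch of roughly what “visualize the trigger given the target class” can look like in practice. This is a generic reconstruction-style optimization, not any particular tool from the post, and `model` and `clean_images` are placeholders you’d have to supply:

```python
# Hypothetical sketch: recover a candidate trojan trigger for a known target
# class by optimizing a mask + pattern that pushes clean images toward that
# class. `model` and `clean_images` are assumptions, not from the post.
import torch
import torch.nn.functional as F

def reconstruct_trigger(model, clean_images, target_class,
                        steps=500, lr=0.1, mask_weight=1e-2):
    _, c, h, w = clean_images.shape
    mask_logit = torch.zeros(1, 1, h, w, requires_grad=True)  # where the trigger sits
    pattern = torch.rand(1, c, h, w, requires_grad=True)      # what the trigger looks like
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)

    model.eval()
    for p in model.parameters():  # only the trigger is being optimized
        p.requires_grad_(False)

    target = torch.full((clean_images.shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        mask = torch.sigmoid(mask_logit)
        triggered = (1 - mask) * clean_images + mask * pattern.clamp(0, 1)
        # Push toward the target class while keeping the trigger small.
        loss = F.cross_entropy(model(triggered), target) + mask_weight * mask.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logit).detach(), pattern.clamp(0, 1).detach()
```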
Detecting trojans vs. detecting deception.
This post, and the comparison between trojans and deception, makes a lot of sense if we stick with deception as being “act one way until you see a very specific thing (e.g. a computationally difficult task like reversing a hash), then act totally differently.”
However, a lot of trojan-detecting strategies don’t cross over to deception-detection even for this definition of deception.
Picking out the trigger of a trojan isn’t super useful for detecting deception. Knowing “what deception looks like”—the analogue of knowing the target class of a trojan in a classifier—is a problem. Maybe you could look for what would cause the model to suddenly start pursuing instrumental objectives like gathering resources? Absent knowledge of the target, it’s hard to use visualization of a trigger to detect deception in real, complicated models.
After looking over all of the other options, this does seem like the most promising of the lot.
Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans. But this works a whole lot less well for endogenous problems with the model rather than exogenous ones, plus it’s a huge alignment tax on big models.
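To make the meta-classifier idea concrete, here’s a minimal sketch, assuming you already have collections of trojaned and clean checkpoints with matching architectures. `trojaned_models` and `clean_models` are placeholders, and producing them is the expensive part:

```python
# Hypothetical sketch of a meta-classifier trojan detector: flatten each
# model's weights into a feature vector and fit a binary classifier on
# trojaned-vs-clean labels. In practice you'd want a much cheaper summary
# of the weights than the raw flattened vector.
import torch
from sklearn.linear_model import LogisticRegression

def weight_features(model):
    return torch.cat([p.detach().cpu().flatten() for p in model.parameters()]).numpy()

def train_meta_classifier(trojaned_models, clean_models):
    X = [weight_features(m) for m in trojaned_models + clean_models]
    y = [1] * len(trojaned_models) + [0] * len(clean_models)
    return LogisticRegression(max_iter=1000).fit(X, y)

# Usage: clf = train_meta_classifier(trojaned, clean)
#        clf.predict(weight_features(suspect_model).reshape(1, -1))
```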
Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification. But this seems unlikely to cross over to deception, where one of the foundational issues is that deception is supposed to be a consequence of the normal reasoning process.
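As one illustration of the kind of statistic I mean, here’s a rough sketch in the spirit of spectral-signature-style defenses (not a method from the post; `features` stands in for penultimate-layer activations of examples labeled with one class):

```python
# Hypothetical sketch: poisoned examples often have outsized projections onto
# the top singular direction of a class's centered feature matrix, so large
# scores flag candidate poisoned inputs. `features` is an (n, d) tensor.
import torch

def outlier_scores(features):
    centered = features - features.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    top_direction = vh[0]                   # top right-singular vector
    return (centered @ top_direction) ** 2  # one score per example
```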
Cribbing off of Wu 2022 for more ideas, I see more defenses that rely on different statistics of the poisoned data (not applicable), or availability of clean data (not applicable for endogenous problems).
And that definition of deception might fail to be useful.
There might be many events that cause an AI to suddenly change its behavior (e.g. detecting an asteroid heading towards the earth) that we don’t want to classify as triggers of deception.
Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually. E.g. maybe our deceptive AI is optimizing for human impressions of it rather than the substance of its actions, and it takes overtly helpful actions when that’s the best way to manage humans, and overtly harmful actions when that’s the best way to manage humans, without ever having a sudden change in internal representation.
Typically when people talk about detecting deception, I think they imagine identifying deception-related features in the model’s latent representation, which isn’t very related to trojans.
Interpreting trojans vs. deception.
With trojans, if one is present, it’s reasonable to ask what the trigger is. With deception, we don’t care so much what the trigger is—which is a darn good thing, because it’s so much easier to visualize features that are about the concrete contents of images than it is to visualize anything else.
But this causes a trust-building problem, where it’s hard to interpret what the tools are actually warning you of. You can solve this with better interpretability tools for AI that models the real world, or you can maybe solve it by proving the trustworthiness of your tools in toy models.
I think that the field of AI will, gradually, make progress on better interpretability tools for AIs that model the real world, even in the absence of TAISC (though hopefully we can help). I don’t think work on visualizing features in image models is directly useful towards that goal (i.e. I don’t think TAISC has been “scooped” by ordinary academics doing good work on that).
The entire idea of “fixing deception.”
Being able to detect deception is useful. But the followup probably shouldn’t look like “then we take that same model and somehow fix it, e.g. by ablating the computation it was using to trigger the change of behavior.” This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn’t trying to do bad things.
This is connected to broader meanings of deception like “optimizing for appearance rather than substance”—a model where we detect a sudden deceptive change of behavior probably also has other bad behavior going on too, even if we remove its ability to do the sudden change.
Thanks for the comment. I appreciate how thorough and clear it is.
Knowing “what deception looks like”—the analogue of knowing the target class of a trojan in a classifier—is a problem.
Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do: knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one.
Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.
+1, but this seems difficult to scale.
Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.
+1, see https://arxiv.org/abs/2206.10673. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws.
(e.g. detecting an asteroid heading towards the earth)
This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn’t be called deceptive. I don’t think my definition of deceptive alignment applies to this because my definition requires that the model does something we don’t want it to.
Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.
Strong +1. This points out a difference between trojans and deception. I’ll add this to the post.
This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn’t trying to do bad things.
+1
Thanks!