scasper
I agree with this take. In general, I would like to see self-distillation, distillation more broadly, and other network compression techniques studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and would probably be tractable to make progress on.
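For concreteness, here is a minimal sketch of the kind of self-distillation I have in mind, assuming a PyTorch classifier `teacher` suspected of carrying a backdoor and a loader `clean_loader` of trigger-free data (both names are hypothetical, for illustration only):

```python
import copy
import torch
import torch.nn.functional as F

def self_distill(teacher, clean_loader, epochs=5, lr=1e-3, temperature=2.0):
    """Train a fresh copy of `teacher` to match its soft outputs on clean data only.

    The hope is that behavior tied to rare triggers (backdoors, trojans) is not
    exercised by the clean data and so does not transfer to the student.
    """
    student = copy.deepcopy(teacher)
    # Re-initialize the student so it is not just a parameter copy of the teacher.
    for module in student.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()

    teacher.eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in clean_loader:  # labels unused; the teacher supplies soft targets
            with torch.no_grad():
                soft_targets = F.softmax(teacher(x) / temperature, dim=-1)
            loss = F.kl_div(
                F.log_softmax(student(x) / temperature, dim=-1),
                soft_targets,
                reduction="batchmean",
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```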
I buy this value: FV can augment exemplars. And I have never heard anyone say that FV is just better than exemplars; instead, I have heard the point that FV should be used alongside them. I think these two things make a good case for their value. But I still believe that more rigorous task-based evaluation and less reliance on intuition would have made for a much stronger approach than what actually happened.
Thanks! Fixed.
https://arxiv.org/abs/2210.04610
Thanks.
Are you concerned about AI risk from narrow systems of this kind?
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of “backbone” which I would not have used. I think we’re on the same page.
Thanks for the post. I’ll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:
We must grow interpretability and AI safety in the real world.
Strong +1 to working on more real-world-relevant approaches to interpretability.
Regulation is coming – let’s use it.
Strong +1 as well. Work on incorporating interpretability into regulatory frameworks seems neglected in practice by the AI safety interpretability community. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be something that has a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly prolific TAI. And because of the pace of governance, work now to establish concern, offices, precedent, case law, etc. seems uniquely key.
Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.
I do not see the reasoning or motivation for this, and it seems possibly harmful.
First, developing basic insights is clearly not just an AI safety goal. It’s an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work including some of my own.
Second, I don’t know of any examples of gaining particularly useful domain knowledge from interpretability-related work in deep learning, other than maybe the predictiveness of non-robust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but this isn’t really “interpretability”. Do you have other examples in mind? Progress in the last 6 years on reverse-engineering nontrivial systems has seemed tenuous at best.
So I’d be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by “Interpretability is the backbone of knowledge discovery with deep learning.”
I do not worry a lot about this. It would be a problem, but some methods are model-agnostic and would transfer fine, and other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank-one editing are more general principles that are not architecture-specific.
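To illustrate the general principle (a simplified least-squares version, not ROME’s exact covariance-weighted update): given a weight matrix `W`, a key vector `k`, and a desired output `v`, a rank-one update can force `W' k = v` while leaving directions orthogonal to `k` untouched.

```python
import torch

def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' @ k == v.

    A simplified rank-one edit; ROME additionally weights the update by the
    covariance of keys seen in training, but the principle is the same.
    """
    residual = v - W @ k
    return W + torch.outer(residual, k) / (k @ k)

# Example: edit a 4x3 weight matrix so that key k maps to value v.
W = torch.randn(4, 3)
k = torch.randn(3)
v = torch.randn(4)
W_edited = rank_one_edit(W, k, v)
assert torch.allclose(W_edited @ k, v, atol=1e-5)
```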
Thanks for the comment. I appreciate how thorough and clear it is.
Knowing “what deception looks like”—the analogue of knowing the target class of a trojan in a classifier—is a problem.
Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one.
Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.
+1, but this seems difficult to scale.
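As a rough sketch of this approach (assuming you already have collections `trojaned_models` and `clean_models` of small networks of the same architecture; names are hypothetical), one common recipe is to flatten each model’s weights into a feature vector and fit an off-the-shelf classifier. The scaling worry is that every training example here is itself a fully trained network.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def weight_features(model):
    """Concatenate all of a model's parameters into one flat feature vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()]).numpy()

def fit_trojan_detector(trojaned_models, clean_models):
    # Requires all models to share an architecture so feature vectors align.
    X = np.stack([weight_features(m) for m in trojaned_models + clean_models])
    y = np.array([1] * len(trojaned_models) + [0] * len(clean_models))
    return LogisticRegression(max_iter=1000).fit(X, y)

# For a new suspect model of the same architecture:
# detector.predict_proba(weight_features(suspect_model)[None])[:, 1]
# gives an estimated probability that it carries a trojan.
```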
Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.
+1, see https://arxiv.org/abs/2206.10673. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws.
(e.g. detecting an asteroid heading towards the earth)
This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn’t be called deceptive. I don’t think my definition of deceptive alignment applies to this because my definition requires that the model does something we don’t want it to.
Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.
Strong +1. This points out a difference between trojans and deception. I’ll add this to the post.
This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn’t trying to do bad things.
+1
Thanks!
EIS XII: Summary
EIS XI: Moving Forward
Thanks. See also EIS VIII.
Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I’m not accounting for something. Either way, it seems useful to figure out the disagreement.
Thanks! I’ll be glad to have this post around to refer to in the future, and I’ll probably do so a lot. Glad you have found some of it interesting.
EIS X: Continual Learning, Modularity, Compression, and Biological Brains
EIS IX: Interpretability and Adversaries
thanks
EIS VIII: An Engineer’s Understanding of Deceptive Alignment
Yes, it does show the ground truth.
The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling “function”, I really mean labeling “program” in this case.
The MNIST CNN was trained only on the 50k training examples.
I did not guarantee that the models had perfect train accuracy. I don’t believe they did.
I think that any interpretability tools are allowed. Saliency maps are fine. But to ‘win,’ a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non-mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data.
At the end of the day, I (and possibly Neel) will have the final say in things.
Thanks :)
EIS VII: A Challenge for Mechanists
Thanks for the comment and pointing these things out.
---
I don’t see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names.
Certainly it’s not necessarily a good thing either. I would posit that isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
I don’t see what we gain in this particular case from keeping polysemanticity, superposition, and entanglement as separate literatures. Do you have a steelman that is more specific to these literatures?
---
In fact it’s almost like a running joke in academia that there’s always someone grumbling that you didn’t cite the right things (their favourite work on this topic, their fellow countryman, them etc.)...
Good point. I would not say that the issue with the feature visualization and Zoom In papers was merely failing to cite related work. I would say that the issue is how they started a line of research that is causing confusion and redundant work. My stance here is based on seeing the isolation between the two types of work as needless.
---
I understand that your take is that it is closer to program synthesis or program induction and that these aren’t all the same thing, but in the first subsection of the “TAISIC has reinvented...” section, I’m a little confused why there’s no mention of reverse engineering programs from compiled binaries? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about, and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).
Thanks for pointing out these posts. They are examples of discussing a similar idea to MI’s dependency on programmatic hypothesis generation, but they don’t act on it: they both draw analogies instead of providing methods. The thing at the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is what I mentioned in the paragraph about existing work outside of TAISIC, pasted below for convenience :)
If MI work is to be more engineering-relevant, we need automated ways of generating candidate programs to explain how neural networks work. The good news is that we don’t have to start from scratch. The program synthesis, induction, and language translation literatures have been around long enough that we have textbooks on them (Gulwani et al., 2017, Qiu, 1999). And there are also notable bodies of work in deep learning that focus on extracting decision trees from neural networks (e.g. Zhang et al., 2019), distilling networks into programs in domain specific languages (e.g. Verma et al., 2018; Verma et al., 2019; Trivedi et al., 2021), and translating neural network architectures into symbolic graphs that are mechanistically faithful (e.g. Ren et al., 2021). These are all automated ways of doing the type of MI work that people in TAISIC want to do. Currently, some of these works (and others in the neurosymbolic literature) seem to be outpacing TAISIC on its own goals.
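As a toy example of the kind of automated extraction I mean (not any of the cited methods specifically), one can distill a trained network’s input–output behavior into a shallow decision tree and then read the tree off as a candidate program. Here `net` and `X` are hypothetical stand-ins for a small PyTorch classifier and a batch of inputs:

```python
import torch
from sklearn.tree import DecisionTreeClassifier, export_text

def distill_to_tree(net, X, max_depth=4):
    """Fit a small decision tree to imitate `net`'s hard predictions on inputs X.

    The resulting tree is a crude, human-readable candidate program for what the
    network computes; its fidelity to `net` should be checked on held-out data.
    """
    with torch.no_grad():
        labels = net(torch.as_tensor(X, dtype=torch.float32)).argmax(dim=-1).numpy()
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, labels)
    print(export_text(tree))  # the extracted candidate "program"
    return tree
```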
One idea that comes to mind is to see if a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts.
I’d also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974