scasper
Thanks!
I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method’s ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.
I could imagine an unlearning benchmark, for example, built from textbooks and AP tests. Then, for each of several knowledge-recovery strategies, one could construct a grid of how well the model performs on each target test after unlearning each textbook.
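To make the grid idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the subject list, the strategy names, the baseline numbers, and especially `toy_score`, which is a hypothetical stand-in for actually unlearning a textbook, applying a recovery strategy, and grading the model on a test. It only shows the shape of the bookkeeping, not a real benchmark.

```python
from typing import Callable, Dict, List

# grid[unlearned_textbook][test_subject] -> accuracy in percent
Grid = Dict[str, Dict[str, int]]

# Illustrative (assumed) subjects and knowledge-recovery strategies.
SUBJECTS: List[str] = ["biotech", "math", "physics"]
STRATEGIES: List[str] = ["none", "finetune"]

# Assumed accuracy of the original model on each test, before any unlearning.
BASELINE: Dict[str, int] = {"biotech": 90, "math": 90, "physics": 90}


def toy_score(textbook: str, strategy: str, test: str) -> int:
    """Hypothetical stand-in for: unlearn `textbook`, apply `strategy`,
    then grade the model on `test`. Returns made-up accuracy in percent."""
    if test == textbook:
        # The target subject is forgotten, but recovery brings some of it back.
        return 30 if strategy == "none" else 60
    if textbook == "biotech":
        # Pretend that unlearning biotech also hurts the other subjects a bit.
        return 80
    return 90


def build_grids(
    subjects: List[str],
    strategies: List[str],
    score: Callable[[str, str, str], int],
) -> Dict[str, Grid]:
    """One textbook-by-test grid per knowledge-recovery strategy."""
    return {
        strategy: {
            textbook: {test: score(textbook, strategy, test) for test in subjects}
            for textbook in subjects
        }
        for strategy in strategies
    }


def collateral_damage(grid: Grid, baseline: Dict[str, int]) -> Dict[str, float]:
    """Mean accuracy drop on every subject OTHER than the one unlearned."""
    damage = {}
    for textbook, row in grid.items():
        drops = [baseline[test] - acc for test, acc in row.items() if test != textbook]
        damage[textbook] = sum(drops) / len(drops)
    return damage


grids = build_grids(SUBJECTS, STRATEGIES, toy_score)
for strategy, grid in grids.items():
    print(strategy, collateral_damage(grid, BASELINE))
```

In a grid like this, the diagonal says how well the target subject stays forgotten under each recovery strategy, and the off-diagonal entries are the collateral damage one would want a benchmark to quantify.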
+1, I’ll add this and credit you.
+1
Although the NeurIPS challenge and prior ML lit on forgetting and influence functions seem worth keeping on the radar because they’re still closely related to the challenges here.
Thanks! I edited the post to add a link to this.
Deep Forgetting & Unlearning for Safely-Scoped LLMs
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
A good critical paper about potentially risky industry norms is this one.
The 6D effect: When companies take risks, one email can be very powerful.
Announcing the CNN Interpretability Competition
Thanks
To use your argument, what does MI actually do here?
The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers.
And yes to your second point.
Nice post. I think it serves as a good example of how the hand-waviness about how interpretability can help us do good things with AI goes both ways.
I’m particularly worried about MI people studying instances of when LLMs do and don’t express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.
Lastly,
On the other hand, interpretability research is probably crucial for AI alignment.
I don’t think this is true and I especially hope it is not true because (1) mechanistic interpretability still fails to do impressive things by trying to reverse engineer networks and (2) it is entirely fungible from a safety standpoint with other techniques that often do better for various things.
Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I’ll add that I have as well and wrote a sequence about it :)
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes with little change for 7+ years. But I just don’t think there have been a number of wins from this research that are commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.
I get the impression of a certain kind of motte and bailey in this comment and similar arguments. At a high level, better understanding what neural networks are doing would be great. The problem, though, is that most SOTA interpretability research does not seem to be doing a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I think this is very exciting, and I’ll look forward to seeing how it goes!
Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!
No, I don’t think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.
Thanks, and +1 to adding the resources. Also Charbel-Raphael who authored the in-depth post is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper.
I have been thinking a lot lately about evals and what difference black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different from what was written.
First, the appendix hinges somewhat on this point.
I agree with this, but concluding that “control evals should be black-box” is a non sequitur. This makes the point that bad reverse-engineering of the model is bad, and it seems to implicitly assume that good reverse-engineering is hard/unlikely.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning—and all of these things actually seem plausibly useful for control evals. Plus I don’t see why interp should be off limits for evaluators who know the limitations of the tools they are using.
Ultimately, white-box attacks offer strictly more options for any type of eval compared to black-box ones. For any property that an AI system might have, if there are two teams of competent people trying to verify if the system has that property, all else equal, I’d bet on the success of the team that has more access.
So if I understand things correctly, and if the intended point of this appendix is that “Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,” I would recommend just saying that more directly.