I agree pretty strongly with Neel’s first point here, and I want to expand on it a bit: one of the biggest issues with interp is fooling yourself and thinking you’ve discovered something profound when in reality you’ve misinterpreted the evidence. Sure, you’ve “understood grokking”[1] or “found induction heads”, but why should anyone think that you’ve done something “real”, let alone something that will help with future dangerous AI systems? Getting rigorous results in deep learning in general is hard, and it seems empirically even harder in (mech) interp.
You can try to get around this by being extra rigorous and building from the ground up anyways. If you can present a ton of compelling evidence at every stage of resolution for your explanation, which in turn explains all of the behavior you care about (let alone a proof), then you can be pretty sure you’re not fooling yourself. (But that’s really hard, and deep learning especially has not been kind to this approach.) Or, you can try to do something hard and novel on a real system that can’t be done with existing knowledge or techniques. If you succeed at this, then even if your specific theory is not necessarily true, you’ve at least shown that it’s real enough to produce something of value. (This is a fancy way of saying, “new theories should make novel predictions/discoveries and test them if possible”.)
From this perspective, studying refusal in LLMs is not necessarily more x-risk relevant than studying, say, why LLMs seem to hallucinate, why linear probes seem to be so good for many use cases (and where they break), or the effects of helpfulness/agency/tool-use finetuning in general. (And I suspect that poking hard at some of the weird results from the cyborgism crowd may be more relevant.) But it’s a hard topic that many people care about, and so succeeding here provides a better argument for the usefulness of their specific model-internals-based approach than studying something more niche.
It’s “easier” to study harmlessness than other comparably important or hard topics. Not only is there a lot of financial interest from companies, there’s a lot of supporting infrastructure already in place to study harmlessness. If you wanted to study the exact mechanism by which Gemini Ultra is e.g. so good at confabulating undergrad-level mathematical theorems, you’d immediately run into the problem that you don’t have Gemini internals access (and even if you did, the code is almost certainly not set up for easily poking around inside the model). But if you study a mechanism like refusal training, where there are open-source models that are refusal trained and where datasets and prior work are plentiful, you’re able to leverage existing resources.
Many of the other things AI Labs are pushing hard on are just clear capability gains, which many people morally object to. For example, I’m sure many people would be very interested if mech interp could significantly improve pretraining, or suggest more efficient sparse architectures. But I suspect most x-risk focused people would not want to contribute to these topics.
Now, of course, there are the standard reasons why it’s bad to study popular/trendy topics, including conflating your line of research with contingent properties of the topic (AI Alignment is just RLHF++, AI Safety is just harmlessness training), getting into a crowded field, being misled by prior work, etc. But I’m a fan of model internals researchers (esp mech interp researchers) applying their research to problems like harmlessness, even if it’s just to highlight the way in which mech interp is currently inadequate for these applications.
Also, I would be upset if people started going “the reason this work is x-risk relevant is because of preventing jailbreaks” unless they actually believed this, but this is more of a general distaste for dishonesty as opposed to jailbreaks or harmlessness training in general.
(Also, harmlessness training may be important under some catastrophic misuse scenarios, though I struggle to imagine a concrete case where end user-side jailbreak-style catastrophic misuse causes x-risk in practice, before we get more direct x-risk scenarios from e.g. people just finetuning their AIs in dangerous ways.)
For example, I think our understanding of Grokking in late 2022 turned out to be importantly incomplete.
Lawrence, how are these results any more grounded than any other interp work?
To be clear: I don’t think the results here are qualitatively more grounded than e.g. other work in the activation steering/linear probing/representation engineering space. My comment was a defense of studying harmlessness in general and less so of this work in particular.
If the objection isn’t about this work vs other rep eng work, I may be confused about what you’re asking about. It feels pretty obvious that this general genre of work (studying non-cherry picked phenomena using basic linear methods) is as a whole more grounded than a lot of mech interp tends to be? And I feel like it’s pretty obvious that addressing issues with current harmlessness training, if they improve on state of the art, is “more grounded” than “we found a cool SAE feature that correlates with X and Y!”? In the same way that just doing AI control experiments is more grounded than circuit discovery on algorithmic tasks.
Yeah, definitely, I agree with the implication; I was confused because I don’t think that these techniques do improve on the state of the art.
If that were true, I’d expect the reactions to a subsequent LLaMA3 weight orthogonalization jailbreak to be more like “yawn, we already have better stuff” and not “oh cool, this is quite effective!” Seems to me from the reception that this is letting people either do new things or do them faster, but maybe you have a concrete counter-consideration here?
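For readers unfamiliar with the technique: as I understand it, weight orthogonalization means projecting the refusal direction out of the weight matrices that write into the residual stream, so the edited model can never write along that direction. A minimal numpy sketch (the function name, shapes, and row-vector convention are mine, not from the paper):

```python
import numpy as np

def orthogonalize_rows(W: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction` from every row of W.

    If the rows of W are the vectors written into the residual stream,
    the edited matrix can no longer write anything along `direction`.
    """
    r = direction / np.linalg.norm(direction)  # unit "refusal" direction
    return W - np.outer(W @ r, r)              # subtract each row's projection onto r

# Toy check on random weights: after editing, no row has a component along r.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=4)
W_edited = orthogonalize_rows(W, r)
```

Unlike activation steering at inference time, this bakes the ablation into the weights, which is presumably why it works as a one-shot jailbreak on open-weight models.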
This is a very reasonable criticism. I don’t know, I’ll think about it. Thanks.
Thanks, I’d be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:
Passing the Twitter test (for at least one user)
Being used by Simon Lerman, an author on Bad LLaMA (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research (EDIT: rather than Simon choosing to fine-tune it, which he clearly knows how to do, being a Bad LLaMA author).
Thanks! Broadly agreed
I’d be curious to hear more about what you meant by this
I don’t know what the “real story” is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying “something something weight norm ergo generalizing”. Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norms of the generalizing and memorizing solutions, as well as a claim that the representation is important.
One issue is that our picture doesn’t consider learning dynamics that seem actually important here. For example, one of the mechanisms that may explain why weight decay seems to matter so much in the Omnigrok paper is that fixing the weight norm to be large leads to an effectively tiny learning rate when you use Adam (which normalizes the gradients to a fixed scale), especially when there’s a substantial radial component (which there is, when the init is too small or too big). This probably explains both why they found that training error was high when they constrained the weights to be sufficiently large in all their non-toy cases (see e.g. the mod add landscape below), and why we had difficulty using SGD+momentum (which, given our bad initialization, led to gradients that were way too big at some parts of the model, especially since we didn’t sweep the learning rate very hard). [1]
There are also some theoretical results from SLT-related folks about how generalizing circuits achieve lower train loss per parameter (i.e. have higher circuit efficiency) than memorizing circuits (at least for large p), which seems to be a part of the puzzle that neither our work nor Omnigrok touched on: why is it that generalizing solutions have lower norm? IIRC our explanations were that weight decay “favored more distributed solutions” (somewhat false) and “it sure seems empirically true”, but we didn’t have anything better than that.
There was also the really basic idea of how a relu/gelu network may do multiplication (by piecewise-linear approximation of x^2, or by using the quadratic region of the gelu for x^2), which (I think) was first described in late 2022 in Ekin Akyürek’s “transformers can implement Sherman-Morrison for closed-form ridge regression” paper? (That’s not the name, just the headline result.)
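The construction is simple enough to sketch in a few lines of numpy (the knot placement and input range are my own arbitrary choices): stack ReLU kinks into a piecewise-linear interpolant of x^2, then get multiplication from the polarization identity x*y = ((x+y)^2 - (x-y)^2)/4.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_square(x, knots=np.linspace(-2.0, 2.0, 41)):
    """Piecewise-linear interpolation of x^2 on [-2, 2], built from ReLU kinks."""
    h = knots[1] - knots[0]
    # First segment: the secant line of x^2 between the first two knots.
    y = knots[0] ** 2 + (knots[0] + knots[1]) * (x - knots[0])
    # Each interior knot bends the slope upward by 2h (secant slopes of x^2
    # on adjacent segments differ by exactly 2h).
    for k in knots[1:-1]:
        y = y + 2 * h * relu(x - k)
    return y

def relu_multiply(x, y):
    """x*y via polarization, valid while x+y and x-y stay within [-2, 2]."""
    return (relu_square(x + y) - relu_square(x - y)) / 4.0
```

With 41 knots the interpolation error is at most h^2/4 = 0.0025, so the product comes out accurate to about a thousandth on [-1, 1].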
Part of the story for grokking in general may also be related to the Tensor Programs results that claim that, with standard init, the gradient on the embedding is too small relative to the gradient on other parts of the model. (Also, the embed at init is too small relative to the unembed.) Because the embed both starts too small and updates too slowly, there’s no real representation learning going on, just random feature regression (which overfits in the same way that regression on random features overfits absent regularization).
In our case, this turns out not to be true (because our network is tiny? because our weight decay is set aggressively at lambda=1?), since the weights that directly contribute to logits (W_E, W_U, W_O, W_V, W_in, W_out) all quickly converge to the same size (weight decay encourages spreading out weight norm between things you multiply together), while the weights that do not directly contribute all converge to zero.
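The parenthetical about weight decay spreading norm between multiplied factors is easy to check in a two-scalar toy model (the specific numbers are mine): minimizing (ab - 1)^2 + lam*(a^2 + b^2) by gradient descent drives |a| and |b| together, because the fit term cancels out of the imbalance a^2 - b^2, which then decays at rate ~4*lr*lam per step (to leading order in the step size).

```python
# Two scalar "layers" whose product should fit a target of 1, trained by
# gradient descent on (a*b - 1)^2 + lam*(a^2 + b^2).
a, b = 3.0, 0.2              # deliberately lopsided initialization
lam, lr = 0.05, 0.01
for _ in range(5000):
    err = a * b - 1.0
    grad_a = 2 * err * b + 2 * lam * a
    grad_b = 2 * err * a + 2 * lam * b
    a, b = a - lr * grad_a, b - lr * grad_b

# The factors end up the same size despite the 15x imbalance at init:
# at the optimum, |a| = |b| = sqrt(1 - lam).
```

The same cancellation argument applies factor-wise to products of matrices, which is the intuition behind the converging weight norms above.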
Bringing it back to the topic at hand: there are often a lot of “small” confusions that remain, even after doing good toy-models work. It’s not clear how much any of these confusions matter (do any of the grokking results that our paper, Ziming Liu et al., or the GDM grokking paper found actually matter?).
Haven’t checked, might do this later this week.