This is the most impressive concrete achievement in alignment I’ve seen. I think this post reduces my p(doom) by around 1%, and I’m excited to see where all of the new directions uncovered lead.
Edit: I explain this view in a reply.
Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions. This is still in my all-time top 10.
What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there’s a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training.
I think that to solve alignment, we need to develop our toolbox of “getting AI systems to behave in ways we choose”. Not in the sense of being friendly or producing economic value, but in the sense of pushing towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want, e.g. GPT-4 can answer questions using only words starting with “Q”, but we don’t know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far OOD. No other work I’ve seen quite matches the promise this post has in finding ways to exert fine-grained control over a system’s internals; we now have a wide variety of concrete questions, like the following (a rough sketch of the basic steering-vector operation follows the list):
How do we find steering vectors for new behaviors, e.g. speaking French?
How do we make these techniques more robust?
What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?
Can we decompose the effect of a prompt into steering vectors from simpler prompts, thereby understanding why complex prompts work?
Are the effects of steering vectors nonlinear for small coefficients? What would that tell us about superposition?
What’s the mechanism by which adding a steering vector with too large a coefficient breaks the model?
Adding steering vectors at different layers surely means intervening at different “stages of processing”. What do the model’s internal concepts look like at different stages?
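To make “steering vector” concrete, here is a rough sketch of the basic operation these questions are about, as I understand it. This is my own illustration rather than the authors’ code, and the prompt pair, layer index, and coefficient are arbitrary choices: take the difference between residual-stream activations on a contrast pair of prompts, scale it, and add it back into the residual stream at one layer during generation.

```python
# Minimal sketch of an "activation addition" / steering vector, assuming GPT-2
# via Hugging Face transformers. Prompt pair, layer, and coefficient are
# illustrative choices, not tuned values from the post.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER, COEFF = 6, 5.0  # where to intervene, and how hard

def residual_stream(prompt: str) -> torch.Tensor:
    """Residual-stream activations entering block LAYER, shape (seq_len, d_model)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER][0]  # hidden_states[LAYER] is the input to block LAYER

# Steering vector = activation difference for a contrast pair of prompts.
act_pos, act_neg = residual_stream("Love"), residual_stream("Hate")
n = min(len(act_pos), len(act_neg))            # align token positions
steering = COEFF * (act_pos[:n] - act_neg[:n])

def add_steering(module, args):
    """Forward pre-hook: add the steering vector at the first n positions."""
    hidden_states = args[0]
    if hidden_states.shape[1] >= n:            # skip 1-token cached decode steps
        hidden_states = hidden_states.clone()
        hidden_states[:, :n] += steering.to(hidden_states.dtype)
    return (hidden_states,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steering)
ids = tokenizer("I hate you because", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True, top_p=0.9,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
handle.remove()  # restore the unmodified model
```

None of this involves gradient descent or finetuning; the vector comes from a single pair of forward passes, which is part of why the feedback loops are so fast.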
Comparing this to other work, my sense is that:
Intervening on activations is better than training (including RLHF), because it builds towards understanding systems rather than steering a black box with a black-box reward model, and for the reasons the authors give.
Debate, although important, seems less likely to be a counterfactual, robust way to steer models. The original debate agenda ran into serious problems, and neither it nor the current Bowman agenda tells us much about the internals of models.
Steering a model with activation vectors is better than mechinterp (e.g. the IOI paper), because here you’ve shown you can make the AI do a wide variety of interesting things, and mechinterp is slow.
I’m not up to date on the adversarial training literature (maybe academia has produced something more impressive), but I think this is more valuable than the Redwood paper, which didn’t have a clearly positive result. I’m glad people are working on adversarial robustness.
Steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far removed from its weights (deep deceptiveness).
It’s a judgement call whether this makes it the most impressive achievement, but I think this post is pretty clearly Pareto-optimal in a very promising direction. That said, I have a couple of reservations:
By “most impressive concrete achievement” I don’t necessarily mean the largest single advance over SOTA. There have probably been bigger advances in the past (RLHF is a candidate), and the impact of ELK is currently unproven but will shoot to the top if mechanistic anomaly detection ever pans out.
I don’t think we live in a world where you can just add a “be nice” vector to a nanotech-capable system and expect better consequences, again for deep deceptiveness-ish reasons. Therefore, we need advances in theory to convert our ability to make systems do things into true mastery of cognition.
I don’t think we should call this “algebraic value editing”, because it seems overly pretentious to say we’re editing the model’s values. We don’t even know what values are! I don’t think RLHF is editing values, in the sense that it does something different from even the weak version of instilling desires to create diamonds, and this seems even less connected to values. The only connection is that it’s modifying something contextually activated, which is far too broad.
It’s unclear that this works in a wide range of situations, or in the situations we need it to for future alignment techniques. The authors claim that cherry-picking was limited, but there are other uncertainties: when we need debaters that don’t collude to mislead the judge, will we be able to use activation patching? What if we need an AI that doesn’t self-modify to remove some alignment property?
I don’t think we should call this “algebraic value editing”, because it seems overly pretentious to say we’re editing the model’s values. We don’t even know what values are!
I phased out “algebraic value editing” for exactly that reason. Note that only the repository and prediction markets retain this name, and I’ll probably rename the repo activation_additions.
Steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far removed from its weights (deep deceptiveness).
(You linked to “deep deceptiveness”, and I’m going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn’t, then this point is still relevant for alignment, since self-deception is another internal hazard.)
I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don’t see this as a strong reason to dismiss https://arxiv.org/abs/2212.04089. I would want both a weight version of a method and an activation version; they tend to have different strengths.
Note: if you want to keep track of safety papers outside of LW/AF, papers including https://arxiv.org/abs/2212.04089 are tweeted on https://twitter.com/topofmlsafety and posted on https://www.reddit.com/r/mlsafety.
Edit: I see passive disagreement but no refutation. The argument against weights was of the form “here’s a strength activations has”; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property that outweighs all of the paper’s potential merits, or activations must be a Pareto improvement over weights. Neither seems corroborated or at all obvious.
The argument against weights was of the form “here’s a strength activations has”; for it to be enough to dismiss the paper without discussion
I personally don’t “dismiss” the task vector work. I didn’t read Thomas as dismissing it by not calling it the concrete work he is most excited about—that seems like a slightly uncharitable read?
Editing Models with Task Arithmetic explored a “dual” version of our algebraic technique. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors. While our technique modifies activations, the techniques seem complementary, and both useful for alignment.
I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added):
I’m highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic “superficial stylistic edits” to optimistic “easy activation/deactivation of the model’s priorities at inference time.” In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren’t accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors.
I don’t know what world we’re in yet.
Note that task vectors require finetuning. From the newly updated related work section:
Lastly, Editing Models with Task Arithmetic explored a “dual” version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.
Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight-based methods, but these seem uncommon.
I thought briefly about the Ilharco et al. paper and am very impressed by it as well.
Thanks for linking to the resources.
I don’t have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
Weight vectors are derived through finetuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (e.g. via RLHF) when writing why you were excited about activation additions, I don’t see how this paper changes the balance very much? (I wrote up my thoughts in Activation additions have advantages over (RL/supervised) finetuning.)
I think the main additional piece of information given by the paper is that finetuned edits compose, unlocking a range of finetuning configurations which grows exponentially with the number of composable edits. But I noted in the original version of the post that finetuning enjoys this benefit.
There’s another strength which I hadn’t mentioned in my writing: if you can finetune in the opposite direction of the intended behavior (say, somehow make a model less honest), and then subtract that task vector, you can maybe increase honesty, even if you couldn’t just naively finetune that honesty into the model.[1]
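To spell out the arithmetic I have in mind, here is a schematic sketch of weight-space task arithmetic, written from the paper’s description rather than its code; the helper names and the honesty example are hypothetical.

```python
# Schematic sketch of weight-space task arithmetic (in the style of Ilharco et
# al., https://arxiv.org/abs/2212.04089); helper names are hypothetical.
import copy
import torch
from torch import nn

@torch.no_grad()
def task_vector(pretrained: nn.Module, finetuned: nn.Module) -> dict:
    """tau = theta_finetuned - theta_pretrained, per floating-point parameter."""
    pre, fin = pretrained.state_dict(), finetuned.state_dict()
    return {k: fin[k] - pre[k] for k in pre if pre[k].is_floating_point()}

@torch.no_grad()
def apply_task_vectors(pretrained: nn.Module, taus: list, coeffs: list) -> nn.Module:
    """theta_new = theta_pretrained + sum_i coeff_i * tau_i.

    A negative coefficient subtracts a task vector, e.g. subtracting a
    "finetuned to be less honest" vector in the hope of increasing honesty.
    """
    new_state = {k: v.detach().clone() for k, v in pretrained.state_dict().items()}
    for tau, c in zip(taus, coeffs):
        for name, delta in tau.items():
            new_state[name] += c * delta
    edited = copy.deepcopy(pretrained)
    edited.load_state_dict(new_state)
    return edited

# Hypothetical usage: negate a "less honest" finetune to push the other way.
# tau_dishonest = task_vector(base_model, model_finetuned_to_be_less_honest)
# more_honest_model = apply_task_vectors(base_model, [tau_dishonest], [-1.0])
```

The point of the sketch is just that everything happens in weight space and presupposes finetuned checkpoints, which is the contrast with adding vectors to activations at inference time.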
But, in a sense, task vectors are “still in the same modalities we’re used to.” Activation additions jolted me because they’re just… a new way[2] of interacting with models! There’s been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.
[1] This is a kinda sloppy example because “honesty” probably isn’t a primitive property of the network’s reasoning. Sorry.
To be very clear about the novelty of our contributions, I’ll quote the “Summary of relationship to prior work” section:
We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using machine optimization (e.g. SGD) to find the vectors. Among other benefits, our “activation addition” methodology enables much faster feedback loops than optimization-based activation vector approaches.
But this “activation engineering” modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them, from 2022 and 2023.
If I’m understanding the implications of this properly, this is quite a bit better than RLHF, at least assuming we can get it to scale in a meaningful way. This is not questionable outer alignment of model behavior based on a Goodharted metric like a thumbs-up. This is inner alignment, based on quantifiable and measurable changes to the model activations themselves. That’s a way more explainable, robust, testable approach than RLHF, right?