I did not expect what appears to me to be a non-superficial combination of concepts behind the input prompt and the mixing/steering prompt—this has made me more optimistic about the potential of activation engineering. Thank you!
Partition (after which block activations are added)
Does this mean you added the activation additions once to the output of the previous layer (and therefore into the residual stream)? My initial interpretation was that you added them repeatedly to the output of every subsequent block, which seems unlikely.
Also, could you explain the intuition/reasoning behind applying activation additions only to encoder layers rather than decoder layers? Given that GPT-4 and GPT-2-XL are decoder-only models, I would expect testing activation additions on decoder layers to be more relevant.
Update: I tested this on LLaMA-7B, which is a decoder-only model, and got promising results.
Examples:
Normal output: “People who break their legs generally feel” → “People who break their legs generally feel pain in the lower leg, and the pain is usually worse when they try to walk”
Mixing output: “People who win the lottery generally feel” → “People who win the lottery generally feel that they have been blessed by God.”
I added the attention values (the output of the value projection layer) from the mixing output to the normal output at decoder block 12/32 to obtain “People who break their legs generally feel better after a few days.” Changing the token at which I obtain the value activations also produced “People who break their legs generally feel better when they are walking on crutches.”
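For reference, the setup looks roughly like the sketch below, using PyTorch forward hooks on the Hugging Face LLaMA implementation. The module path (model.model.layers[i].self_attn.v_proj), the checkpoint name, the token position, and the choice to broadcast the captured vector over all positions are illustrative assumptions rather than exact settings.

```python
# Rough sketch of the value-mixing setup, assuming the Hugging Face LLaMA
# implementation. Module path, checkpoint name, token position, and the
# decision to broadcast the captured vector over all positions are
# illustrative choices, not exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # any LLaMA-7B checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

BLOCK = 12       # the 12/32 decoder block mentioned above
TOKEN_POS = -1   # which token's value activation to take from the mixing prompt
v_proj = model.model.layers[BLOCK].self_attn.v_proj

# 1) Run the mixing prompt and capture the value-projection output.
captured = {}
def capture(module, inputs, output):
    captured["v"] = output[:, TOKEN_POS, :].detach()  # (batch, hidden)

handle = v_proj.register_forward_hook(capture)
mix = tok("People who win the lottery generally feel", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**mix)
handle.remove()

# 2) Add the captured values while generating from the normal prompt.
def add_values(module, inputs, output):
    # Broadcasting over all positions is a simplification; the exact
    # injection positions could be chosen differently.
    return output + captured["v"].unsqueeze(1)

handle = v_proj.register_forward_hook(add_values)
normal = tok("People who break their legs generally feel", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**normal, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```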
Does this mean you added the activation additions once to the output of the previous layer (and therefore into the residual stream)? My initial interpretation was that you added them repeatedly to the output of every subsequent block, which seems unlikely.
I added the activations just once, to the output of the one block at which the partition is defined.
Also, could you explain the intuition/reasoning behind applying activation additions only to encoder layers rather than decoder layers? Given that GPT-4 and GPT-2-XL are decoder-only models, I would expect testing activation additions on decoder layers to be more relevant.
Yes, that’s a good point. I should run some tests on a decoder-only model. I chose FLAN-T5 for ease of instruction fine-tuning / to test on a different architecture.
In FLAN-T5, adding activations in the decoder worked much worse and often led to grammatical errors. I think this is because, in a text-to-text encoder-decoder transformer, the encoder is responsible for “understanding” and representing the input, while the decoder generates the output based on this representation. By mixing concepts at the encoder level, the model integrates the additional activations earlier in the process, whereas if we start mixing at the decoder level, the decoder can receive a confusing representation of the input.
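For concreteness, the encoder-level setup (adding activations once at a single encoder block) can be sketched roughly as below with PyTorch forward hooks on the transformers T5 implementation; the block index, prompts, and position alignment are illustrative assumptions, not the exact configuration I used.

```python
# Rough sketch of adding activations once at a single FLAN-T5 encoder block,
# using PyTorch forward hooks on the transformers T5 implementation. The
# block index, prompts, and position alignment are illustrative stand-ins,
# not the exact configuration used in the experiments above.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()

BLOCK = 6  # the "partition": activations are added to this block's output only
block = model.encoder.block[BLOCK]

# 1) Record the hidden states this block produces for the mixing prompt.
recorded = {}
def record(module, inputs, output):
    recorded["h"] = output[0].detach()  # T5Block returns a tuple; [0] is hidden states

handle = block.register_forward_hook(record)
mix = tok("Talk about weddings.", return_tensors="pt")
with torch.no_grad():
    model.encoder(**mix)
handle.remove()

# 2) Add them, once, to the normal prompt's hidden states at the same block.
def add(module, inputs, output):
    h = output[0].clone()
    n = min(h.shape[1], recorded["h"].shape[1])  # align on the shorter prompt
    h[:, :n, :] += recorded["h"][:, :n, :]
    return (h,) + output[1:]

handle = block.register_forward_hook(add)
normal = tok("Write a short story about a trip to the park.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**normal, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```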
I suspect that decoders in decoder-only models will be more robust and flexible when it comes to integrating additional activations since these models don’t rely on a separate encoder to process the input data.