Does this mean you added the activation additions once, to the output of the previous layer (and therefore into the residual stream)? My first interpretation was that you added them repeatedly to the output of every subsequent block, which seems unlikely.
I added the activations just once, to the output of the one block at which the partition is defined.
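For concreteness, here is a minimal sketch of what "adding once at a single block's output" could look like with a Hugging Face GPT-2 model and a forward hook. This is not the exact code used; the layer index and `steering_vector` are illustrative placeholders.

```python
# Sketch (not the exact code used): add a steering vector to the residual
# stream at the output of ONE transformer block, via a forward hook.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer_idx = 6  # the single block at which the partition is defined (placeholder)
steering_vector = torch.randn(model.config.n_embd)  # placeholder vector

def add_once(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the residual-stream hidden state.
    hidden = output[0] + steering_vector.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

# Hook only this one block; all later blocks are left untouched.
handle = model.transformer.h[layer_idx].register_forward_hook(add_once)
ids = tokenizer("The weather today is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(out[0]))
```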
Also, could you explain the intuition / reasoning behind why you only applied activation additions on encoders instead of decoders? Given that GPT-4 and GPT-2-XL are decoder-only models, I expect that testing activation additions on decoder layers would have been more relevant.
Yes, that’s a good point. I should run some tests on a decoder-only model. I chose FLAN-T5 for its ease of instruction fine-tuning and to test on a different architecture.
In FLAN-T5, adding activations in the decoder worked much more poorly and often led to grammatical errors. I think this is because, in a text-to-text encoder-decoder transformer, the encoder is responsible for "understanding" and representing the input, while the decoder generates the output based on that representation. By mixing concepts at the encoder level, the model integrates the additional activations earlier in the process, whereas if we start mixing at the decoder level, the decoder receives a confusing representation of the data.
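To show the difference between the two hook placements, here is a sketch assuming the Hugging Face `google/flan-t5-base` checkpoint; the layer index, prompt, and vector are placeholders, not the settings from my experiments.

```python
# Sketch: compare encoder-side vs decoder-side activation addition in FLAN-T5.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

layer_idx = 4
steering_vector = torch.randn(model.config.d_model)  # placeholder concept vector

def add_vector(module, inputs, output):
    # T5Block also returns a tuple with the hidden states first.
    hidden = output[0] + steering_vector.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

# Encoder-side addition: the vector is folded into the input representation
# before the decoder ever sees it (the variant that worked well for me).
handle = model.encoder.block[layer_idx].register_forward_hook(add_vector)

# Decoder-side addition (the variant that worked poorly): hook the decoder instead.
# handle = model.decoder.block[layer_idx].register_forward_hook(add_vector)

ids = tokenizer("Write a sentence about the ocean.", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```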
I suspect that decoder-only models will be more robust and flexible at integrating additional activations, since they don’t rely on a separate encoder to process the input data.