Thank you for reading and for the suggestions. I've enumerated them for easier reference:
1. Find max activating examples
2. Understand which vectors are being found
3. Attempt to scale up
4. Find useful applications once scaled up
For 1., do you mean:
Take an input (from a bank of random example prompts)
Do a forward pass on the unsteered model
Extract the activations at the target layer
Compute the dot product between these activations and the steering vector
Use this dot product as a measure of how strongly the example activates the behavior associated with the steering vector
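
To make sure we mean the same thing, here is a rough sketch of the steps above in plain Python. It assumes the target-layer activations from the unsteered model have already been extracted (for example with a forward hook) into a dict mapping each prompt to a list of per-token activation vectors. The function names, the toy data, and the choice to take the max over token positions are my own assumptions for illustration, not something taken from the post.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("activation and steering vector dimensions differ")
    return sum(x * y for x, y in zip(a, b))


def activation_score(token_activations: Sequence[Vector],
                     steering_vector: Vector) -> float:
    """Score one prompt by its strongest token-level projection.

    Assumption: aggregate over token positions with max; a mean or
    last-token projection would be equally plausible choices.
    """
    return max(dot(act, steering_vector) for act in token_activations)


def max_activating_examples(
    activations_by_prompt: Dict[str, Sequence[Vector]],
    steering_vector: Vector,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k prompts ranked by dot product with the steering vector."""
    scored = [
        (prompt, activation_score(acts, steering_vector))
        for prompt, acts in activations_by_prompt.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-in for activations recorded at the target layer of the
# unsteered model (hypothetical values, 2-dimensional for readability).
toy_activations = {
    "prompt A": [[1.0, 0.0], [0.5, 0.5]],
    "prompt B": [[0.0, 2.0], [0.0, 1.0]],
    "prompt C": [[-1.0, 0.0], [3.0, 0.0]],
}
toy_steering_vector = [1.0, 0.0]

top = max_activating_examples(toy_activations, toy_steering_vector, top_k=2)
```

With this toy data, `top` ranks "prompt C" first and "prompt A" second, since those prompts have the largest projections onto the steering vector.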
Am I following correctly?