Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found by using it as an encoder vector (just take the dot product with activations).
In terms of more serious follow-up, I’d like to understand much better what vectors are being found (e.g. by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up, and whether we can find useful applications (e.g. is this a faster / cheaper way to elicit model capabilities, such as in the context of sandbagging?).
Thank you for reading and for the suggestions. I've enumerated them for easier reference:
1. Find max activating examples
2. Understand which vectors are being found
3. Attempt to scale up
4. Find useful applications once scaled up
For 1, do you mean:
Take an input (from a bank of random example prompts)
Do a forward pass on the unsteered model
Extract the activations at the target layer
Compute the dot product between these activations and the steering vector
Use this dot product value as a measure of how strongly this example activates the behavior associated with the steering vector
Am I following correctly?
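
To make the steps above concrete, here is a rough, dependency-free sketch of how I picture the procedure. It is not code from the post. `get_activations` is a hypothetical stand-in for "run the unsteered model on a prompt and return the target-layer activations, one vector per token position", and taking the max over token positions to score a prompt is my own assumption (a mean, or the last token only, would be equally plausible choices).

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_activating_examples(
    prompts: Sequence[str],
    get_activations: Callable[[str], Sequence[Vector]],  # hypothetical helper
    steering_vector: Vector,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank prompts by how strongly they project onto the steering vector."""
    scored = []
    for prompt in prompts:
        # Target-layer activations from the unsteered model, one per token.
        activations = get_activations(prompt)
        # Assumption: score a prompt by its strongest token position.
        score = max(dot(act, steering_vector) for act in activations)
        scored.append((score, prompt))
    # Highest dot product first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy stand-in for real model activations (2-dimensional, two tokens each).
toy_activations = {
    "prompt A": [[1.0, 0.0], [0.5, 0.5]],
    "prompt B": [[0.0, 1.0], [0.0, 2.0]],
    "prompt C": [[-1.0, 0.0], [0.1, 0.1]],
}
steering_vector = [0.0, 1.0]

ranked = max_activating_examples(
    list(toy_activations),
    toy_activations.__getitem__,
    steering_vector,
    top_k=2,
)
for score, prompt in ranked:
    print(f"{score:.2f}  {prompt}")
```

With a real model, the lookup table would be replaced by a forward pass with a hook at the target layer, and the prompt bank would be much larger; the ranking logic would stay the same.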