Joseph Bloom comments on ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct

Joseph Bloom 5 Oct 2024 20:10 UTC
8 points
0
Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found by using it as an encoder vector (just take its dot product with the activations).

In terms of more serious follow-up, I’d like to understand much better which vectors are being found (e.g. by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up, and whether we can find useful applications (e.g. is this a faster / cheaper way to elicit model capabilities, such as in the context of sandbagging?).
- submarat 11 Oct 2024 21:18 UTC
  1 point
  0
  Thank you for reading and for the suggestions. I've enumerated them for easier reference:
  1. Find max activating examples
  2. Understand which vectors are being found
  3. Attempt to scale up
  4. Finding useful applications once scaled up
  
  For 1. do you mean:
  1. Take an input (from a bank of random example prompts)
  2. Do a forward pass on the unsteered model
  3. Extract the activations at the target layer
  4. Compute the dot product between these activations and the steering vector
  5. Use this dot product value as a measure of how strongly this example activates the behavior associated with the steering vector
  Am I following correctly?
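
The five steps in the reply above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the target-layer activations of the unsteered model have already been extracted (for example with a forward hook) into one array of shape `(seq_len, d_model)` per prompt. The names `score_examples` and `top_k_examples` are hypothetical, the toy data is random, and two choices are my own assumptions rather than anything stated in the thread: normalizing the steering vector, and scoring each prompt by its maximum dot product over token positions.

```python
import numpy as np


def score_examples(activations, steering_vector, normalize=True):
    """Score each prompt by its dot product with the steering vector.

    activations: list of arrays, each (seq_len_i, d_model), taken from the
        target layer of the UNSTEERED model (steps 1-3 above).
    Returns a list of (max_score, argmax_token_position), one per prompt.
    """
    v = np.asarray(steering_vector, dtype=float)
    if normalize:
        # Assumption: use a unit vector so scores are comparable across vectors.
        v = v / np.linalg.norm(v)
    scores = []
    for acts in activations:
        per_token = acts @ v  # step 4: dot product at every token position
        # Assumption: summarize a prompt by its most strongly activating token.
        scores.append((float(per_token.max()), int(per_token.argmax())))
    return scores


def top_k_examples(prompts, activations, steering_vector, k=3):
    """Step 5: rank prompts by score and return the k strongest."""
    scores = score_examples(activations, steering_vector)
    order = sorted(range(len(prompts)), key=lambda i: scores[i][0], reverse=True)
    return [(prompts[i], scores[i][0], scores[i][1]) for i in order[:k]]


# Toy stand-in data: random "activations" instead of a real model.
rng = np.random.default_rng(0)
d_model = 16
steering_vector = rng.normal(size=d_model)
prompts = [f"prompt {i}" for i in range(5)]
activations = [rng.normal(size=(4, d_model)) for _ in prompts]

# Plant a strong component along the steering direction in prompt 2, token 1,
# so that the ranking has a known answer.
activations[2][1] += 10.0 * steering_vector / np.linalg.norm(steering_vector)

top = top_k_examples(prompts, activations, steering_vector, k=2)
for prompt, score, pos in top:
    print(f"{prompt}: score={score:.2f} at token {pos}")
```

With a real model, the toy arrays would be replaced by residual-stream activations collected over a bank of prompts, and the top-ranked prompts (and token positions) would be read as the max activating examples for that steering vector.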