My basic take is that recent work (mostly looking at sample efficiency and random generalization[1] properties) doesn’t seem very useful for reducing x-risk from misalignment (though it seems net positive wrt. x-risk and probably good practice for safety research). But some as-yet-unexplored uses for top-down interpretability could be decently good for reducing misalignment x-risk.
Here’s a more detailed explanation of my views:
In some cases, activation vectors might have better sample efficiency than other approaches (e.g. prompting, normal SFT, DPO) for modifying models in the small-n regime. Better sample efficiency is probably mildly helpful for reducing misalignment x-risk, though even that isn’t clear. It’s also unclear why this work wouldn’t just happen by default and therefore needs to be pushed along. (For this use case, we’re basically using activation vectors as a different training method with randomly different inductive biases. Probably it’s slightly good to build up a suite of mildly different fine-tuning methods with different known properties?)
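To make the “different training method” framing concrete, here’s a minimal, hypothetical sketch of contrastive activation addition: compute a vector as the difference of mean activations on a handful of positive vs. negative examples, then add it to one layer’s output at inference time. The toy model, layer choice, prompts, and scaling coefficient are all placeholder assumptions, not anyone’s actual setup.

```python
# Minimal sketch of contrastive activation addition ("activation vectors")
# as a small-n alternative to fine-tuning. Toy model and inputs only.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 64
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(4)])

def get_activation(x, layer_idx):
    """Run the model and capture one intermediate layer's output."""
    acts = {}
    def hook(module, inputs, output):
        acts["h"] = output.detach()
    handle = model[layer_idx].register_forward_hook(hook)
    model(x)
    handle.remove()
    return acts["h"]

# Contrastive "prompts": small batches exhibiting vs. lacking the target
# behavior (stand-ins for real tokenized text).
pos_inputs = torch.randn(8, d_model)
neg_inputs = torch.randn(8, d_model)

layer_idx = 2
steering_vec = (get_activation(pos_inputs, layer_idx).mean(0)
                - get_activation(neg_inputs, layer_idx).mean(0))

def steer_hook(module, inputs, output):
    # Add the vector (scaled) to the layer's output at inference time.
    return output + 2.0 * steering_vec

handle = model[layer_idx].register_forward_hook(steer_hook)
steered_out = model(torch.randn(1, d_model))
handle.remove()
print(steered_out.shape)
```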
Activation vectors could be used as a tool for doing top-down interpretability (getting some understanding of the algorithm a model is implementing); this usage of activation vectors would be similar to how people use activation patching for interp. I haven’t seen any work using activation vectors like this, but it’s possible in principle, and IMO it seems as promising as, or more promising than, other interp work if done well.
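Again, I haven’t seen this done, but one hypothetical version of using an activation vector as a top-down interp probe (analogous to how activation patching is used) would be to inject the vector at each layer and see where it actually changes behavior, which gives some evidence about where the relevant concept is represented. Everything here (toy model, vector, “behavior metric”) is a placeholder assumption.

```python
# Hypothetical sketch: use an activation vector as a top-down interp probe
# by injecting it at each layer and measuring the behavioral shift.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_layers = 64, 4
model = nn.Sequential(*[nn.Linear(d_model, d_model) for _ in range(n_layers)])

# Stand-in for a steering vector derived from contrastive examples.
concept_vec = torch.randn(d_model)
concept_vec = concept_vec / concept_vec.norm()

x = torch.randn(16, d_model)          # stand-in for a batch of prompts
baseline_out = model(x)

def behavior_metric(out):
    # Placeholder for a real behavioral measure (e.g. log-prob of a target token).
    return out.mean().item()

for layer_idx in range(n_layers):
    def hook(module, inputs, output, v=concept_vec):
        return output + 4.0 * v       # inject the vector at this layer
    handle = model[layer_idx].register_forward_hook(hook)
    steered_out = model(x)
    handle.remove()
    delta = behavior_metric(steered_out) - behavior_metric(baseline_out)
    print(f"layer {layer_idx}: behavior shift {delta:+.3f}")
```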
The fact that activation vectors work tells us something interesting about how models work. The exact applications of this interesting fact are unclear, but getting a better understanding of what’s going on here seems probably net positive.
I’m not sure if Alex Turner’s[2] recent motivation for working on activation vectors is downstream of trying to reduce harms due to (unintended) misalignment of AIs; I think he’s skeptical of massive harm due to traditional misalignment concerns.
My takes were originally stated in a shortform I wrote a little while ago discussing my overall thoughts on activation vectors: short form.
By “random generalization”, I mean analyzing generalization across some distribution shift which isn’t picked for being particularly analogous to some problematic future case, and is instead just an arbitrary shift used to test whether generalization is robust or to learn more generally about generalization properties.
Alex is one of the main people discussing this work on LW AFAICT.