The [Sparse Feature Circuits] approach can be seen as analogous to LoRA (Hu et al., 2021), in that you are constraining your model’s behavior
FWIW I consider SFC and LoRA pretty different: LoRA is practical to use, but it can be reversed very easily and has poor worst-case performance. Sparse Feature Circuits, by contrast, is very expensive and either requires far more nodes in bigger models (forthcoming, I think) or requires studying only a subset of layers, but if it worked it would likely have far better worst-case performance.
This makes LoRA a good baseline for some SFC-style tasks, but the research experience using both is pretty different.
Good note, thank you.