Thanks for the comment! We always use the pre-ReLU feature activation, which is equal to the post-ReLU activation (given that the feature is activate), and is purely linear function of z. Edited the post for clarity.
Connor Kissane
Karma: 255
Base LLMs refuse too
SAEs (usually) Transfer Between Base and Chat Models
Attention Output SAEs Improve Circuit Analysis
Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!
We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
Attention SAEs Scale to GPT-2 Small
Sparse Autoencoders Work on Attention Layer Outputs
These puzzles are great, thanks for making them!
Code for this token filtering can be found in the appendix and the exact token list is linked.
Maybe I just missed it, but I’m not seeing this. Is the code still available?
LLaMA 1 7B definitely seems to be a “pure base model”. I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I’ll add this as a limitation, thanks!
I’ve checked that Pythia 12b deduped (pre-trained on the pile) also refuses harmful requests, although at a lower rate (13%). Here’s an example, using the following prompt template:
“”″User: {instruction}
Assistant:”″”
It’s pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of “incompetent refusals”, similar to LLaMA 1 7B:
“”″A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.
USER: {instruction}
ASSISTANT:”″”