Connor Kissane

Karma: 374

Connor Kissane Nov 9, 2024, 2:21 AM
2 points
0
in reply to: Jatin Nainani’s comment on: SAEs are highly dataset dependent: a case study on the refusal direction
Thanks! I’m not sure. My guess is that if you go super narrow, it may be more likely to result in an inconvenient level of “feature splitting”. Since there are only a few total concepts to learn, an SAE of equivalent width might exploit its greater relative capacity to learn niche combinations of features (to reduce sparsity loss).

SAEs are highly dataset dependent: a case study on the refusal direction

Connor Kissane, robertzk, Neel Nanda and Arthur Conmy

Nov 7, 2024, 5:22 AM

66 points

4 comments14 min readLW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Oct 27, 2024, 6:46 PM

47 points

4 comments5 min readLW link

Connor Kissane Sep 30, 2024, 1:38 AM
3 points
1
in reply to: wassname’s comment on: Base LLMs refuse too
LLaMA 1 7B definitely seems to be a “pure base model”. I agree that we have less transparency into the pre-training of Gemma 2 and Qwen 1.5, and I’ll add this as a limitation, thanks!
I’ve checked that Pythia 12b deduped (pre-trained on the pile) also refuses harmful requests, although at a lower rate (13%). Here’s an example, using the following prompt template:
“”″User: {instruction}
Assistant:”″”
It’s pretty dumb though, and often just outputs nonsense. When I give it the Vicuna system prompt, it refuses 100% of harmful requests, though it has a bunch of “incompetent refusals”, similar to LLaMA 1 7B:
“”″A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.
USER: {instruction}
ASSISTANT:”″”

Base LLMs refuse too

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Sep 29, 2024, 4:04 PM

60 points

20 comments10 min readLW link

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Jul 18, 2024, 10:29 AM

67 points

0 comments10 min readLW link

Attention Output SAEs Improve Circuit Analysis

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Jun 21, 2024, 12:56 PM

33 points

3 comments19 min readLW link

Connor Kissane May 17, 2024, 8:53 AM
1 point
1
in reply to: Ali Shehper’s comment on: Sparse Autoencoders Work on Attention Layer Outputs
Thanks for the comment! We always use the pre-ReLU feature activation, which is equal to the post-ReLU activation (given that the feature is activate), and is purely linear function of z. Edited the post for clarity.

Connor Kissane Mar 31, 2024, 4:35 PM
7 points
0
on: SAE-VIS: Announcement Post
Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

Mar 6, 2024, 5:03 AM

63 points

0 comments12 min readLW link

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Feb 3, 2024, 6:50 AM

78 points

4 comments8 min readLW link

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

Jan 16, 2024, 12:26 AM

84 points

9 comments18 min readLW link

Connor Kissane Aug 14, 2023, 2:20 PM
1 point
0
on: Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo
These puzzles are great, thanks for making them!

Connor Kissane Jul 19, 2023, 7:57 PM
1 point
0
on: Causal scrubbing: results on induction heads
Code for this token filtering can be found in the appendix and the exact token list is linked.
Maybe I just missed it, but I’m not seeing this. Is the code still available?