Really cool work! Some general thoughts from training SAEs on LLMs that might not carry over:
L0 vs reconstruction
Your variance-explained metric only makes sense at a given sparsity (ie L0). For example, you can get near-perfect variance explained if you set your sparsity penalty to 0, but then the SAE is basically an identity function & the features aren’t meaningful.
In my initial experiments, we swept over various L0s & checked which ones looked most monosemantic (ie had the same meaning across all activations) when sampling 30 features. We found 20-100 to be a good L0 range for LLMs with d_model=512. I’m curious how this translates to images.
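For concreteness, here’s roughly how I’d compute both numbers together — just a minimal sketch assuming a generic SAE with encode/decode methods and a batch of activations, none of these names are from your code:

```python
import torch

@torch.no_grad()
def l0_and_variance_explained(sae, acts):
    # acts: (batch, d_model) activations the SAE was trained on
    feats = sae.encode(acts)               # (batch, n_features), mostly zeros
    recon = sae.decode(feats)              # (batch, d_model)

    # L0 = average number of non-zero features per example
    l0 = (feats != 0).float().sum(dim=-1).mean().item()

    # Fraction of variance explained = 1 - (residual variance / total variance)
    resid = (acts - recon).pow(2).sum()
    total = (acts - acts.mean(dim=0)).pow(2).sum()
    return l0, (1.0 - resid / total).item()
```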
Dead Features
I believe you can use Leo Gao’s topk w/ tied-initialization scheme to (1) tightly control your L0 & (2) get fewer dead features w/o doing ghost grads. Gao et al. note that this tied init (ie setting the encoder to the decoder transposed at initialization) led to few dead features for small models, and your d_model of ~1k is on the small side.
Nora has an implementation here (though you’d need to integrate w/ your vision models)
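In case it’s useful, here’s a rough PyTorch sketch of the topk + tied-init idea — my own toy version, not Nora’s implementation, and the names/shapes are placeholders:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        # Tied init: encoder starts as the decoder transposed (per Gao et al.)
        self.W_enc = nn.Parameter(self.W_dec.data.T.clone())
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # Keep only the k largest pre-activations per example, so L0 <= k by construction
        topk = torch.topk(pre, self.k, dim=-1)
        feats = torch.zeros_like(pre)
        feats.scatter_(-1, topk.indices, torch.relu(topk.values))
        return feats

    def decode(self, feats):
        return feats @ self.W_dec + self.b_dec

    def forward(self, x):
        return self.decode(self.encode(x))
```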
Icon Explanation
I didn’t really understand your icon image at first… Oh, I get it. The template is the far left, & the other three are generations from that template with three different features each clamped to a large value. Cool idea! (Maybe separate the 3 features from the template, or use a 1x3 matrix-table, for clarity?)
Other Interp Ideas
Feature ablation—Take the top-activating images for a feature, ablate that feature (reconstruct w/o it & do the same residual add-in trick you found useful), and look at the resulting generation.
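Something like this, assuming the reconstruction-error add-in you described (function names here are made up):

```python
import torch

@torch.no_grad()
def ablate_feature(sae, acts, feature_idx):
    feats = sae.encode(acts)
    error = acts - sae.decode(feats)   # keep the SAE's reconstruction error ("residual add-in")
    feats[..., feature_idx] = 0.0      # zero out the one feature
    return sae.decode(feats) + error   # substitute this back into the model & regenerate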
Relevant Inputs—Which pixels (or patches) are causally responsible for activating this feature? There’s got to be existing literature on input attribution for the output class in image models; you’d just apply that to your features instead.
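The simplest version I can think of is a vanilla gradient saliency map on the feature activation, treating the feature like the output class — again just a sketch, and `model_to_acts` (image → the activations your SAE reads) is an assumed helper:

```python
import torch

def feature_saliency(model_to_acts, sae, image, feature_idx):
    # image: (C, H, W); model_to_acts maps it to the activations the SAE is trained on
    image = image.clone().requires_grad_(True)
    acts = model_to_acts(image)
    feat_act = sae.encode(acts)[..., feature_idx].sum()  # total activation of the feature
    feat_act.backward()
    # Per-pixel saliency: max gradient magnitude over channels
    return image.grad.abs().amax(dim=0)
```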
For instance, the subject of one photo could be transferred to another. We could adjust the time of day, and the quantity of the subject. We could add entirely new features to images to sculpt and finely control them. We could pick two photos that had a semantic difference, and precisely transfer over the difference by transferring the features. We could also stack hundreds of edits together.
This really sounds amazing. Did you patch the features from one image to another specifically? Details on how you transferred a subject from one to the other would be appreciated.
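To be concrete about what I mean by patching: copy one feature’s activation from a source image’s SAE features into a target image’s, then decode and add back the target’s reconstruction error — purely illustrative names below, not your pipeline:

```python
import torch

@torch.no_grad()
def patch_feature(sae, acts_source, acts_target, feature_idx):
    feats_src = sae.encode(acts_source)
    feats_tgt = sae.encode(acts_target)
    error = acts_target - sae.decode(feats_tgt)                  # target's reconstruction error
    feats_tgt[..., feature_idx] = feats_src[..., feature_idx]    # copy the feature's value over
    return sae.decode(feats_tgt) + error                         # feed back in & regenerate
```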
These are all great ideas, thanks Logan! Investigating different values of L0 seems especially promising.