Great work!
Did you ever run just the L0-approx and the sparsity-frequency penalty separately? It’s unclear whether you’re getting better results because the L0 function is better or because there are fewer dead features.
Also, a feature frequency of 0.2 is very large! Activating on 1 in 5 tokens is a lot even for a positional feature (since your context length is 128). It’d be bad if the improved results are because polysemanticity is sneaking back in through these activations. Sampling datapoints across a range of activations should show where the meaning becomes polysemantic. Is it the bottom 10% of activations (or the bottom 10% of the max activation, which is my preferred method)?
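In case it’s useful, here’s a minimal sketch of the sampling I mean (names are mine, not from your codebase; assumes you have a 1-D array of one feature’s per-token activations and the matching token contexts):

```python
import numpy as np

def sample_by_activation_bin(acts, contexts, n_bins=10, per_bin=3, seed=0):
    """Bin a feature's nonzero activations by fraction of its max activation,
    then sample a few (activation, context) examples per bin, so you can
    eyeball where the feature's meaning becomes polysemantic."""
    rng = np.random.default_rng(seed)
    acts = np.asarray(acts, dtype=float)
    nz = np.flatnonzero(acts > 0)        # tokens where the feature fires
    frac = acts[nz] / acts.max()         # activation as a fraction of the max
    samples = {}
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # closed upper bound on the top bin so the max itself is included
        mask = (frac >= lo) & ((frac < hi) if b < n_bins - 1 else (frac <= hi))
        idx = nz[mask]
        if len(idx) == 0:
            samples[(lo, hi)] = []
            continue
        pick = rng.choice(idx, size=min(per_bin, len(idx)), replace=False)
        samples[(lo, hi)] = [(acts[i], contexts[i]) for i in pick]
    return samples
```

If the low bins read as unrelated concepts while the top bins are monosemantic, that’s the polysemanticity-sneaking-back-in failure mode.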
What a cool paper! Congrats!:)
What’s cool:
1. e2e SAEs learn very different features with every seed. I’m glad y’all checked! This seems bad.
2. e2e SAEs have worse intermediate reconstruction loss than local. I would’ve predicted the opposite actually.
3. e2e+downstream seems to get all the benefits of the e2e one (same perf at lower L0) at the same compute cost, w/o the “intermediate activations aren’t similar” problem.
It looks like you’ve left post-training SAE_local on KL or downstream loss for future work, but that’s a very interesting part! Specifically, how closely it approximates SAE_e2e+downstream as a function of the number of training tokens.
Did y’all try ablations on SAE_e2e+downstream? For example, training only on the next layer’s reconstruction loss, or on the next N layers’ reconstruction loss?
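FWIW, roughly what I have in mind for the next-N-layers variant (purely an illustrative sketch, not your actual training setup; the blocks here are plain callables standing in for transformer blocks on the residual stream):

```python
import numpy as np

def downstream_recon_loss(clean_acts, recon_acts, blocks, n_layers):
    """Propagate both the clean activations and the SAE reconstruction
    through the next n_layers blocks, accumulating MSE between the two
    streams at each layer's output. n_layers=1 gives the next-layer-only
    ablation; larger n_layers interpolates toward the full downstream loss."""
    clean = np.asarray(clean_acts, dtype=float)
    recon = np.asarray(recon_acts, dtype=float)
    loss = 0.0
    for block in blocks[:n_layers]:
        clean, recon = block(clean), block(recon)
        loss += np.mean((recon - clean) ** 2)
    return loss
```

Sweeping n_layers would show how many downstream layers you actually need before the e2e+downstream benefits saturate.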