We could instead pre-commit to not engage with any nuker’s future posts/comments (and at worst comment to encourage others to not engage) until end-of-year.
Or only include nit-picking comments.
Could you dig into why you think it’s great interp work?
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
This paragraph sounded like you’re claiming LLMs do have concepts, just not localized in specific activations or weights but distributed across them instead.
But from your comment, you mean that LLMs themselves don’t learn the true simple-compressed features of reality, but a mere shadow of them.
This interpretation also matches the title better!
But are you saying the “true features” are in the dataset + network? Because SAEs are trained on a dataset! (ignoring the problem pointed out in footnote 1).
Possibly clustering the data points by their network gradients would be a way to put some order into this mess?
Eric Michaud did cluster datapoints by their gradients here. From the abstract:
...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta).
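For anyone curious what that looks like mechanically, here’s a minimal sketch of the general idea (not the paper’s exact method; model, loss_fn, and dataset are placeholder names): compute a per-example gradient for each datapoint and cluster datapoints whose gradients point in similar directions.

import torch
from sklearn.cluster import SpectralClustering

def per_example_gradient(model, loss_fn, x, y):
    # flatten the gradient of this single example's loss into one vector
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def cluster_by_gradients(model, loss_fn, dataset, n_clusters=10):
    grads = torch.stack([per_example_gradient(model, loss_fn, x, y) for x, y in dataset])
    grads = torch.nn.functional.normalize(grads, dim=-1)
    affinity = (grads @ grads.T).clamp(min=0).numpy()  # cosine similarity, clipped at 0
    return SpectralClustering(n_clusters=n_clusters, affinity="precomputed").fit_predict(affinity)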
The one we checked last year was just Pythia-70M, which I don’t expect to have a gender feature that generalizes to both pronouns and anisogamy.
But again, the task is next-token prediction. Do you expect e.g. GPT 4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?
Sparse autoencoders find features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.
(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely explain that prediction error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of “finding features that explain next-word prediction”, which LLMs are directly trained for, SAEs find good examples![1]
I’m unsure what goal you have in mind for “features that correspond to reality”, or what that’d mean.
Not claiming that all SAE latents are good in this way though.
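To make that pragmatic claim concrete, here’s a sketch of the kind of check I mean (the SAE here is just a randomly initialized stand-in, and the hook point, latent index, and prompt are all hypothetical):

import torch
from transformer_lens import HookedTransformer

class TinySAE(torch.nn.Module):
    # stand-in for a trained SAE; swap in a real one
    def __init__(self, d_model=768, d_sae=24576):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_sae)
        self.dec = torch.nn.Linear(d_sae, d_model)
    def encode(self, x): return torch.relu(self.enc(x))
    def decode(self, f): return self.dec(f)

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
sae = TinySAE()
GENDER_IDX = 1234  # hypothetical "gender" latent index
prompt = "My aunt said that"

def splice_sae(acts, hook, ablate_idx=None):
    # replace the residual stream with the SAE reconstruction, optionally zeroing one latent
    feats = sae.encode(acts)
    if ablate_idx is not None:
        feats[..., ablate_idx] = 0.0
    return sae.decode(feats)

def pronoun_probs(ablate_idx=None):
    with torch.no_grad():
        logits = model.run_with_hooks(
            prompt,
            fwd_hooks=[("blocks.6.hook_resid_post",
                        lambda acts, hook: splice_sae(acts, hook, ablate_idx))],
        )
    probs = logits[0, -1].softmax(-1)
    return {tok: probs[model.to_single_token(tok)].item() for tok in [" she", " he"]}

print(pronoun_probs())            # with the SAE reconstruction spliced in
print(pronoun_probs(GENDER_IDX))  # with the "gender" latent ablated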
Is there code available for this?
I’m mainly interested in the loss function. Specifically, from footnote 4:
We also need to add a term to capture the interaction effect between the key-features and the query-transcoder bias, but we omit this for simplicity
I’m unsure how this is implemented or the motivation.
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq, why would MLPs compute linear transformations?
Because two linear transformations can be combined into one linear transformation, why wouldn’t downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?
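To spell out the composition point (toy example, not from the post):

import torch

# two stacked linear maps with no nonlinearity collapse into one linear map,
# so a downstream layer could just learn the composed weights directly
W1 = torch.randn(16, 16)
W2 = torch.randn(16, 16)
x = torch.randn(16)
combined = W2 @ W1
assert torch.allclose(W2 @ (W1 @ x), combined @ x, atol=1e-4)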
What is the activation name for the resid SAEs? hook_resid_post or hook_resid_pre?
I found https://github.com/ApolloResearch/e2e_sae/blob/main/e2e_sae/scripts/train_tlens_saes/run_train_tlens_saes.py#L220 to suggest _post, but downloading the SAETransformer from wandb shows:
(saes): ModuleDict(
  (blocks-6-hook_resid_pre): SAE(
    (encoder): Sequential( (0): ...
which suggests _pre.
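For what it’s worth, the two names only differ by an index shift, which is quick to check in TransformerLens (sketch, unrelated to the e2e_sae repo itself):

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
_, cache = model.run_with_cache("hello world")
# resid_post of block i is the same activation as resid_pre of block i+1,
# so a blocks-6-hook_resid_pre SAE sits on the same stream as blocks.5.hook_resid_post
print(torch.allclose(cache["blocks.5.hook_resid_post"], cache["blocks.6.hook_resid_pre"]))  # True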
3. Those who are more able to comprehend and use these models are therefore of a higher agency/utility and higher moral priority than those who cannot. [emphasis mine]
This (along with saying “dignity” implies “moral worth” in the Death w/ Dignity post) is confusing to me. Could you give a specific example of how you’d treat someone differently based on whether they have more or less moral worth (e.g. give them more money, attention, life-saving help, etc.)?
One thing I could understand from your Death w/ Dignity excerpt is that he’s definitely implying a metric that scores everyone, and some people will score higher on this metric than others. It’s also common to want to score high on these metrics, or to feel emotionally bad if you don’t (see my post for more). This could even have utility, like having more “dignity” getting you a thumbs up from Yudkowsky or getting your words listened to more in this community. Is this close to what you mean at all?
I was a little confused on this section. Is this saying that humans’ goals and options (including options that come to mind) change depending on the environment, so rational choice theory doesn’t apply?
I believe the thesis here is that game theory doesn’t really apply in real life, that there are usually extra constraints or freedoms in real situations that change the payoffs.
I do think this criticism is already handled by trying to “actually win” and “trying to try”; though I’ve personally benefitted specifically from “trying to try” and from David Chapman’s meta-rationality post.
The idea of deference (and when to defer) isn’t novel (which is fine! Novelty is just another metric I’m bringing up; not everything one writes needs to be novel). It’s still useful to apply Bayes’ theorem to deference. Specifically, if there’s evidence that would convince you to trust someone, there must also be evidence that would convince you not to trust them.
This is currently all I have time for; however, my current understanding is that there is a common interpretation of Yudkowsky’s writings/The Sequences/LW/etc that leads to an over-reliance on formal systems that will inevitably fail people. I think you had this interpretation (do correct me if I’m wrong!), and this is your “attempt to renegotiate rationalism”.
There is the common response of “if you re-read the Sequences, you’ll see how they actually handle all the flaws you mentioned”; however, it’s still at least a failure in communication that many people consistently misinterpret them.
Glad to hear you’re synthesizing and doing pretty good now:)
I think copy-pasting the whole thing will make it more likely to be read! I enjoyed it and will hopefully leave a more substantial comment later.
I’ve really enjoyed these posts; thanks for cross posting!
Kind of confused on why the KL-only e2e SAEs have worse CE than e2e+downstream across dictionary sizes:
This is true for layers 2 & 6. I’m unsure if this means that training for KL directly is harder/unstable, and the intermediate MSE is a useful prior, or if this is a difference in KL vs CE (ie the e2e does in fact do better on KL but worse on CE than e2e+downstream).
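For concreteness, here’s the distinction I mean between the two metrics (a sketch; the function names and exact normalization are mine, and logits_orig / logits_spliced / tokens are assumed to come from running the model without and with the SAE spliced in):

import torch.nn.functional as F

def kl_to_original(logits_orig, logits_spliced):
    # KL(original || spliced): how far the SAE-spliced model's distribution
    # is from the original model's distribution
    return F.kl_div(logits_spliced.log_softmax(-1),
                    logits_orig.log_softmax(-1),
                    log_target=True, reduction="batchmean")

def ce_to_data(logits_spliced, tokens):
    # cross-entropy of the spliced model against the actual next tokens
    return F.cross_entropy(logits_spliced[:, :-1].flatten(0, 1),
                           tokens[:, 1:].flatten())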
I finally checked!
Here is the Jaccard similarity (ie similarity of input-token activations) across seeds
The e2e ones do indeed have a much lower Jaccard sim (there normally is a spike at 1.0, but this goes away when you remove features that only activate <10 times).
I also (mostly) replicated the decoder similarity chart:
And calculated the encoder sim:
[I, again, needed to remove dead features (< 10 activations) to get the graphs here.]
So yes, I believe the original paper’s claim that e2e SAEs learn quite different features across seeds is substantiated.
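For reference, a sketch of one way to compute this (acts_a / acts_b are the two seeds’ feature activations over the same eval tokens; thresholds are illustrative):

import torch

def firing_sets(acts, min_count=10):
    # acts: [n_tokens, n_features]; drop features that fire on < min_count tokens
    fires = acts > 0
    return fires[:, fires.sum(0) >= min_count]

def max_jaccard(acts_a, acts_b):
    a, b = firing_sets(acts_a).float(), firing_sets(acts_b).float()
    inter = a.T @ b                                   # [n_a, n_b] co-firing counts
    union = a.sum(0)[:, None] + b.sum(0)[None, :] - inter
    jaccard = inter / union.clamp(min=1)
    return jaccard.max(dim=1).values                  # best-matching seed-B feature for each seed-A feature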
And here’s the code to convert it to NNsight (thanks Caden for writing this a while ago!)
import torch
from transformers import GPT2LMHeadModel
from transformer_lens import HookedTransformer
from nnsight.models.UnifiedTransformer import UnifiedTransformer

model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN").to("cpu")

# Undo my hacky LayerNorm removal
for block in model.transformer.h:
    block.ln_1.weight.data = block.ln_1.weight.data / 1e6
    block.ln_1.eps = 1e-5
    block.ln_2.weight.data = block.ln_2.weight.data / 1e6
    block.ln_2.eps = 1e-5
model.transformer.ln_f.weight.data = model.transformer.ln_f.weight.data / 1e6
model.transformer.ln_f.eps = 1e-5

# Properly replace LayerNorms by Identities
def removeLN(transformer_lens_model):
    for i in range(len(transformer_lens_model.blocks)):
        transformer_lens_model.blocks[i].ln1 = torch.nn.Identity()
        transformer_lens_model.blocks[i].ln2 = torch.nn.Identity()
    transformer_lens_model.ln_final = torch.nn.Identity()

hooked_model = HookedTransformer.from_pretrained("gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
removeLN(hooked_model)

model_nnsight = UnifiedTransformer(model="gpt2", hf_model=model, fold_ln=True, center_unembed=False).to("cpu")
removeLN(model_nnsight)

# Everything above is pinned to CPU, so keep the prompt on CPU too
prompt = torch.tensor([1, 2, 3, 4], device="cpu")
logits = hooked_model(prompt)

# Check that the TransformerLens and NNsight versions give the same logits
with torch.no_grad(), model_nnsight.trace(prompt) as runner:
    logits2 = model_nnsight.unembed.output.save()

logits, cache = hooked_model.run_with_cache(prompt)
torch.allclose(logits, logits2)
Maybe this should be like Anthropic’s shared decoder bias? Essentially subtract off the per-token bias at the beginning, let the SAE reconstruct this “residual”, then add the per-token bias back to the reconstructed x.
The motivation is that the SAE has a weird job in this case. It sees x, but needs to reconstruct x minus the per-token bias, which means it needs to somehow learn what that per-token bias is during training.
However, if you just subtract it first, then the SAE sees x’ (x minus the per-token bias), and just needs to reconstruct x’.
So I’m just suggesting changing what the SAE reconstructs here (feed it x minus the per-token bias and add that bias back to its output), with the rest of the setup remaining the same. A sketch below:
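(Names and the lookup table are illustrative, and I’ve left out the rest of the training loss.)

import torch
import torch.nn as nn

class PerTokenBiasSAE(nn.Module):
    def __init__(self, d_model, d_sae, vocab_size):
        super().__init__()
        self.token_bias = nn.Embedding(vocab_size, d_model)  # the per-token bias lookup
        self.W_enc = nn.Linear(d_model, d_sae)
        self.W_dec = nn.Linear(d_sae, d_model)

    def forward(self, x, token_ids):
        b_tok = self.token_bias(token_ids)   # [batch, seq, d_model]
        x_resid = x - b_tok                  # the SAE only ever sees x minus the per-token bias
        feats = torch.relu(self.W_enc(x_resid))
        x_hat = self.W_dec(feats) + b_tok    # add the per-token bias back on
        return x_hat, feats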
That’s great, thanks!
My suggested experiment to really get at this question (which, if I were in your shoes, I wouldn’t want to run because you’ve already done quite a bit of work on this project, lol):
Compare
1. Baseline 80x expansion (56k features) at k=30
2. Tokenized-learned 8x expansion (50k vocab + 6k features) at k=29 (since the token adds 1 extra feature)
for 300M tokens (I usually don’t see improvements past this amount) showing NMSE and CE.
If tokenized-SAEs are still better in this experiment, then that’s a pretty solid argument to use these!
If they’re equivalent, then tokenized-SAEs are still way faster to train in this lower expansion range, while having 50k “features” already interpreted.
If tokenized-SAEs are worse, then these tokenized features aren’t a good prior to use. Although both sets of features are learned, the difference would be the tokenized always has the same feature per token (duh), and baseline SAEs allow whatever combination of features (e.g. features shared across different tokens).
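(By NMSE I mean the usual normalized reconstruction error; one common convention, as a sketch:)

import torch

def nmse(x, x_hat):
    # x, x_hat: [n_tokens, d_model]; error normalized by the variance of the activations
    # (conventions differ; some normalize by ||x||^2 instead)
    return ((x_hat - x) ** 2).sum() / ((x - x.mean(dim=0)) ** 2).sum()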
About similar tokenized features, maybe I’m misunderstanding, but this seems like a problem for any decoder-like structure.
I didn’t mean to imply it’s a problem, but the interpretation should be different. For example, if at layer N all the number tokens have cos-sim = 1 in the tokenized-feature set, then if we find a downstream feature reading from the “ 9” token on a specific task, we should conclude it’s reading from a more general number direction rather than a specific number direction (sketched below).
I agree this argument also applies to the normal SAE decoder (if the cos-sim=1)
Although, tokenized features are dissimilar to normal features in that they don’t vary in activation strength. Tokenized features are either 0 or 1 (or the norm of the vector). So it’s not exactly an apples-to-apples comparison with a similarly sized dictionary of normal SAE features, although that plot would be nice!
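A sketch of the check I’m imagining for the number-token example above (tokenizer and token_features, the [vocab, d_model] per-token lookup, are assumed to already exist):

import torch.nn.functional as F

digit_ids = [tokenizer.encode(f" {d}")[0] for d in range(10)]   # " 0" ... " 9" token ids
digit_dirs = F.normalize(token_features[digit_ids], dim=-1)
cos_sims = digit_dirs @ digit_dirs.T                            # [10, 10] pairwise cosine sims
print(cos_sims.min(), cos_sims.mean())  # near 1.0 => effectively one shared "number" direction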
It’d be important to cache the karma of all users with >1000 karma right now, in order to credibly signal that you know which generals were part of the nuking/nuked side. Would anyone be willing to do that in the next 2.5 hours (ie the earliest we could be nuked)?