Thanks for raising this! I had wanted to find a comparison in terms of different model performances to help me quantify this so I’m glad to have this as a reference.
And am I right that this is the maximum CE increase from adding in a single SAE, rather than all of them at the same time? Unless there is some very convenient correlation in these errors, where every SAE fails to reconstruct roughly the same variance and that variance at early layers is not used to compute the variance that SAEs at later layers are capturing, the errors would add up, possibly even worse than linearly. What CE loss do you get then?
Have you tried talking to the patched models a bit and comparing them to what the original model sounds like? Are there any discernible systematic differences in where that CE increase changes the answers?
While I have explored model performance with SAEs at different layers, I haven’t done so with more than one SAE or explored sampling from the model with the SAE. I’ve been curious about systematic errors induced by the SAE but a few brief experiments with earlier SAEs/smaller models didn’t reveal any obvious patterns. I have once or twice looked at the divergence in the activations after an SAE has been added and found that errors in earlier layers propagated.
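To make the question about single versus simultaneous SAEs concrete, here is a minimal toy sketch, not an experiment on a real model. It assumes a small random tanh MLP, uses independent Gaussian noise as a stand-in for SAE reconstruction error (that is, the case with no convenient correlation between errors), and measures mean squared error at the output rather than CE loss. The names `forward` and `propagation_demo` are hypothetical.

```python
import numpy as np

def forward(x, weights, patched, noise_scale, rng):
    """Run a toy tanh MLP. At each layer index in `patched`, add Gaussian noise
    to the activation (an assumed stand-in for SAE reconstruction error)."""
    h = x
    for i, w in enumerate(weights):
        h = np.tanh(h @ w)
        if i in patched:
            h = h + noise_scale * rng.standard_normal(h.shape)
    return h

def propagation_demo(n_layers=6, width=64, noise_scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Random weights scaled so activations stay at a reasonable magnitude.
    weights = [rng.standard_normal((width, width)) / np.sqrt(width)
               for _ in range(n_layers)]
    x = rng.standard_normal((256, width))
    clean = forward(x, weights, set(), 0.0, rng)

    def output_error(patched):
        noisy = forward(x, weights, patched, noise_scale, rng)
        return float(np.mean((noisy - clean) ** 2))

    # Error at the output when exactly one layer is patched...
    single = [output_error({i}) for i in range(n_layers)]
    # ...versus when every layer is patched at once.
    combined = output_error(set(range(n_layers)))
    return single, combined

single, combined = propagation_demo()
print("one layer patched at a time:", [round(e, 5) for e in single])
print("all layers patched at once: ", round(combined, 5))
```

In this toy setting the error with every layer patched is larger than the worst single-layer case, since the independent errors accumulate. Whether real SAE errors behave like this or are correlated is exactly the open question, so this only illustrates the pessimistic case.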
One thought I have on this is that if we take the analogy to DNA sequencing seriously, relatively minor errors in DNA sequencing make the resulting sequences useless. If you get one or two base pairs wrong and then try to make bacteria express the printed gene (based on your sequencing), you’ll kill the bacteria. This gives me the intuition that we could well have fairly accurate measurements with some error and still find that the resulting downstream error is large.
To bring it back to what I suspect is the main point here: We should amend the statement to say “Our reconstruction scores were pretty good as compared to our previous results”.
It bothers me quite a bit that SAEs don’t recover performance better, but I think this is a fairly well-defined problem and that the community can iterate on it both via improvements to SAEs and by looking for nearby alternatives. For example, I’m quite excited to experiment with any alternative architectures/training procedures that come out of the theory of computation in superposition line of work.
One productive direction inspired by thinking of this as sequencing is that we should have lots of SAEs trained on the same model and show that they get very similar results (to give us more confidence we have a better estimate of the true underlying features). It’s standard in DNA/RNA/Protein sequencing to run methods many times over. I think once we see evidence that we get good results along those lines, we should be more interested in / raise our standards for model performance with reconstructed SAEs.
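One way such a consistency check could look, as a rough sketch: for each feature direction in one SAE’s decoder, find its best cosine-similarity match in another SAE’s decoder, then look at the distribution of those scores. The decoder matrices here are random placeholders rather than trained SAEs, and `best_match_cosine` is a hypothetical helper name.

```python
import numpy as np

def best_match_cosine(dec_a, dec_b):
    """For each row (feature direction) of dec_a, return the highest cosine
    similarity to any row of dec_b. Both have shape (n_features, d_model)."""
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

# Placeholder data: random matrices standing in for trained SAE decoders.
rng = np.random.default_rng(0)
dec_a = rng.standard_normal((128, 32))
# A second "run" that found the same directions, permuted and slightly perturbed.
dec_b_similar = dec_a[rng.permutation(128)] + 0.01 * rng.standard_normal((128, 32))
# A second "run" that found unrelated directions.
dec_b_unrelated = rng.standard_normal((128, 32))

print("mean best match, similar run:  ", best_match_cosine(dec_a, dec_b_similar).mean())
print("mean best match, unrelated run:", best_match_cosine(dec_a, dec_b_unrelated).mean())
```

A high mean best-match score across independently trained SAEs would be some evidence that they are converging on the same features. This matching is one-directional and ignores feature splitting, so it is a starting point rather than a full comparison.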