Lee Sharkey comments on [Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey 16 Dec 2022 2:14 UTC
LW: 1 AF: 1
0
AF
That’s correct. ‘Correlated features’ could ambiguously mean “Feature x tends to activate when feature y activates” OR “When we generate feature direction x, its distribution is correlated with feature y’s”. I don’t know if both happen in LMs. The former almost certainly does. The second doesn’t really make sense in the context of LMs since features are learned, not sampled from a distribution.