Great work! Love the push for intuitions, especially in the working notes.
My understanding of the superposition hypothesis from the TMS paper has been (feel free to correct me!):
When there’s no privileged basis, polysemanticity is the default, as there’s no reason to expect interpretable neurons.
When there’s a privileged basis, either because of a non-linearity on the hidden layer or L1 regularisation, the default is monosemanticity, and superposition pushes towards polysemanticity when there’s enough sparsity.
Is it possible that the features here are not sufficiently basis-aligned and this is closer to case 1? As you already commented, demonstrating polysemanticity when the hidden layer has a non-linearity and m > n would be principled imo.
Sorry for the late answer! I agree with your assessment of the TMS paper. In our case, the L1 regularization is strong enough that the encodings do completely align with the canonical basis: in the experiments that gave the “Polysemantic neurons vs hidden neurons” graph, we observe that all weights are either 0 or close to 1 or −1. And I think that all solutions which minimize the loss (with the L1 regularization included) align with the canonical basis.
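For readers who want to poke at this themselves, here is a minimal sketch of the kind of setup being discussed, not the actual experimental code behind the graph: a TMS-style toy model with tied weights, a ReLU on the output, and an L1 penalty on the hidden activations, trained on sparse features. The dimensions, sparsity level, and penalty strength are illustrative assumptions. If the claim above holds in this regime, the learned encoder weights should end up near 0 or ±1, and counting neurons with more than one large weight gives the kind of polysemanticity tally shown in the graph.

```python
import torch

torch.manual_seed(0)
n_features, n_hidden = 6, 3      # assumed toy sizes (fewer hidden neurons than features)
feature_sparsity = 0.9           # assumed: each feature is zero 90% of the time
l1_coeff = 0.03                  # assumed L1 strength ("strong enough" per the reply)

# Encoder weights (decoder is tied: W.T on the way in, W on the way out)
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse features in [0, 1], as in the TMS setting
    x = torch.rand(1024, n_features)
    x = x * (torch.rand_like(x) > feature_sparsity).float()
    h = x @ W.T                              # hidden encoding
    x_hat = torch.relu(h @ W + b)            # ReLU output, tied weights
    loss = ((x - x_hat) ** 2).mean() + l1_coeff * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

W_final = W.detach()
print(W_final)  # under the claim above, entries should be near 0 or +/-1

# Count hidden neurons that read off more than one feature, i.e. polysemantic neurons
polysemantic = ((W_final.abs() > 0.5).sum(dim=1) > 1).sum().item()
print(f"{polysemantic} of {n_hidden} hidden neurons respond to more than one feature")
```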