samshap comments on Engineering Monosemanticity in Toy Models

samshap 22 Nov 2022 1:35 UTC
1 point
0
My weak prediction is that adding low levels of noise would change the polysemantic activations, but not the monosemantic ones.

Adding L1 to the loss allows the network to converge on solutions that are more monosemantic than otherwise, at the cost of some estimation error. Basically, the network is less likely to lean on polysemantic neurons to make up small errors. I think your best bet is to apply the L1 loss on the hidden layer and the output later activations.