a take I’ve expressed a bunch irl but haven’t written up yet: feature sparsity might be fundamentally the wrong thing for disentangling superposition; circuit sparsity might be more correct to optimize for. in particular, circuit sparsity doesn’t have problems with feature splitting/absorption
Yeah, my view is that as long as our features/intermediate variables form human-understandable circuits, it doesn’t matter how “atomic” they are.