Hi Scott, thanks for this!
Yes, I did do a fair bit of literature searching (though maybe not enough, to be fair), but it was very focused on sparse coding and approaches to learning decompositions of model activation spaces, rather than approaches to learning models that are monosemantic by default, which I've never had much confidence in. As far as I've seen, there's not a huge amount beyond Yun et al.'s work in that space.
Still, the fact that I've seen almost none of these suggests a big hole in my knowledge, and in the paper I'll go through and add a lot more background on attempts to make more interpretable models.