This is cool! I wonder if it can be fixed. I imagine it could be improved somewhat by nudging the prefix distribution, but it doesn’t seem like that would solve it properly. Curious whether this is a large issue in real LMs. It’s frustrating that we don’t have access to ground-truth features in language models.
I think how large a problem this is could probably be inferred from a description of the feature distribution. It’d be nice to have a better sense of what that distribution actually is (assuming the paradigm is correct enough).
It might also be an artifact of using MSE loss. Maybe a different reconstruction loss wouldn’t have this problem?
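As a minimal sketch of what swapping the reconstruction term might look like (assuming the standard SAE setup of an MSE reconstruction loss plus an L1 sparsity penalty; the `TinySAE` class and the cosine-based alternative here are hypothetical illustrations, not anything from the original post):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySAE(nn.Module):
    """Toy sparse autoencoder; dimensions and architecture are illustrative."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.enc(x))   # feature activations
        return self.dec(f), f     # reconstruction, features

def loss_mse(x, x_hat, f, l1_coef=1e-3):
    # Standard setup: squared-error reconstruction + L1 sparsity penalty.
    return F.mse_loss(x_hat, x) + l1_coef * f.abs().sum(-1).mean()

def loss_cosine(x, x_hat, f, l1_coef=1e-3):
    # Hypothetical alternative: penalize directional error instead of squared
    # magnitude error, so low-norm activations aren't effectively ignored by
    # the reconstruction term.
    recon = (1 - F.cosine_similarity(x_hat, x, dim=-1)).mean()
    return recon + l1_coef * f.abs().sum(-1).mean()

# Compare both losses on random activations.
sae = TinySAE(d_model=64, d_hidden=256)
x = torch.randn(32, 64)
x_hat, f = sae(x)
print(loss_mse(x, x_hat, f).item(), loss_cosine(x, x_hat, f).item())
```

Whether something like the cosine variant actually avoids the artifact is an open question; it just illustrates that the reconstruction term is a free design choice.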