My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.
This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.”—no, this is fine, the model-feature exists but simply isn’t found by the SAE.
What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won’t help us understand models
Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)
My core request is that I want (SAE-)features to be a property of the model, rather than the dataset.
This can be misunderstood in the sense of taking issue with “If a concept is missing from the SAE training set, the SAE won’t find the corresponding feature.”—no, this is fine, the model-feature exists but simply isn’t found by the SAE.
What I mean to say is I take issue if “SAEs find a feature only because this concept is common in the dataset rather than because the model uses this concept.”[1] -- in my books this is SAEs making up features and that won’t help us understand models
Of course a concept being common in the model-training-data makes it likely (?) to be a concept the model uses, but I don’t think this is a 1:1 correspondence. (So just making the SAE training set equal to the model training set wouldn’t solve the issue.)