Had a chat with @Logan Riggs about this. My main takeaway was that if SAEs aren’t learning the features for separate squares, it’s plausibly because in the data distribution there’s some even-more-sparse pattern going on that they can exploit. E.g. if big runs of same-color stones show up regularly, it might be lower-loss to represent runs directly than to represent them as made up of separate squares.
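To make the sparsity argument concrete, here's a toy numeric sketch (not the actual SAE setup; the dimensions, L1 coefficient, and decoder directions are made up for illustration). If a run of 8 same-color squares can be reconstructed either by 8 "single square" latents or by 1 "whole run" latent, and both reconstruct equally well, the run latent pays much less L1 penalty, so the SAE has an incentive to learn it instead of the square features:

```python
import numpy as np

# Toy illustration: a run of 8 same-color squares, reconstructed two ways.
rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Hypothetical unit-norm decoder directions, one per square.
square_dirs = rng.normal(size=(8, d_model))
square_dirs /= np.linalg.norm(square_dirs, axis=1, keepdims=True)

# A single "run" direction pointing along the sum of the square directions.
run_sum = square_dirs.sum(axis=0)
run_dir = run_sum / np.linalg.norm(run_sum)

l1_coeff = 3e-4  # made-up sparsity coefficient

# Option A: eight square latents, each firing with activation 1
# (reconstructs the sum of the square directions exactly).
acts_squares = np.ones(8)

# Option B: one run latent firing with activation ||sum||
# (reconstructs exactly the same vector).
acts_run = np.array([np.linalg.norm(run_sum)])

print("L1 penalty, separate squares:", l1_coeff * acts_squares.sum())   # ~ 8 * l1_coeff
print("L1 penalty, single run latent:", l1_coeff * acts_run.sum())      # ~ sqrt(8) * l1_coeff
```

With roughly orthogonal square directions the run latent's activation is only about sqrt(8) ≈ 2.8 versus a total of 8 for the separate-square encoding, so for the same reconstruction the "run" feature is cheaper, which is exactly the kind of exploitable structure in the data distribution being described.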
If this is the bulk of the story, then messing around with training might not change much (but training on different data might change a lot).