Yeah, this makes a ton of sense. Thanks for taking the time to give it a closer look, and for your detailed response :)
So for the SAE to be useful here, I'd have to train it on a lot of sentiment data; then I could maybe discover some interpretable sentiment-related features that would help me understand why the model thinks a review is positive/negative...
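For reference, the basic SAE setup I have in mind looks roughly like this (a minimal numpy sketch with untrained random weights; the dimensions and L1 coefficient are made up, not from the actual experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model activation dim, overcomplete SAE dim, batch of tokens.
d_model, d_sae, n_tokens = 64, 256, 100

# Encoder/decoder weights (random here; in practice these are learned).
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into (hopefully sparse) features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU zeroes out inactive features
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=(n_tokens, d_model))  # stand-in for real model activations
f, x_hat = sae_forward(x)

# Training objective: reconstruction error plus an L1 penalty that pushes
# most feature activations to exactly zero (the sparsity discussed above).
l1_coeff = 1e-3
loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).mean()

# Fraction of active features per token -- after training with the L1 term,
# this is the "low number of active features" quantity.
active_frac = float((f > 0).mean())
```

The idea would then be to run sentiment examples through the trained SAE and look for individual features in `f` that fire consistently on positive vs. negative reviews.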
The relative difference in the train accuracies looks pretty similar. But yeah, @SenR already pointed out the low number of active features in the SAE, which explains this nicely.