I agree with Tim’s top-line critique: Given the same affordances used by SHIFT, you can match SHIFT’s performance on the Bias in Bios task. In the rest of this comment, I’ll reflect on the update I make from this.
First, let me disambiguate two things you can learn from testing a method (like SHIFT) on a downstream task:
1. Whether the method works as intended. E.g. you might have thought that SHIFT would fail because its ablations don’t remove the gender information completely enough, such that the classifier learns the same thing whether or not you apply the SAE feature ablations. But we learned that SHIFT did not fail in this way.
2. Whether the method outperforms other baseline techniques. Tim’s results refute this by showing that there is a simple “delete gendered words” baseline that gets similar performance to SHIFT.
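To make point (2) concrete, here is a minimal sketch of what a “delete gendered words” baseline could look like. The word list and function name are illustrative assumptions, not Tim’s actual implementation; the idea is just to strip gender-indicating tokens from each bio before training the classifier.

```python
# Illustrative word list -- a real baseline would use a more complete one.
GENDERED_WORDS = {"he", "she", "him", "her", "his", "hers", "mr", "mrs", "ms"}

def delete_gendered_words(text: str) -> str:
    """Remove gendered tokens from a bio before classifier training."""
    kept = [
        w for w in text.split()
        if w.lower().strip(".,;:!?") not in GENDERED_WORDS
    ]
    return " ".join(kept)

print(delete_gendered_words("She is a nurse; he admires her work."))
# -> "is a nurse; admires work."
```

The point of the baseline is that it uses the same affordance SHIFT does (knowing that gender is the spurious cue) without any interpretability machinery.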
I think people often underrate the importance of point (2). Some will be tempted to defend SHIFT with an argument like:
“Fine, in this case there was a hacky word-deletion baseline that is competitive. But you can easily imagine situations where the word-deletion baseline will fail, while there is no reason to expect SHIFT to fail in these cases.”
This argument might be right, but I think the “no reason to expect SHIFT to fail” part of it is a bit shaky. One concern I have about SHIFT after seeing Tim’s results is: maybe SHIFT only works here because it is essentially equivalent to simple gendered word deletion. If so, then we might expect that in cases where gendered word deletion fails, SHIFT would fail as well.
I have genuine uncertainty on this point, which can basically only be resolved empirically. Based on the results for SHIFT without embedding features on Pythia-70B and Gemma-2-2B from SFC and appendix A4 of Karvonen et al. (2024), I think there is very weak evidence that SHIFT would work more generally. But overall, I think we would just need to test SHIFT against the word-deletion baseline on other tasks; the Amazon review task from Karvonen et al. (2024) might be suitable here, but I’m guessing the other tasks from the papers Tim links aren’t.
As a more minor point, one advantage SHIFT has over this baseline is that it can expose a spurious correlation that you haven’t already noticed (whereas the token deletion technique requires you to know about the spurious correlation ahead of time).