Sam Marks comments on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks 22 Apr 2024 20:23 UTC
I’m pretty sure you’re mistaken that the interpretation step in our SHIFT experiments essentially relies on data from the Pile. I strongly expect that if we used only inputs from $D_{a}$, we would be able to interpret the SAE features about as well. For example, some of the SAE features activate only on female pronouns, and we would be able to notice this. Technically, we wouldn’t be able to rule out the hypothesis “this feature activates on female pronouns only when their antecedent is a nurse,” but that would be a bit of a crazy hypothesis anyway.
In more realistic settings (larger models and subtler behaviors) we might have more serious problems ruling out hypotheses like this. But I don’t see any fundamental reason that using disambiguating datapoints is strictly necessary.
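To make the point about ruling out hypotheses concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the SHIFT experiments: the `Example` record, the two hypothesis predicates, and the hand-written datapoints are all hypothetical, and real feature interpretation would look at actual SAE activations rather than boolean predicates. It only shows that the simple and the "crazy" hypothesis make identical predictions on data where gender and profession are perfectly correlated, and come apart on a disambiguating datapoint.

```python
from dataclasses import dataclass

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


@dataclass(frozen=True)
class Example:
    """Hypothetical toy datapoint: a profession and a pronoun referring to it."""
    profession: str  # "nurse" or "professor"
    pronoun: str


def h_female_pronoun(ex: Example) -> bool:
    """Simple hypothesis: the feature fires on any female pronoun."""
    return ex.pronoun in FEMALE_PRONOUNS


def h_female_pronoun_nurse_only(ex: Example) -> bool:
    """'Crazy' hypothesis: fires on female pronouns only when the antecedent is a nurse."""
    return ex.pronoun in FEMALE_PRONOUNS and ex.profession == "nurse"


def disagreements(examples):
    """Return the examples on which the two hypotheses predict different activations."""
    return [
        ex for ex in examples
        if h_female_pronoun(ex) != h_female_pronoun_nurse_only(ex)
    ]


# Toy stand-in for ambiguous data: gender and profession are perfectly correlated.
ambiguous = [
    Example("nurse", "she"),
    Example("nurse", "her"),
    Example("professor", "he"),
    Example("professor", "his"),
]

# Toy stand-in for disambiguating data: the correlation is broken.
disambiguating = ambiguous + [
    Example("professor", "she"),
    Example("nurse", "he"),
]

print(len(disagreements(ambiguous)))       # 0: the hypotheses are indistinguishable here
print(len(disagreements(disambiguating)))  # 1: only the female professor separates them
```

In this toy setting the ambiguous data cannot separate the two hypotheses, so preferring the simpler one is a judgment about plausibility rather than something the data forces.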