Quick take: "Against Almost Every Theory of Impact of Interpretability" should be required reading for ~anyone starting in AI safety (e.g. it should be in the AGISF curriculum), especially if they’re considering any model internals work (and of course even more so if they’re specifically considering mech interp).