Seems a bit in the spirit of what’s called ‘high-level cross-validation’ here, between SAEs + auto-interp on the one hand, and activation engineering on the other.
Although not “circuit-style,” this could also be considered one of these attempts outlined by Mack et al. 2024.
https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/#:~:text=Unsupervised%20steering%20as,more%20distributed%20circuits.
Seems a bit in the spirit of what’s called ‘high-level cross-validation’ here, between SAEs + auto-interp on the one hand, and activation engineering on the other.
Although not “circuit-style,” this could also be considered one of these attempts outlined by Mack et al. 2024.
https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/#:~:text=Unsupervised%20steering%20as,more%20distributed%20circuits.