Hey, I’m the first author of INLP and RLACE. The observation you point to was highly surprising to us as well. The RLACE paper actually started as an attempt to prove the optimality of INLP, which turned out not to hold for classification problems. For classification, it is simply not true that the subspace that does not encode a concept is the orthogonal complement of the subspace that encodes the concept most saliently (the subspace spanned by the classifier’s parameter vector). In RLACE we do prove that this property holds for certain objectives: e.g., if you want to find the subspace whose removal decreases explained variance (PCA) or correlation (CCA) the most, it is exactly the subspace that maximizes explained variance or correlation. But empirically this does not hold for the logistic loss.
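To make this concrete, here is a minimal synthetic sketch of the failure mode (the data-generating choices are my own illustration, not from the papers): with anisotropic class covariances, the logistic classifier’s direction is not parallel to the mean-difference direction, so projecting out one classifier direction (a single INLP step) still leaves the concept linearly decodable above chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 20_000, 10

# Two Gaussian classes whose means differ along e0 + e1, with an
# anisotropic covariance: the optimal logistic direction is roughly
# Sigma^{-1} (mu1 - mu0), NOT the mean-difference direction itself.
y = rng.integers(0, 2, n)
std = np.ones(d)
std[0], std[1] = 2.0, 0.5
X = rng.standard_normal((n, d)) * std
X[:, 0] += y
X[:, 1] += y

clf = LogisticRegression(C=1000, max_iter=2000).fit(X, y)
acc_before = clf.score(X, y)

# One INLP step: project out the classifier's parameter direction w.
w = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
X_proj = X - np.outer(X @ w, w)

# A fresh linear classifier still decodes the concept above chance,
# because part of the mean difference survives in the complement of w.
clf2 = LogisticRegression(C=1000, max_iter=2000).fit(X_proj, y)
acc_after = clf2.score(X_proj, y)
print(acc_before, acc_after)
```

This is why INLP has to iterate, removing one direction per round, whereas RLACE searches directly for the rank-constrained projection that minimizes the adversary’s loss.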
Sidenote: Follow-up work to INLP and RLACE [1] pointed out that these methods are correlational, not causal, in nature. We had already acknowledged this in other works within the “concept erasure” line [2], but they provide a theoretical analysis and also exhibit specific constructions where the subspace found by INLP is substantially different from the “true” subspace (under the generative model). Some of the comments here refer to that. Such constructions certainly exist, although the methods are effective in practice, and there are ways to quantify the robustness of the captured subspace (see, e.g., the experiments in [2]). At any rate, I don’t think this has any direct connection to the phenomenon discussed here (note that INLP needs to remove a large number of dimensions for the training set as well, not just for the held-out data). See also this preprint for a recent discussion of the subtle limitations of linear concept identification.
[1] Kumar, A., Tan, C., & Sharma, A. (2022). Probing classifiers are unreliable for concept removal and detection. arXiv preprint arXiv:2207.04153.
[2] Elazar, Y., Ravfogel, S., Jacovi, A., & Goldberg, Y. (2021). Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9, 160–175.
Actually, RLACE has a smaller impact on the representation space, since it removes only a rank-1 subspace.
Note that if we train a linear classifier w to convergence (as is done in the first iteration of INLP), then by definition we can project the entire representation space onto the direction w and retain the very same accuracy, because the subspace spanned by w is the only thing the linear classifier is “looking at”. We performed experiments similar in spirit to what you suggest with INLP in [this](https://arxiv.org/pdf/2105.06965.pdf) paper. In the attached image you can see the effect of a positive/negative intervention across layers:
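The “projection onto w preserves accuracy” point can be sketched in a few lines (toy data of my own; the claim itself is just linear algebra): a linear classifier’s decision depends only on the component of x along its weight vector, so keeping only the rank-1 component along w leaves its predictions unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 5_000, 20

# Toy data: the concept signal lives in the first few dimensions.
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, d))
X[:, :3] += y[:, None]

clf = LogisticRegression(max_iter=2000).fit(X, y)
w = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Keep ONLY the rank-1 component of each representation along w.
X_onto = np.outer(X @ w, w)

# The decision function is X @ coef + intercept, and coef is parallel
# to w, so X_onto @ coef == X @ coef: identical predictions, identical
# accuracy.
same = np.array_equal(clf.predict(X), clf.predict(X_onto))
print(same)
```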