Actually, RLACE has less impact on the representation space, since it removes only a rank-1 subspace.
Note that if we train a linear classifier w to convergence (as is done in the first iteration of INLP), then by definition we can project the entire representation space onto the direction w and retain exactly the same accuracy, because the subspace spanned by w is the only thing the linear classifier is “looking at”. We performed experiments similar in spirit to what you suggest with INLP in [this](https://arxiv.org/pdf/2105.06965.pdf) paper. In the attached image you can see the effect of a positive/negative intervention across layers:
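As a minimal sanity check of that claim (my own sketch, not code from either paper; it assumes a binary concept, synthetic stand-in representations, and scikit-learn's `LogisticRegression`), replacing every representation x with its projection onto span(w) leaves the probe's accuracy unchanged, because the decision function only depends on w·x:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for model representations: d-dimensional vectors with a binary concept label.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

# Train the linear "concept probe" to convergence, as in the first INLP iteration.
clf = LogisticRegression(max_iter=5000).fit(X, y)
w = clf.coef_[0]                      # the single direction the probe looks at

# Project every representation onto span(w): x -> (w w^T / ||w||^2) x.
P_span = np.outer(w, w) / (w @ w)
X_on_w = X @ P_span                   # P_span is symmetric, so this applies it row-wise

# w . (P_span x) == w . x, so the decision function (and hence accuracy) is unchanged.
print(clf.score(X, y), clf.score(X_on_w, y))   # identical accuracies
```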
In the experiments I ran with GPT-2, RLACE and INLP were both used with a rank-1 projection. So RLACE could have “more impact” if it removed a more important direction, which I think it does.
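For concreteness, this is what I mean by a rank-1 projection in both cases: a single direction u (the probe's weight vector for the INLP-style step, the learned concept direction for RLACE) is projected out of every representation. This is an illustrative numpy sketch of my own, not code from either method's repository:

```python
import numpy as np

def remove_direction(X: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Project out a single direction u from each row of X (rank-1 erasure).

    The orthogonal projection P = I - u u^T / ||u||^2 maps each representation
    onto the hyperplane orthogonal to u, so the component of X along u is gone.
    """
    u = u / np.linalg.norm(u)
    P = np.eye(X.shape[1]) - np.outer(u, u)
    return X @ P

# Example: after erasure, every representation is orthogonal to u.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))          # e.g. a handful of GPT-2-sized hidden states
u = rng.normal(size=768)               # the direction found by the probe / RLACE
X_erased = remove_direction(X, u)
print(np.allclose(X_erased @ u, 0.0))  # True
```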
I know it’s not the intended use of INLP, but I took my inspiration from that technique, which is why I write INLP (Ravfogel, 2020); the original technique removes multiple directions to obtain a measurable effect.
[Edit] Tell me if you would prefer that I avoid calling the “linear classifier method” INLP (it isn’t actually iterated in the experiments I ran, but it is where I discovered the idea of using a linear classifier to find a direction to project out and thereby remove information)!