Clément Dumas comments on Activation space interpretability may be doomed

Clément Dumas 11 Jan 2025 18:20 UTC
4 points
This is also a concern I have, but I feel like steering with the feature / projecting it out is kinda sufficient to understand whether the model uses it.
- bilalchughtai 12 Jan 2025 19:49 UTC
  6 points
  How do you know what “ideal behaviour” is after you steer or project out your feature? How would you differentiate a feature with sufficiently high cosine sim to a “true model feature” from the “true model feature” itself? I agree you can get some signal on whether a feature is causal, but would argue this is not ambitious enough.
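
To make the disagreement concrete, here is a minimal NumPy sketch of the two interventions mentioned in the thread (steering and projecting out), applied to toy activations. Everything in it is an assumption for illustration only: the helper names `project_out` and `steer`, the 64-dimensional space, the synthetic `true_feature`, and the candidate direction built to have cosine similarity 0.9 with it. It is not taken from either commenter or from any particular interpretability library.

```python
import numpy as np

def project_out(acts, direction):
    """Remove the component of each activation row along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def steer(acts, direction, alpha):
    """Add `alpha` times the unit-normalised `direction` to each activation row."""
    d = direction / np.linalg.norm(direction)
    return acts + alpha * d

rng = np.random.default_rng(0)
dim = 64

# Hypothetical "true model feature": a random unit vector.
true_feature = rng.normal(size=dim)
true_feature /= np.linalg.norm(true_feature)

# Hypothetical candidate (e.g. a learned dictionary direction) constructed
# to have cosine similarity exactly `cos` with the true feature.
cos = 0.9
noise = rng.normal(size=dim)
noise -= (noise @ true_feature) * true_feature  # make noise orthogonal to the true feature
noise /= np.linalg.norm(noise)
candidate = cos * true_feature + np.sqrt(1 - cos**2) * noise

# Toy activations that lie purely along the true feature.
coeffs = rng.uniform(1.0, 2.0, size=100)
acts = coeffs[:, None] * true_feature

# Ablating the true feature removes it completely.
residual_true = project_out(acts, true_feature) @ true_feature

# Ablating the candidate leaves a fraction 1 - cos**2 of it behind.
residual_candidate = project_out(acts, candidate) @ true_feature
fraction_left = residual_candidate / coeffs

# Steering along the candidate moves the true-feature coordinate by alpha * cos.
shift = (steer(acts, candidate, alpha=2.0) - acts) @ true_feature

print(fraction_left[:3], shift[:3])
```

In this toy setup, projecting out the candidate removes a fraction `cos**2` of the true feature's component, and steering along it shifts the true-feature coordinate by `alpha * cos`. Both interventions would therefore show a clear causal effect even though the candidate is not the true feature, which is one way to read the reply's point that causal signal alone does not separate the two.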