How do you know what “ideal behaviour” is after you steer with or project out your feature? How would you distinguish a feature that merely has high cosine similarity to a “true model feature” from the “true model feature” itself? I agree you can get some signal on whether a feature is causal, but I would argue this is not ambitious enough.
This is also a concern I have, but I feel like steering / projecting out is kind of sufficient to understand whether the model uses this feature.
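To make the two operations concrete, here is a minimal sketch of projecting out and steering on a single activation vector, plus a toy check of the cosine-similarity worry raised above. Everything in it is an illustrative assumption: the helper names (`project_out`, `steer`), the three-dimensional toy vectors, and the steering coefficient are made up for the example and are not taken from any particular model or library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine_sim(a, b):
    return dot(normalize(a), normalize(b))

def project_out(act, direction):
    """Remove the component of `act` along `direction` (ablation)."""
    d = normalize(direction)
    coeff = dot(act, d)
    return [a - coeff * x for a, x in zip(act, d)]

def steer(act, direction, alpha):
    """Add `alpha` times the unit-norm `direction` to `act`."""
    d = normalize(direction)
    return [a + alpha * x for a, x in zip(act, d)]

# Hypothetical toy data: a "true" feature direction and a candidate
# direction that is close to it in cosine similarity but not identical.
true_feature = [1.0, 0.0, 0.0]
candidate = [0.95, 0.3, 0.1]
act = [2.0, 1.0, -1.0]

sim = cosine_sim(candidate, true_feature)

# Ablate the candidate, then measure how much of the true feature is left.
ablated = project_out(act, candidate)
before_on_true = dot(act, normalize(true_feature))
after_on_true = dot(ablated, normalize(true_feature))

# Steer along the candidate with an arbitrary coefficient.
steered = steer(act, candidate, 3.0)

print(f"cosine sim(candidate, true) = {sim:.3f}")
print(f"true-feature component before ablation = {before_on_true:.3f}")
print(f"true-feature component after ablating candidate = {after_on_true:.3f}")
```

In this toy setup, ablating the candidate removes almost all of the activation's component along the true feature, so an ablation experiment alone would look much the same for either direction. That is only an illustration of the concern in a linear toy case, not evidence about how a real model behaves.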