Do you also conclude that the causal role of the circuit you discovered was spurious? What’s a better way to incorporate the sample-level variance you mention when measuring the effectiveness of an SAE feature or steering vector? (i.e., should a good metric of causal importance require an increase at both the sample and population level?)
Could you also link to an example where a causal intervention satisfied the above-mentioned criteria (or an alternative of your own that wasn’t mentioned in this post)?
What’s a better way to incorporate the sample-level variance you mention when measuring the effectiveness of an SAE feature or steering vector?
In the steering vectors work I linked, we looked at how much of the variance in the metric was explained by a spurious factor, and I think that can be a useful technique when you have some a priori intuition about what might be driving the variance. However, this doesn’t license testing a bunch of hypotheses after the fact, because that looks like p-hacking.
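Here is a minimal sketch of what that check might look like, assuming a per-sample metric and a single binary candidate factor; the names and the synthetic data are illustrative, not from the linked work:

```python
# Sketch: estimate how much of the variance in a per-sample metric is
# explained by one hypothesized binary factor, via the R^2 of a
# single-variable regression (for a binary factor this equals the
# eta-squared of a one-way ANOVA over the two groups).
import numpy as np

def variance_explained(metric: np.ndarray, factor: np.ndarray) -> float:
    """Fraction of variance in `metric` explained by `factor`."""
    m = metric - metric.mean()
    f = factor - factor.mean()
    r = (m @ f) / (np.linalg.norm(m) * np.linalg.norm(f))  # Pearson correlation
    return r ** 2

# Synthetic demo: the metric difference is driven mostly by the factor.
rng = np.random.default_rng(0)
factor = rng.integers(0, 2, size=1000).astype(float)  # e.g. ABBA vs. BABA
metric = 2.0 * factor + rng.normal(scale=0.5, size=1000)
print(variance_explained(metric, factor))  # high => mostly the factor
```

If this number is large, the population-level effect may be carried by one subgroup rather than by the mechanism you hypothesized, which is exactly the situation the post warns about.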
Generally, I do think that ‘population variance’ should be reported alongside ‘population mean’ in order to contextualize findings. But again, this doesn’t paint a very clean picture on its own; high variance could be due to heteroscedasticity, among other things.
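As a sketch of what that reporting might look like, assuming per-sample metric values and some candidate subgroup labels (both hypothetical here), with Levene’s test from scipy as one standard check for whether the variance differs across subgroups:

```python
# Sketch: report population variance alongside the mean, and check
# whether high variance is uniform or driven by heteroscedasticity
# across candidate subgroups.
import numpy as np
from scipy.stats import levene

def report_metric(metric: np.ndarray, groups: np.ndarray) -> None:
    # Population mean and (unbiased) variance of the metric.
    print(f"mean = {metric.mean():.3f}, variance = {metric.var(ddof=1):.3f}")
    # Levene's test: a small p-value suggests unequal variances
    # across subgroups, i.e. heteroscedasticity.
    samples = [metric[groups == g] for g in np.unique(groups)]
    stat, p = levene(*samples)
    print(f"Levene W = {stat:.3f}, p = {p:.3g}")
```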
I don’t have great solutions for this illusion outside of those two recommendations. One naive way to try to solve it is to remove samples from the dataset until the variance is minimal, but it’s hard to do this in a principled way that doesn’t eventually look like p-hacking.
Do you also conclude that the causal role of the circuit you discovered was spurious?
an example where a causal intervention satisfied the above-mentioned criteria (or an alternative of your own that wasn’t mentioned in this post)
I would guess that the IOI SAE circuit we found is not unduly influenced by spurious factors, and that the analysis above (how much of the variance in the metric difference is explained by the ABBA / BABA split) would corroborate this. I haven’t rigorously tested this, but I’d be very surprised if it turned out not to be the case.