Charlie Steiner comments on Extracting and Evaluating Causal Direction in LLMs’ Activations

Charlie Steiner 14 Dec 2022 22:51 UTC
Nice! It took me a few tries to understand the graph showing that removing the RLACE direction causes a larger loss of capability on the politics and facts tests. Maybe add more explanation there for people like me?
In a way, the surprising thing to me is that RLACE does so well, not that it does badly when applied to layer 10.
- Fabien Roger 15 Dec 2022 8:36 UTC
  I agree, this wasn’t very clear. I’ll add a few words.
  It also surprised me! RLACE is so slow to run that I wasn't able to experiment with it much, but it's definitely interesting that it performs so well. Also, earlier experiments showed that RLACE isn't very consistent: running it multiple times yielded different results (while CDE is much more consistent), so what's happening at layer 7 might be a fluke, with RLACE simply getting unlucky. I'll de-emphasize the "CDE outperforming RLACE" claims.