Tom McGrath

Karma: 107

[Linkpost] Play with SAEs on Llama 3

Sep 25, 2024, 10:35 PM

40 points

Tom McGrath Jan 24, 2024, 7:28 PM
3 points
in reply to: niplav’s comment on: Safety as a Scientific Pursuit
Thanks! I really like inductive vs deductive and would probably have used them if I’d thought of them.

Oct 13, 2023, 6:32 PM

82 points

Tom McGrath Nov 19, 2021, 4:06 PM
5 points
on: “Acquisition of Chess Knowledge in AlphaZero”: probing AZ over time
I’m one of the authors on this paper—happy to answer any questions/discuss if anyone is interested.

Tom McGrath Nov 19, 2021, 4:06 PM
4 points
in reply to: Zac Hatfield-Dodds’s comment on: “Acquisition of Chess Knowledge in AlphaZero”: probing AZ over time
Thanks for the summary! Your first bullet point was my motivation for doing this. I think it’s important to test out interpretability ideas in more challenging domains.
We didn’t really do much interpretability in this paper; it’s more meta-interpretability in a sense (i.e. studying whether interpretability should in principle be possible). I’d say section 4 is worth a look, especially section 4.5, which covers fundamental and practical challenges to probing. Section 7 has some NMF analysis, and we open-sourced the NMF factors, which you might find interesting.