Nicholas Goldowsky-Dill

Karma: 886

Interpretability Researcher at Apollo Research

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Mar 17, 2025, 7:11 PM
177 points
7 comments · 6 min read · LW link

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points
9 comments · 2 min read · LW link
(arxiv.org)

Nicholas Goldowsky-Dill’s Shortform

Nicholas Goldowsky-Dill · Nov 6, 2024, 12:37 PM
5 points
2 comments · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jul 18, 2024, 2:15 PM
121 points
18 comments · 18 min read · LW link

Apollo Research 1-year update

May 29, 2024, 5:44 PM
93 points
0 comments · 7 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 20, 2024, 5:53 PM
105 points
4 comments · 3 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

May 17, 2024, 4:25 PM
57 points
20 comments · 4 min read · LW link
(arxiv.org)

Causal scrubbing: results on induction heads

Dec 3, 2022, 12:59 AM
34 points
1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

Dec 3, 2022, 12:59 AM
34 points
2 comments · 30 min read · LW link

Causal scrubbing: Appendix

Dec 3, 2022, 12:58 AM
18 points
4 comments · 20 min read · LW link