
Adrià Garriga-Alonso

Karma: 1,068

Feature Hedging: Another way correlated features break SAEs

Mar 25, 2025, 2:33 PM
10 points
0 comments · 17 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

Feb 7, 2025, 3:57 AM
29 points
0 comments · 10 min read · LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Aug 23, 2024, 10:03 PM
10 points
0 comments · 25 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments · 2 min read · LW link
(arxiv.org)

Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024, 7:27 PM
96 points
4 comments · 8 min read · LW link
(arxiv.org)

Catastrophic Goodhart in RL with KL penalty

May 15, 2024, 12:58 AM
62 points
10 comments · 7 min read · LW link

An evaluation of circuit evaluation metrics

Apr 15, 2024, 7:38 PM
18 points
0 comments · 4 min read · LW link

Ophiology (or, how the Mamba architecture works)

Apr 9, 2024, 7:31 PM
67 points
8 comments · 10 min read · LW link