
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper · 21 May 2024 20:15 UTC
91 points
6 comments · 3 min read · LW link

Announcing Human-aligned AI Summer School

22 May 2024 8:55 UTC
15 points
0 comments · 1 min read · LW link
(humanaligned.ai)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
77 points
2 comments · 3 min read · LW link

The Problem With the Word ‘Alignment’

21 May 2024 3:48 UTC
53 points
4 comments · 6 min read · LW link

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse · 17 May 2024 19:13 UTC
63 points
6 comments · 2 min read · LW link

Infra-Bayesian haggling

hannagabor · 20 May 2024 12:23 UTC
17 points
0 comments · 20 min read · LW link

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

17 May 2024 16:25 UTC
53 points
2 comments · 4 min read · LW link
(arxiv.org)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
380 points
91 comments · 12 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
166 points
37 comments · 45 min read · LW link

Instruction-following AGI is easier and more likely than value aligned AGI

Seth Herd · 15 May 2024 19:38 UTC
35 points
21 comments · 12 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
191 points
79 comments · 10 min read · LW link

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

6 May 2024 7:07 UTC
82 points
4 comments · 1 min read · LW link
(arxiv.org)

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet

DanielFilan · 7 May 2024 3:50 UTC
73 points
4 comments · 71 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
119 points
17 comments · 1 min read · LW link
(www.anthropic.com)

Linear infra-Bayesian Bandits

Vanessa Kosoy · 10 May 2024 6:41 UTC
38 points
5 comments · 1 min read · LW link
(arxiv.org)

AI Safety Strategies Landscape

Charbel-Raphaël · 9 May 2024 17:33 UTC
31 points
1 comment · 42 min read · LW link

Modern Transformers are AGI, and Human-Level

abramdemski · 26 Mar 2024 17:46 UTC
213 points
89 comments · 5 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
59 points
14 comments · 17 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
101 points
7 comments · 12 min read · LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

27 Apr 2024 16:04 UTC
66 points
12 comments · 13 min read · LW link