Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
19 points
0 comments
9 min read
LW link

Superposition is not “just” neuron polysemanticity

LawrenceC
26 Apr 2024 23:22 UTC
25 points
0 comments
13 min read
LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
28 points
0 comments
8 min read
LW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan
25 Apr 2024 19:10 UTC
18 points
1 comment
63 min read
LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
60 points
23 comments
1 min read
LW link
(arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
118 points
14 comments
1 min read
LW link
(www.anthropic.com)

Dequantifying first-order theories

jessicata
23 Apr 2024 19:04 UTC
39 points
8 comments
8 min read
LW link
(unstableontology.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart
23 Apr 2024 14:09 UTC
29 points
2 comments
8 min read
LW link

Time complexity for deterministic string machines

alcatal
21 Apr 2024 22:35 UTC
14 points
0 comments
21 min read
LW link

Inducing Unprompted Misalignment in LLMs

19 Apr 2024 20:00 UTC
35 points
6 comments
16 min read
LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
70 points
8 comments
8 min read
LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
68 points
0 comments
3 min read
LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks
18 Apr 2024 16:17 UTC
99 points
5 comments
12 min read
LW link

LLM Evaluators Recognize and Favor Their Own Generations

17 Apr 2024 21:09 UTC
43 points
1 comment
3 min read
LW link
(tiny.cc)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai
16 Apr 2024 21:16 UTC
304 points
63 comments
12 min read
LW link

Speedrun ruiner research idea

lukehmiles
13 Apr 2024 23:42 UTC
4 points
11 comments
2 min read
LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan
11 Apr 2024 21:30 UTC
67 points
10 comments
107 min read
LW link

The theory of Proximal Policy Optimisation implementations

salman.mohammadi
11 Apr 2024 13:00 UTC
3 points
1 comment
6 min read
LW link
(salmanmohammadi.github.io)

How I select alignment research projects

10 Apr 2024 4:33 UTC
34 points
4 comments
24 min read
LW link

PIBBSS is hiring in a variety of roles (alignment research and incubation program)

9 Apr 2024 8:12 UTC
47 points
0 comments
3 min read
LW link