
Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
47 points
3 comments · 1 min read · LW link

Take SCIFs, it’s dangerous to go alone

1 May 2024 8:02 UTC
33 points
1 comment · 3 min read · LW link

AXRP Episode 30 - AI Security with Jeffrey Ladish

DanielFilan 1 May 2024 2:50 UTC
25 points
0 comments · 79 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
143 points
26 comments · 45 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

30 Apr 2024 17:58 UTC
54 points
11 comments · 17 min read · LW link

Towards a formalization of the agent structure problem

Alex_Altair 29 Apr 2024 20:28 UTC
46 points
2 comments · 14 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors 29 Apr 2024 10:53 UTC
61 points
2 comments · 2 min read · LW link

[Aspiration-based designs] Outlook: dealing with complexity

28 Apr 2024 13:06 UTC
11 points
3 comments · 2 min read · LW link

[Aspiration-based designs] 3. Performance and safety criteria, and aspiration intervals

Jobst Heitzig 28 Apr 2024 13:04 UTC
10 points
0 comments · 12 min read · LW link

[Aspiration-based designs] 2. Formal framework, basic algorithm

28 Apr 2024 13:02 UTC
16 points
2 comments · 16 min read · LW link

[Aspiration-based designs] 1. Informal introduction

28 Apr 2024 13:00 UTC
40 points
4 comments · 8 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
176 points
66 comments · 10 min read · LW link

Superposition is not “just” neuron polysemanticity

LawrenceC 26 Apr 2024 23:22 UTC
50 points
4 comments · 13 min read · LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
41 points
1 comment · 8 min read · LW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan 25 Apr 2024 19:10 UTC
19 points
1 comment · 63 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
61 points
35 comments · 1 min read · LW link
(arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
117 points
15 comments · 1 min read · LW link
(www.anthropic.com)

Dequantifying first-order theories

jessicata 23 Apr 2024 19:04 UTC
39 points
9 comments · 8 min read · LW link
(unstableontology.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart 23 Apr 2024 14:09 UTC
36 points
2 comments · 8 min read · LW link

Time complexity for deterministic string machines

alcatal 21 Apr 2024 22:35 UTC
14 points
0 comments · 21 min read · LW link