scasper

Karma: 1,900

https://stephencasper.com/

EIS XIV: Is mechanistic interpretability about to be practically useful?

scasper · 11 Oct 2024 22:13 UTC
68 points
4 comments · 7 min read · LW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper · 30 Jul 2024 14:57 UTC
25 points
0 comments · 4 min read · LW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper · 21 May 2024 20:15 UTC
157 points
16 comments · 3 min read · LW link

Analogies between scaling labs and misaligned superintelligent AI

scasper · 21 Feb 2024 19:29 UTC
75 points
5 comments · 4 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper · 5 Dec 2023 16:48 UTC
123 points
30 comments · 13 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments · 2 min read · LW link
(arxiv.org)

The 6D effect: When companies take risks, one email can be very powerful.

scasper · 4 Nov 2023 20:08 UTC
275 points
42 comments · 3 min read · LW link

Announcing the CNN Interpretability Competition

scasper · 26 Sep 2023 16:21 UTC
22 points
0 comments · 4 min read · LW link

Open Problems and Fundamental Limitations of RLHF

scasper · 31 Jul 2023 15:31 UTC
66 points
6 comments · 2 min read · LW link
(arxiv.org)

A Short Memo on AI Interpretability Rainbows

scasper · 27 Jul 2023 23:05 UTC
18 points
0 comments · 2 min read · LW link

Examples of Prompts that Make GPT-4 Output Falsehoods

22 Jul 2023 20:21 UTC
21 points
5 comments · 6 min read · LW link

Eight Strategies for Tackling the Hard Part of the Alignment Problem

scasper · 8 Jul 2023 18:55 UTC
42 points
11 comments · 7 min read · LW link

Takeaways from the Mechanistic Interpretability Challenges

scasper · 8 Jun 2023 18:56 UTC
94 points
5 comments · 6 min read · LW link

Advice for Entering AI Safety Research

scasper · 2 Jun 2023 20:46 UTC
26 points
2 comments · 5 min read · LW link

GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.

scasper · 14 Apr 2023 19:39 UTC
6 points
4 comments · 2 min read · LW link

EIS XII: Summary

scasper · 23 Feb 2023 17:45 UTC
18 points
0 comments · 6 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
4 comments · 3 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
8 comments · 8 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper · 19 Feb 2023 15:25 UTC
30 points
5 comments · 4 min read · LW link