scasper (Stephen Casper)

Karma: 1,565

https://stephencasper.com/

Analogies between scaling labs and misaligned superintelligent AI

scasper · 21 Feb 2024 19:29 UTC
72 points
4 comments · 4 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper · 5 Dec 2023 16:48 UTC
109 points
29 comments · 13 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments · 2 min read · LW link
(arxiv.org)

The 6D effect: When companies take risks, one email can be very powerful.

scasper · 4 Nov 2023 20:08 UTC
261 points
40 comments · 3 min read · LW link

Announcing the CNN Interpretability Competition

scasper · 26 Sep 2023 16:21 UTC
22 points
0 comments · 4 min read · LW link

Open Problems and Fundamental Limitations of RLHF

scasper · 31 Jul 2023 15:31 UTC
66 points
6 comments · 2 min read · LW link
(arxiv.org)

A Short Memo on AI Interpretability Rainbows

scasper · 27 Jul 2023 23:05 UTC
18 points
0 comments · 2 min read · LW link

Examples of Prompts that Make GPT-4 Output Falsehoods

22 Jul 2023 20:21 UTC
21 points
5 comments · 6 min read · LW link

Eight Strategies for Tackling the Hard Part of the Alignment Problem

scasper · 8 Jul 2023 18:55 UTC
42 points
11 comments · 7 min read · LW link

Takeaways from the Mechanistic Interpretability Challenges

scasper · 8 Jun 2023 18:56 UTC
93 points
5 comments · 6 min read · LW link

Advice for Entering AI Safety Research

scasper · 2 Jun 2023 20:46 UTC
25 points
2 comments · 5 min read · LW link

GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.

scasper · 14 Apr 2023 19:39 UTC
6 points
4 comments · 2 min read · LW link

EIS XII: Summary

scasper · 23 Feb 2023 17:45 UTC
17 points
0 comments · 6 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
4 comments · 3 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
7 comments · 8 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper · 19 Feb 2023 15:25 UTC
20 points
5 comments · 4 min read · LW link

EIS VII: A Challenge for Mechanists

scasper · 18 Feb 2023 18:27 UTC
35 points
4 comments · 3 min read · LW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

scasper · 17 Feb 2023 20:48 UTC
48 points
9 comments · 12 min read · LW link

EIS V: Blind Spots In AI Safety Interpretability Research

scasper · 16 Feb 2023 19:09 UTC
54 points
23 comments · 13 min read · LW link