scasper

Karma: 2,006

https://stephencasper.com/

Reframing AI Safety as a Neverending Institutional Challenge

scasper
Mar 23, 2025, 12:13 AM
52 points
12 comments
5 min read

EIS XV: A New Proof of Concept for Useful Interpretability

scasper
Mar 17, 2025, 8:05 PM
30 points
2 comments
3 min read

EIS XIV: Is mechanistic interpretability about to be practically useful?

scasper
Oct 11, 2024, 10:13 PM
68 points
4 comments
7 min read

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper
Jul 30, 2024, 2:57 PM
25 points
0 comments
4 min read

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper
May 21, 2024, 8:15 PM
157 points
16 comments
3 min read

Analogies between scaling labs and misaligned superintelligent AI

scasper
Feb 21, 2024, 7:29 PM
77 points
5 comments
4 min read

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper
Dec 5, 2023, 4:48 PM
125 points
30 comments
13 min read

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Nov 7, 2023, 5:59 PM
38 points
2 comments
2 min read
(arxiv.org)

The 6D effect: When companies take risks, one email can be very powerful.

scasper
Nov 4, 2023, 8:08 PM
278 points
42 comments
3 min read

Announcing the CNN Interpretability Competition

scasper
Sep 26, 2023, 4:21 PM
22 points
0 comments
4 min read

Open Problems and Fundamental Limitations of RLHF

scasper
Jul 31, 2023, 3:31 PM
66 points
6 comments
2 min read
(arxiv.org)

A Short Memo on AI Interpretability Rainbows

scasper
Jul 27, 2023, 11:05 PM
18 points
0 comments
2 min read

Examples of Prompts that Make GPT-4 Output Falsehoods

Jul 22, 2023, 8:21 PM
21 points
5 comments
6 min read

Eight Strategies for Tackling the Hard Part of the Alignment Problem

scasper
Jul 8, 2023, 6:55 PM
42 points
11 comments
7 min read

Takeaways from the Mechanistic Interpretability Challenges

scasper
Jun 8, 2023, 6:56 PM
94 points
5 comments
6 min read

Advice for Entering AI Safety Research

scasper
Jun 2, 2023, 8:46 PM
26 points
2 comments
5 min read

GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.

scasper
Apr 14, 2023, 7:39 PM
6 points
4 comments
2 min read

EIS XII: Summary

scasper
Feb 23, 2023, 5:45 PM
18 points
0 comments
6 min read

EIS XI: Moving Forward

scasper
Feb 22, 2023, 7:05 PM
19 points
2 comments
9 min read

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper
Feb 21, 2023, 4:59 PM
14 points
4 comments
3 min read