RSS

scasper

Karma: 1,900

https://​​stephencasper.com/​​

EIS XIV: Is mechanis­tic in­ter­pretabil­ity about to be prac­ti­cally use­ful?

scasper11 Oct 2024 22:13 UTC
68 points
4 comments7 min readLW link

Can Gen­er­al­ized Ad­ver­sar­ial Test­ing En­able More Ri­gor­ous LLM Safety Evals?

scasper30 Jul 2024 14:57 UTC
25 points
0 comments4 min readLW link

EIS XIII: Reflec­tions on An­thropic’s SAE Re­search Circa May 2024

scasper21 May 2024 20:15 UTC
157 points
16 comments3 min readLW link

Analo­gies be­tween scal­ing labs and mis­al­igned su­per­in­tel­li­gent AI

scasper21 Feb 2024 19:29 UTC
75 points
5 comments4 min readLW link

Deep For­get­ting & Un­learn­ing for Safely-Scoped LLMs

scasper5 Dec 2023 16:48 UTC
123 points
30 comments13 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments2 min readLW link
(arxiv.org)