scasper
Karma: 1,900
https://stephencasper.com/
EIS XIV: Is mechanistic interpretability about to be practically useful?
scasper | 11 Oct 2024 22:13 UTC | 68 points | 4 comments | 7 min read
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper | 30 Jul 2024 14:57 UTC | 25 points | 0 comments | 4 min read
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
scasper | 21 May 2024 20:15 UTC | 157 points | 16 comments | 3 min read
Analogies between scaling labs and misaligned superintelligent AI
scasper | 21 Feb 2024 19:29 UTC | 75 points | 5 comments | 4 min read
Deep Forgetting & Unlearning for Safely-Scoped LLMs
scasper | 5 Dec 2023 16:48 UTC | 123 points | 30 comments | 13 min read
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper | 7 Nov 2023 17:59 UTC | 36 points | 2 comments | 2 min read | (arxiv.org)
The 6D effect: When companies take risks, one email can be very powerful.
scasper | 4 Nov 2023 20:08 UTC | 275 points | 42 comments | 3 min read
Announcing the CNN Interpretability Competition
scasper | 26 Sep 2023 16:21 UTC | 22 points | 0 comments | 4 min read
Open Problems and Fundamental Limitations of RLHF
scasper | 31 Jul 2023 15:31 UTC | 66 points | 6 comments | 2 min read | (arxiv.org)
A Short Memo on AI Interpretability Rainbows
scasper | 27 Jul 2023 23:05 UTC | 18 points | 0 comments | 2 min read
Examples of Prompts that Make GPT-4 Output Falsehoods
scasper and Luke Bailey | 22 Jul 2023 20:21 UTC | 21 points | 5 comments | 6 min read
Eight Strategies for Tackling the Hard Part of the Alignment Problem
scasper | 8 Jul 2023 18:55 UTC | 42 points | 11 comments | 7 min read
Takeaways from the Mechanistic Interpretability Challenges
scasper | 8 Jun 2023 18:56 UTC | 94 points | 5 comments | 6 min read
Advice for Entering AI Safety Research
scasper | 2 Jun 2023 20:46 UTC | 26 points | 2 comments | 5 min read
GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.
scasper | 14 Apr 2023 19:39 UTC | 6 points | 4 comments | 2 min read
EIS XII: Summary
scasper | 23 Feb 2023 17:45 UTC | 18 points | 0 comments | 6 min read
EIS XI: Moving Forward
scasper | 22 Feb 2023 19:05 UTC | 19 points | 2 comments | 9 min read
EIS X: Continual Learning, Modularity, Compression, and Biological Brains
scasper | 21 Feb 2023 16:59 UTC | 14 points | 4 comments | 3 min read
EIS IX: Interpretability and Adversaries
scasper | 20 Feb 2023 18:25 UTC | 30 points | 8 comments | 8 min read
EIS VIII: An Engineer’s Understanding of Deceptive Alignment
scasper | 19 Feb 2023 15:25 UTC | 30 points | 5 comments | 4 min read