scasper (Stephen Casper) · Karma: 1,565 · https://stephencasper.com/
Analogies between scaling labs and misaligned superintelligent AI
scasper · 21 Feb 2024 19:29 UTC · 72 points · 4 comments · 4 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs
scasper · 5 Dec 2023 16:48 UTC · 109 points · 29 comments · 13 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper · 7 Nov 2023 17:59 UTC · 36 points · 2 comments · 2 min read · LW link (arxiv.org)

The 6D effect: When companies take risks, one email can be very powerful.
scasper · 4 Nov 2023 20:08 UTC · 261 points · 40 comments · 3 min read · LW link

Announcing the CNN Interpretability Competition
scasper · 26 Sep 2023 16:21 UTC · 22 points · 0 comments · 4 min read · LW link

Open Problems and Fundamental Limitations of RLHF
scasper · 31 Jul 2023 15:31 UTC · 66 points · 6 comments · 2 min read · LW link (arxiv.org)

A Short Memo on AI Interpretability Rainbows
scasper · 27 Jul 2023 23:05 UTC · 18 points · 0 comments · 2 min read · LW link

Examples of Prompts that Make GPT-4 Output Falsehoods
scasper and Luke Bailey · 22 Jul 2023 20:21 UTC · 21 points · 5 comments · 6 min read · LW link

Eight Strategies for Tackling the Hard Part of the Alignment Problem
scasper · 8 Jul 2023 18:55 UTC · 42 points · 11 comments · 7 min read · LW link

Takeaways from the Mechanistic Interpretability Challenges
scasper · 8 Jun 2023 18:56 UTC · 93 points · 5 comments · 6 min read · LW link

Advice for Entering AI Safety Research
scasper · 2 Jun 2023 20:46 UTC · 25 points · 2 comments · 5 min read · LW link

GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.
scasper · 14 Apr 2023 19:39 UTC · 6 points · 4 comments · 2 min read · LW link

EIS XII: Summary
scasper · 23 Feb 2023 17:45 UTC · 17 points · 0 comments · 6 min read · LW link

EIS XI: Moving Forward
scasper · 22 Feb 2023 19:05 UTC · 19 points · 2 comments · 9 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains
scasper · 21 Feb 2023 16:59 UTC · 14 points · 4 comments · 3 min read · LW link

EIS IX: Interpretability and Adversaries
scasper · 20 Feb 2023 18:25 UTC · 30 points · 7 comments · 8 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment
scasper · 19 Feb 2023 15:25 UTC · 20 points · 5 comments · 4 min read · LW link

EIS VII: A Challenge for Mechanists
scasper · 18 Feb 2023 18:27 UTC · 35 points · 4 comments · 3 min read · LW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
scasper · 17 Feb 2023 20:48 UTC · 48 points · 9 comments · 12 min read · LW link

EIS V: Blind Spots In AI Safety Interpretability Research
scasper · 16 Feb 2023 19:09 UTC · 54 points · 23 comments · 13 min read · LW link