scasper

Karma: 1,900

https://stephencasper.com/

EIS XIV: Is mechanistic interpretability about to be practically useful?

scasper

11 Oct 2024 22:13 UTC

68 points

4 comments, 7 min read, LW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper

30 Jul 2024 14:57 UTC

25 points

0 comments, 4 min read, LW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper

21 May 2024 20:15 UTC

157 points

16 comments, 3 min read, LW link

Analogies between scaling labs and misaligned superintelligent AI

scasper

21 Feb 2024 19:29 UTC

75 points

5 comments, 4 min read, LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper

5 Dec 2023 16:48 UTC

123 points

30 comments, 13 min read, LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

36 points

2 comments, 2 min read, LW link

(arxiv.org)

The 6D effect: When companies take risks, one email can be very powerful.

scasper

4 Nov 2023 20:08 UTC

275 points

42 comments, 3 min read, LW link

Announcing the CNN Interpretability Competition

scasper

26 Sep 2023 16:21 UTC

22 points

0 comments, 4 min read, LW link

Open Problems and Fundamental Limitations of RLHF

scasper

31 Jul 2023 15:31 UTC

66 points

6 comments, 2 min read, LW link

(arxiv.org)

A Short Memo on AI Interpretability Rainbows

scasper

27 Jul 2023 23:05 UTC

18 points

0 comments, 2 min read, LW link

Examples of Prompts that Make GPT-4 Output Falsehoods

scasper and Luke Bailey

22 Jul 2023 20:21 UTC

21 points

5 comments, 6 min read, LW link

Eight Strategies for Tackling the Hard Part of the Alignment Problem

scasper

8 Jul 2023 18:55 UTC

42 points

11 comments, 7 min read, LW link

Takeaways from the Mechanistic Interpretability Challenges

scasper

8 Jun 2023 18:56 UTC

94 points

5 comments, 6 min read, LW link

Advice for Entering AI Safety Research

scasper

2 Jun 2023 20:46 UTC

26 points

2 comments, 5 min read, LW link

GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.

scasper

14 Apr 2023 19:39 UTC

6 points

4 comments, 2 min read, LW link

EIS XII: Summary

scasper

23 Feb 2023 17:45 UTC

18 points

0 comments, 6 min read, LW link

EIS XI: Moving Forward

scasper

22 Feb 2023 19:05 UTC

19 points

2 comments, 9 min read, LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper

21 Feb 2023 16:59 UTC

14 points

4 comments, 3 min read, LW link

EIS IX: Interpretability and Adversaries

scasper

20 Feb 2023 18:25 UTC

30 points

8 comments, 8 min read, LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper

19 Feb 2023 15:25 UTC

30 points

5 comments, 4 min read, LW link