Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

191 points

82 comments

10 min read

Open Thread Spring 2024

habryka

11 Mar 2024 19:17 UTC

22 points

144 comments

1 min read

Language Models Model Us

eggsyntax

17 May 2024 21:00 UTC

134 points

43 comments

7 min read

[Question] Are most people deeply confused about “love”, or am I missing a human universal?

SpectrumDT

23 May 2024 13:22 UTC

2 points

3 comments

3 min read

The Button (Short Comic)

milanrosko

22 May 2024 23:28 UTC

2 points

1 comment

1 min read

Let’s make the truth easier to find

DPiepgrass

20 Mar 2023 4:28 UTC

24 points

44 comments

1 min read

Why entropy means you might not have to worry as much about superintelligent AI

Ron J

23 May 2024 3:52 UTC

−13 points

1 comment

2 min read

A Bi-Modal Brain Model

Johannes C. Mayer

22 May 2024 20:10 UTC

12 points

2 comments

2 min read

What will the first human-level AI look like, and how might things go wrong?

EuanMcLean

23 May 2024 11:17 UTC

5 points

1 comment

15 min read

“Reframing Superintelligence” + LLMs + 4 years

Eric Drexler

10 Jul 2023 13:42 UTC

117 points

9 comments

12 min read

How to train your own “Sleeper Agents”

evhub

7 Feb 2024 0:31 UTC

91 points

11 comments

2 min read

[Question] SAE sparse feature graph using only residual layers

crayhippo

23 May 2024 13:32 UTC

0 points

0 comments

1 min read

Power Law Policy

Ben Turtel

23 May 2024 5:28 UTC

9 points

2 comments

6 min read

(bturtel.substack.com)

[Question] Which skincare products are evidence-based?

Vanessa Kosoy

2 May 2024 15:22 UTC

108 points

44 comments

1 min read

Each Llama3-8b text uses a different “random” subspace of the activation space

tailcalled

22 May 2024 7:31 UTC

3 points

4 comments

7 min read

What’s Going on With OpenAI’s Messaging?

ozziegooen

21 May 2024 2:22 UTC

158 points

12 comments

1 min read

Executive Dysfunction 101

DaystarEld

23 May 2024 12:43 UTC

7 points

0 comments

3 min read

(daystareld.com)

AI #65: I Spy With My AI

Zvi

23 May 2024 12:40 UTC

12 points

0 comments

43 min read

(thezvi.wordpress.com)

The predictive power of dissipative adaptation

dr_s

17 Dec 2023 14:01 UTC

45 points

13 comments

19 min read

What mistakes has the AI safety movement made?

EuanMcLean

23 May 2024 11:19 UTC

18 points

0 comments

12 min read