
Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
112 points
24 comments · 9 min read · LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

27 Apr 2024 16:04 UTC
52 points
7 comments · 13 min read · LW link

On Not Pulling The Ladder Up Behind You

Screwtape · 26 Apr 2024 21:58 UTC
95 points
5 comments · 9 min read · LW link

So What’s Up With PUFAs Chemically?

J Bostock · 27 Apr 2024 13:32 UTC
35 points
16 comments · 6 min read · LW link

Duct Tape security

Isaac King · 26 Apr 2024 18:57 UTC
66 points
8 comments · 5 min read · LW link

Superposition is not “just” neuron polysemanticity

LawrenceC · 26 Apr 2024 23:22 UTC
46 points
3 comments · 13 min read · LW link

[Question] Examples of Highly Counterfactual Discoveries?

johnswentworth · 23 Apr 2024 22:19 UTC
156 points
83 comments · 1 min read · LW link

The first future and the best future

KatjaGrace · 25 Apr 2024 6:40 UTC
97 points
9 comments · 1 min read · LW link
(worldspiritsockpuppet.com)

Spatial attention as a “tell” for empathetic simulation?

Steven Byrnes · 26 Apr 2024 15:10 UTC
49 points
9 comments · 8 min read · LW link

D&D.Sci Long War: Defender of Data-mocracy

aphyer · 26 Apr 2024 22:30 UTC
38 points
8 comments · 3 min read · LW link

Thoughts on seed oil

dynomight · 20 Apr 2024 12:29 UTC
260 points
80 comments · 17 min read · LW link
(dynomight.net)

We are headed into an extreme compute overhang

devrandom · 26 Apr 2024 21:38 UTC
33 points
12 comments · 2 min read · LW link

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
118 points
14 comments · 1 min read · LW link
(www.anthropic.com)

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
60 points
23 comments · 1 min read · LW link
(arxiv.org)

Scaling of AI training runs will slow down after GPT-5

Maxime Riché · 26 Apr 2024 16:05 UTC
34 points
5 comments · 3 min read · LW link

Two Vernor Vinge Book Reviews

Maxwell Tabarrok · 27 Apr 2024 12:14 UTC
13 points
0 comments · 2 min read · LW link
(www.maximum-progress.com)

Link: Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models by Jacob Pfau, William Merrill & Samuel R. Bowman

Chris_Leong · 27 Apr 2024 13:22 UTC
11 points
0 comments · 1 min read · LW link
(twitter.com)

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
34 points
0 comments · 8 min read · LW link

Mercy to the Machine: Thoughts & Rights

False Name · 27 Apr 2024 16:36 UTC
8 points
5 comments · 17 min read · LW link

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
310 points
64 comments · 12 min read · LW link