Impact stories for model internals: an exercise for interpretability researchers

jenny

25 Sep 2023 23:15 UTC

29 points

3 comments, 7 min read, LW link

Autonomic Sanity

Sable

25 Sep 2023 22:37 UTC

20 points

9 comments, 4 min read, LW link

(affablyevil.substack.com)

[Question] What is wrong with this “utility switch button problem” approach?

Donald Hobson

25 Sep 2023 21:36 UTC

14 points

3 comments, 1 min read, LW link

You should just smile at strangers a lot

chaosmage

25 Sep 2023 20:12 UTC

13 points

10 comments, 1 min read, LW link

The King and the Golem

Richard_Ngo

25 Sep 2023 19:51 UTC

186 points

18 comments, 5 min read, LW link, 1 review

(narrativeark.substack.com)

Public Opinion on AI Safety: AIMS 2023 and 2021 Summary

Jacy Reese Anthis, Janet Pauketat and Ali

25 Sep 2023 18:55 UTC

3 points

2 comments, 3 min read, LW link

(www.sentienceinstitute.org)

Welcome to Apply: The 2024 Vitalik Buterin Fellowships in AI Existential Safety by FLI!

Zhijing Jin

25 Sep 2023 18:42 UTC

5 points

2 comments, 2 min read, LW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah and shash42

25 Sep 2023 17:19 UTC

25 points

3 comments, 7 min read, LW link

Linkpost: A model of biases as arising from meta-beliefs

JuanGarcia

25 Sep 2023 17:14 UTC

5 points

0 comments, 1 min read, LW link

[Question] What causes a decision theory to be used?

Dagon

25 Sep 2023 16:33 UTC

8 points

2 comments, 1 min read, LW link

Understanding strategic deception and deceptive alignment

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and Dan Braun

25 Sep 2023 16:27 UTC

64 points

16 comments, 7 min read, LW link

(www.apolloresearch.ai)

The Merits of Contrarianism & Why I hate Chatbots. [My Experience with the Ideological Turing Test @ a Less Wrong meetup]

Amina V.

25 Sep 2023 16:13 UTC

4 points

1 comment, 1 min read, LW link

(bimbollectual.com)

Inside Views, Impostor Syndrome, and the Great LARP

johnswentworth

25 Sep 2023 16:08 UTC

331 points

53 comments, 5 min read, LW link

“X distracts from Y” as a thinly-disguised fight over group status / politics

Steven Byrnes

25 Sep 2023 15:18 UTC

108 points

14 comments, 8 min read, LW link

Amazon to invest up to $4 billion in Anthropic

Davis_Kingsley

25 Sep 2023 14:55 UTC

44 points

8 comments, 1 min read, LW link

(twitter.com)

Should Effective Altruists be Valuists instead of utilitarians?

spencerg and AmberDawn

25 Sep 2023 14:03 UTC

1 point

3 comments, 6 min read, LW link

Feedly Breaks MathML

jefftk

25 Sep 2023 13:40 UTC

15 points

3 comments, 1 min read, LW link

(www.jefftk.com)

[Question] How have you become more hard-working?

Chi Nguyen

25 Sep 2023 12:37 UTC

80 points

42 comments, 1 min read, LW link

Automating Intelligence: A Cursory Glance at How AutoML Brings Precision to AI Development

RoscoHunter

25 Sep 2023 9:39 UTC

3 points

0 comments, 3 min read, LW link

Interpreting OpenAI’s Whisper

EllenaR

24 Sep 2023 17:53 UTC

114 points

13 comments, 7 min read, LW link

Contradiction Appeal Bias

onur

24 Sep 2023 17:03 UTC

3 points

2 comments, 1 min read, LW link

RAIN: Your Language Models Can Align Themselves without Finetuning—Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!

Singularian2501

24 Sep 2023 16:48 UTC

5 points

0 comments, 1 min read, LW link

Honor System for Vaccination?

jefftk

24 Sep 2023 11:50 UTC

17 points

22 comments, 1 min read, LW link

(www.jefftk.com)

Far-Future Commitments as a Policy Consensus Strategy

FCCC

24 Sep 2023 6:34 UTC

7 points

40 comments, 1 min read, LW link

Five neglected work areas that could reduce AI risk

CharlotteS and Aaron_Scher

24 Sep 2023 2:03 UTC

17 points

5 comments, 9 min read, LW link

[Question] Are the other Rationality: A-Z sequences coming out as books?

caffeinated_dissonance

24 Sep 2023 0:38 UTC

7 points

3 comments, 1 min read, LW link

The Dick Kick’em Paradox

Augs SMSHacks

23 Sep 2023 22:22 UTC

−5 points

21 comments, 1 min read, LW link

I designed an AI safety course (for a philosophy department)

Eleni Angelou

23 Sep 2023 22:03 UTC

37 points

15 comments, 2 min read, LW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak

23 Sep 2023 19:55 UTC

120 points

74 comments, 4 min read, LW link

(arxiv.org)

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell

23 Sep 2023 19:16 UTC

42 points

7 comments, 34 min read, LW link

[Question] Places to meet interesting middle-aged men?

anon_girl

23 Sep 2023 19:06 UTC

18 points

7 comments, 1 min read, LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné

23 Sep 2023 16:21 UTC

30 points

8 comments, 5 min read, LW link

A quick remark on so-called “hallucinations” in LLMs and humans

Bill Benzon

23 Sep 2023 12:17 UTC

4 points

4 comments, 1 min read, LW link

Hand-writing MathML

jefftk

23 Sep 2023 11:20 UTC

16 points

40 comments, 1 min read, LW link

(www.jefftk.com)

Musk, Starlink, and Crimea

Nicholas / Heather Kross

23 Sep 2023 2:35 UTC

−13 points

0 comments, 5 min read, LW link

[Linkpost/Video] All The Times We Nearly Blew Up The World

Jacob G-W

23 Sep 2023 1:18 UTC

6 points

1 comment, 1 min read, LW link

(www.youtube.com)

Luck based medicine: inositol for anxiety and brain fog

Elizabeth

22 Sep 2023 20:10 UTC

40 points

5 comments, 3 min read, LW link

(acesounderglass.com)

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger

22 Sep 2023 14:23 UTC

66 points

5 comments, 3 min read, LW link

Modeling p(doom) with TrojanGDP

K. Liam Smith

22 Sep 2023 14:19 UTC

−2 points

2 comments, 13 min read, LW link

Let’s talk about Impostor syndrome in AI safety

Igor Ivanov

22 Sep 2023 13:51 UTC

29 points

4 comments, 3 min read, LW link

Fund Transit With Development

jefftk

22 Sep 2023 11:10 UTC

47 points

22 comments, 3 min read, LW link

(www.jefftk.com)

Atoms to Agents Proto-Lectures

johnswentworth

22 Sep 2023 6:22 UTC

93 points

14 comments, 2 min read, LW link

(www.youtube.com)

Would You Work Harder In The Least Convenient Possible World?

Firinn

22 Sep 2023 5:17 UTC

104 points

98 comments, 9 min read, LW link, 2 reviews

Contra Kevin Dorst’s Rational Polarization

azsantosk

22 Sep 2023 4:28 UTC

8 points

2 comments, 9 min read, LW link

ACX Boston—Petrov Day 2023

duck_master

22 Sep 2023 1:13 UTC

2 points

0 comments, 1 min read, LW link

What social science research do you want to see reanalyzed?

Michael Wiebe

22 Sep 2023 0:03 UTC

14 points

9 comments, 1 min read, LW link

Immortality or death by AGI

ImmortalityOrDeathByAGI

21 Sep 2023 23:59 UTC

47 points

30 comments, 4 min read, LW link

(forum.effectivealtruism.org)

Neel Nanda on the Mechanistic Interpretability Researcher Mindset

Michaël Trazzi

21 Sep 2023 19:47 UTC

37 points

1 comment, 3 min read, LW link

(theinsideview.ai)

Require AGI to be Explainable

PeterMcCluskey

21 Sep 2023 16:11 UTC

5 points

0 comments, 6 min read, LW link

(bayesianinvestor.com)

Update to “Dominant Assurance Contract Platform”

moyamo

21 Sep 2023 16:09 UTC

32 points

1 comment, 1 min read, LW link