The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith 31 Jul 2023 23:06 UTC

55 points

13 comments 9 min read LW link

A Social History of Truth

Vaniver 31 Jul 2023 22:49 UTC

64 points

2 comments 14 min read LW link

Watermarking considered overrated?

DanielFilan 31 Jul 2023 21:36 UTC

19 points

4 comments 1 min read LW link

What The Lord of the Rings Teaches Us About AI Alignment

Jeffrey Heninger 31 Jul 2023 20:16 UTC

24 points

12 comments 7 min read LW link

The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited

mwatkins 31 Jul 2023 19:47 UTC

85 points

29 comments 20 min read LW link

“Building a House” Review

jefftk 31 Jul 2023 19:20 UTC

62 points

6 comments 1 min read LW link

(www.jefftk.com)

The Meaning of Shoggoth AI Memes

Dan Smith 31 Jul 2023 18:52 UTC

−5 points

5 comments 2 min read LW link

[Question] Is there any existing term summarizing non-scalable oversight methods in outer alignment?

Allen Shen 31 Jul 2023 17:31 UTC

1 point

0 comments 1 min read LW link

Lack of Social Grace Is an Epistemic Virtue

Zack_M_Davis 31 Jul 2023 16:38 UTC

41 points

104 comments 4 min read LW link 2 reviews

Thoughts on sharing information about language model capabilities

paulfchristiano 31 Jul 2023 16:04 UTC

208 points

44 comments 11 min read LW link 1 review

Trading off compute in training and inference (Overview)

Pablo Villalobos 31 Jul 2023 16:03 UTC

42 points

2 comments 7 min read LW link

(epochai.org)

Open Problems and Fundamental Limitations of RLHF

scasper 31 Jul 2023 15:31 UTC

66 points

6 comments 2 min read LW link

(arxiv.org)

“Not Necessarily”

Benjamin Hendricks 31 Jul 2023 15:19 UTC

24 points

2 comments 2 min read LW link

How to find AI alignment researchers to collaborate with?

Florian Dietz 31 Jul 2023 9:05 UTC

2 points

2 comments 1 min read LW link

[Question] Is Kennedy a Nazi?

Pee Doom 31 Jul 2023 8:51 UTC

−12 points

10 comments 2 min read LW link

Is Light Drinking Protective?

jefftk 31 Jul 2023 3:00 UTC

45 points

8 comments 2 min read LW link

(www.jefftk.com)

EU’s AI ambitions at risk as US pushes to water down international treaty (linkpost)

mic 31 Jul 2023 0:34 UTC

10 points

0 comments 4 min read LW link

(www.euractiv.com)

The rise of AI in cybercrime

BobyResearcher 30 Jul 2023 20:19 UTC

−15 points

1 comment 2 min read LW link

(riseofAIincybercryme)

SSA vs. SIA: how future population may provide evidence for or against the foundations of political liberalism

j 30 Jul 2023 20:18 UTC

−6 points

10 comments 55 min read LW link

Rationalization Maximizes Expected Value

Kevin Dorst 30 Jul 2023 20:11 UTC

19 points

10 comments 7 min read LW link

(kevindorst.substack.com)

Apollo Neuro Results

Elizabeth 30 Jul 2023 18:40 UTC

85 points

17 comments 3 min read LW link

(acesounderglass.com)

Hilbert’s Triumph, Church and Turing’s failure, and what it means (Post #2)

Noosphere89 30 Jul 2023 14:33 UTC

−5 points

16 comments 15 min read LW link

[Question] Specific Arguments against open source LLMs?

Iknownothing 30 Jul 2023 14:27 UTC

4 points

2 comments 1 min read LW link

Socialism in large organizations

Adam Zerner 30 Jul 2023 7:25 UTC

7 points

16 comments 2 min read LW link

How to make real-money prediction markets on arbitrary topics (Outdated)

yutaka 30 Jul 2023 2:11 UTC

57 points

13 comments 3 min read LW link

[Question] Does decidability of a theory imply completeness of the theory?

Noosphere89 29 Jul 2023 23:53 UTC

6 points

12 comments 1 min read LW link

[Question] If I showed the EQ-SQ theory’s findings to be due to measurement bias, would anyone change their minds about it?

tailcalled 29 Jul 2023 19:38 UTC

23 points

13 comments 1 min read LW link

Self-driving car bets

paulfchristiano 29 Jul 2023 18:10 UTC

234 points

43 comments 5 min read LW link

(sideways-view.com)

The Parable of the Dagger—The Animation

Writer 29 Jul 2023 14:03 UTC

20 points

6 comments 1 min read LW link

(youtu.be)

Are Guitars Obsolete?

jefftk 29 Jul 2023 13:20 UTC

11 points

8 comments 2 min read LW link

(www.jefftk.com)

NAMSI: A promising approach to alignment

Georgeo57 29 Jul 2023 7:03 UTC

−6 points

6 comments 1 min read LW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire Short 29 Jul 2023 6:10 UTC

26 points

0 comments 12 min read LW link

Why You Should Never Update Your Beliefs

Arjun Panickssery 29 Jul 2023 0:27 UTC

76 points

18 comments 4 min read LW link 1 review

(arjunpanickssery.substack.com)

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG 28 Jul 2023 20:44 UTC

23 points

5 comments 20 min read LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen 28 Jul 2023 19:43 UTC

13 points

3 comments 13 min read LW link

When can we trust model evaluations?

evhub 28 Jul 2023 19:42 UTC

160 points

10 comments 10 min read LW link 1 review

Yes, It’s Subjective, But Why All The Crabs?

johnswentworth 28 Jul 2023 19:35 UTC

248 points

15 comments 6 min read LW link

Semaglutide and Muscle

5hout 28 Jul 2023 18:36 UTC

15 points

14 comments 5 min read LW link

Double Crux in a Box

Screwtape 28 Jul 2023 17:55 UTC

8 points

3 comments 1 min read LW link

AI Safety 101 : Introduction to Vision Interpretability

jeanne_ and Charbel-Raphaël 28 Jul 2023 17:32 UTC

41 points

0 comments 1 min read LW link

(github.com)

Visible loss landscape basins don’t correspond to distinct algorithms

Mikhail Samin 28 Jul 2023 16:19 UTC

68 points

13 comments 4 min read LW link

Progress links digest, 2023-07-28: The decadent opulence of modern capitalism

jasoncrawford 28 Jul 2023 14:36 UTC

16 points

3 comments 3 min read LW link

(rootsofprogress.org)

AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik 28 Jul 2023 8:41 UTC

7 points

5 comments 3 min read LW link

You don’t get to have cool flaws

Neil 28 Jul 2023 5:37 UTC

59 points

22 comments 2 min read LW link 3 reviews

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery 28 Jul 2023 2:46 UTC

122 points

17 comments 9 min read LW link

Mech Interp Puzzle 2: Word2Vec Style Embeddings

Neel Nanda 28 Jul 2023 0:50 UTC

40 points

4 comments 2 min read LW link

ETFE windows

bhauth 28 Jul 2023 0:46 UTC

30 points

4 comments 2 min read LW link

(www.bhauth.com)

A Short Memo on AI Interpretability Rainbows

scasper 27 Jul 2023 23:05 UTC

18 points

0 comments 2 min read LW link

Pulling the Rope Sideways: Empirical Test Results

Daniel Kokotajlo 27 Jul 2023 22:18 UTC

61 points

18 comments 1 min read LW link

A $10k retroactive grant for VaccinateCA

Austin Chen 27 Jul 2023 18:14 UTC

82 points

0 comments 1 min read LW link

(manifund.org)