MIRI’s June 2024 Newsletter

Harlan · 14 Jun 2024 23:02 UTC

74 points

18 comments · 2 min read

(intelligence.org)

Language for Goal Misgeneralization: Some Formalisms from my MSc Thesis

Giulio · 14 Jun 2024 19:35 UTC

4 points

0 comments · 8 min read

(www.giuliostarace.com)

Shard Theory—is it true for humans?

Rishika · 14 Jun 2024 19:21 UTC

68 points

7 comments · 15 min read

When fine-tuning fails to elicit GPT-3.5′s chess abilities

Theodore Chapman · 14 Jun 2024 18:50 UTC

42 points

3 comments · 9 min read

Results from the AI x Democracy Research Sprint

Esben Kran, jordine and Jason Hoelscher-Obermaier

14 Jun 2024 16:40 UTC

13 points

0 comments · 6 min read

Rational Animations’ intro to mechanistic interpretability

Writer · 14 Jun 2024 16:10 UTC

45 points

1 comment · 11 min read

(youtu.be)

Why keep a diary, and why wish for large language models

DanielFilan · 14 Jun 2024 16:10 UTC

9 points

1 comment · 2 min read

(danielfilan.com)

The Leopold Model: Analysis and Reactions

Zvi · 14 Jun 2024 15:10 UTC

108 points

19 comments · 57 min read

(thezvi.wordpress.com)

[Question] Thoughts on Francois Chollet’s belief that LLMs are far away from AGI?

O O · 14 Jun 2024 6:32 UTC

26 points

17 comments · 1 min read

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Andrew Quaisley · 14 Jun 2024 0:57 UTC

17 points

5 comments · 12 min read

Slowed ASI—a possible technical strategy for alignment

Lester Leong · 14 Jun 2024 0:57 UTC

5 points

2 comments · 3 min read

Conceptual Typography “spells it out”

milanrosko · 14 Jun 2024 0:39 UTC

15 points

0 comments · 1 min read

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch · 14 Jun 2024 0:16 UTC

338 points

38 comments · 4 min read

OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors

Joel Burget · 13 Jun 2024 21:28 UTC

35 points

10 comments · 1 min read

(openai.com)

AI #68: Remarkably Reasonable Reactions

Zvi · 13 Jun 2024 16:30 UTC

46 points

11 comments · 50 min read

(thezvi.wordpress.com)

Four Futures For Cognitive Labor

Maxwell Tabarrok · 13 Jun 2024 12:56 UTC

14 points

10 comments · 4 min read

(www.maximum-progress.com)

Underrated Proverbs

Arjun Panickssery · 13 Jun 2024 12:30 UTC

10 points

9 comments · 1 min read

(arjunpanickssery.substack.com)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

13 Jun 2024 10:04 UTC

84 points

10 comments · 2 min read

(arxiv.org)

AI-created simulations, nature of DOOM

amelia · 13 Jun 2024 3:44 UTC

1 point

0 comments · 1 min read

Probably Not a Ghost Story

George Ingebretsen · 12 Jun 2024 22:55 UTC

27 points

4 comments · 3 min read

AiPhone

Zvi · 12 Jun 2024 22:20 UTC

63 points

4 comments · 14 min read

(thezvi.wordpress.com)

microwave drilling is impractical

bhauth · 12 Jun 2024 22:16 UTC

58 points

14 comments · 4 min read

(www.bhauth.com)

Phonosemantic Duplication

bitcoinssg · 12 Jun 2024 20:19 UTC

5 points

0 comments · 1 min read

My AI Model Delta Compared To Christiano

johnswentworth · 12 Jun 2024 18:19 UTC

190 points

73 comments · 4 min read

AI: 4 levels of impact [micropost]

Mati_Roy · 12 Jun 2024 16:58 UTC

8 points

0 comments · 1 min read

Aggregative principles approximate utilitarian principles

Cleo Nardo · 12 Jun 2024 16:27 UTC

28 points

3 comments · 23 min read

Sticker Shortcut Fallacy — The Real Worst Argument in the World

ymeskhout · 12 Jun 2024 14:52 UTC

25 points

15 comments · 4 min read

(www.ymeskhout.com)

Long-Term Future Fund: May 2023 to March 2024 Payout recommendations

Linch · 12 Jun 2024 13:46 UTC

40 points

0 comments · 1 min read

Anthropic’s Certificate of Incorporation

Zach Stein-Perlman · 12 Jun 2024 13:00 UTC

115 points

4 comments · 4 min read

Calculance: A “Core” Ability

milanrosko · 12 Jun 2024 7:21 UTC

4 points

0 comments · 1 min read

AXRP Episode 33 - RLHF Problems with Scott Emmons

DanielFilan · 12 Jun 2024 3:30 UTC

34 points

0 comments · 56 min read

[New Feature] Your Subscribed Feed

Ruby and RobertM

11 Jun 2024 22:45 UTC

69 points

8 comments · 4 min read

Open Thread Summer 2024

habryka · 11 Jun 2024 20:57 UTC

22 points

98 comments · 1 min read

Can efficiency-adjustable reporting thresholds close a loophole in Biden’s executive order on AI?

Jemal Young · 11 Jun 2024 20:56 UTC

4 points

1 comment · 2 min read

“Full Automation” is a Slippery Metric

ozziegooen · 11 Jun 2024 19:56 UTC

30 points

1 comment · 1 min read

AI takeoff and nuclear war

owencb · 11 Jun 2024 19:36 UTC

76 points

6 comments · 11 min read

(strangecities.substack.com)

[Question] What do people think about the polymarket Eth Etf resolution?

edge_retainer · 11 Jun 2024 18:34 UTC

1 point

0 comments · 1 min read

Let’s Design A School, Part 3.1: Bringing it all together with the Sieve Model

Sable · 11 Jun 2024 17:03 UTC

13 points

2 comments · 7 min read

(affablyevil.substack.com)

How to eliminate cut?

jessicata · 11 Jun 2024 15:54 UTC

21 points

0 comments · 14 min read

(unstableontology.com)

my favourite Scott Sumner blog posts

DMMF · 11 Jun 2024 14:40 UTC

26 points

0 comments · 3 min read

(danfrank.ca)

[Question] Is anyone developing optimisation-robust interpretability methods?

Jono · 11 Jun 2024 13:14 UTC

6 points

0 comments · 1 min read

Keep the Grass Guessing

JackOfAllTrades · 11 Jun 2024 7:29 UTC

4 points

0 comments · 2 min read

“Metastrategic Brainstorming”, a core building-block skill

Raemon · 11 Jun 2024 4:27 UTC

57 points

5 comments · 6 min read

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin · 11 Jun 2024 3:03 UTC

9 points

0 comments · 3 min read

Corrigibility could make things worse

ThomasCederborg · 11 Jun 2024 0:55 UTC

9 points

6 comments · 6 min read

Emotional issues often have an immediate payoff

Chipmonk · 10 Jun 2024 23:39 UTC

26 points

2 comments · 4 min read

(chrislakin.blog)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

tailcalled · 10 Jun 2024 21:20 UTC

29 points

13 comments · 2 min read

Plop! Goes the Concept

Jonathan Moregård · 10 Jun 2024 19:23 UTC

6 points

0 comments · 8 min read

(honestliving.substack.com)

What can we learn from orcas?

Jonasb · 10 Jun 2024 18:01 UTC

1 point

0 comments · 8 min read

(www.denominations.io)

How to build a data center, by Construction Physics

TheManxLoiner · 10 Jun 2024 17:38 UTC

2 points

0 comments · 1 min read

(www.construction-physics.com)