MIRI’s June 2024 Newsletter

Harlan 14 Jun 2024 23:02 UTC
74 points
18 comments · 2 min read · LW link
(intelligence.org)

Language for Goal Misgeneralization: Some Formalisms from my MSc Thesis

Giulio 14 Jun 2024 19:35 UTC
4 points
0 comments · 8 min read · LW link
(www.giuliostarace.com)

Shard Theory—is it true for humans?

Rishika 14 Jun 2024 19:21 UTC
68 points
7 comments · 15 min read · LW link

When fine-tuning fails to elicit GPT-3.5’s chess abilities

Theodore Chapman 14 Jun 2024 18:50 UTC
42 points
3 comments · 9 min read · LW link

Results from the AI x Democracy Research Sprint

14 Jun 2024 16:40 UTC
13 points
0 comments · 6 min read · LW link

Rational Animations’ intro to mechanistic interpretability

Writer 14 Jun 2024 16:10 UTC
45 points
1 comment · 11 min read · LW link
(youtu.be)

Why keep a diary, and why wish for large language models

DanielFilan 14 Jun 2024 16:10 UTC
9 points
1 comment · 2 min read · LW link
(danielfilan.com)

The Leopold Model: Analysis and Reactions

Zvi 14 Jun 2024 15:10 UTC
108 points
19 comments · 57 min read · LW link
(thezvi.wordpress.com)

[Question] Thoughts on Francois Chollet’s belief that LLMs are far away from AGI?

O O 14 Jun 2024 6:32 UTC
26 points
17 comments · 1 min read · LW link

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

Andrew Quaisley 14 Jun 2024 0:57 UTC
17 points
5 comments · 12 min read · LW link

Slowed ASI—a possible technical strategy for alignment

Lester Leong 14 Jun 2024 0:57 UTC
5 points
2 comments · 3 min read · LW link

Conceptual Typography “spells it out”

milanrosko 14 Jun 2024 0:39 UTC
15 points
0 comments · 1 min read · LW link

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch 14 Jun 2024 0:16 UTC
338 points
38 comments · 4 min read · LW link

OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors

Joel Burget 13 Jun 2024 21:28 UTC
35 points
10 comments · 1 min read · LW link
(openai.com)

AI #68: Remarkably Reasonable Reactions

Zvi 13 Jun 2024 16:30 UTC
46 points
11 comments · 50 min read · LW link
(thezvi.wordpress.com)

Four Futures For Cognitive Labor

Maxwell Tabarrok 13 Jun 2024 12:56 UTC
14 points
10 comments · 4 min read · LW link
(www.maximum-progress.com)

Underrated Proverbs

Arjun Panickssery 13 Jun 2024 12:30 UTC
10 points
9 comments · 1 min read · LW link
(arjunpanickssery.substack.com)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments · 2 min read · LW link
(arxiv.org)

AI-created simulations, nature of DOOM

amelia 13 Jun 2024 3:44 UTC
1 point
0 comments · 1 min read · LW link

Probably Not a Ghost Story

George Ingebretsen 12 Jun 2024 22:55 UTC
27 points
4 comments · 3 min read · LW link

AiPhone

Zvi 12 Jun 2024 22:20 UTC
63 points
4 comments · 14 min read · LW link
(thezvi.wordpress.com)

microwave drilling is impractical

bhauth 12 Jun 2024 22:16 UTC
58 points
14 comments · 4 min read · LW link
(www.bhauth.com)

Phonosemantic Duplication

bitcoinssg 12 Jun 2024 20:19 UTC
5 points
0 comments · 1 min read · LW link

My AI Model Delta Compared To Christiano

johnswentworth 12 Jun 2024 18:19 UTC
190 points
73 comments · 4 min read · LW link

AI: 4 levels of impact [micropost]

Mati_Roy 12 Jun 2024 16:58 UTC
8 points
0 comments · 1 min read · LW link

Aggregative principles approximate utilitarian principles

Cleo Nardo 12 Jun 2024 16:27 UTC
28 points
3 comments · 23 min read · LW link

Sticker Shortcut Fallacy — The Real Worst Argument in the World

ymeskhout 12 Jun 2024 14:52 UTC
25 points
15 comments · 4 min read · LW link
(www.ymeskhout.com)

Long-Term Future Fund: May 2023 to March 2024 Payout recommendations

Linch 12 Jun 2024 13:46 UTC
40 points
0 comments · 1 min read · LW link

Anthropic’s Certificate of Incorporation

Zach Stein-Perlman 12 Jun 2024 13:00 UTC
115 points
4 comments · 4 min read · LW link

Calculance: A “Core” Ability

milanrosko 12 Jun 2024 7:21 UTC
4 points
0 comments · 1 min read · LW link

AXRP Episode 33 - RLHF Problems with Scott Emmons

DanielFilan 12 Jun 2024 3:30 UTC
34 points
0 comments · 56 min read · LW link

[New Feature] Your Subscribed Feed

11 Jun 2024 22:45 UTC
69 points
8 comments · 4 min read · LW link

Open Thread Summer 2024

habryka 11 Jun 2024 20:57 UTC
22 points
98 comments · 1 min read · LW link

Can efficiency-adjustable reporting thresholds close a loophole in Biden’s executive order on AI?

Jemal Young 11 Jun 2024 20:56 UTC
4 points
1 comment · 2 min read · LW link

“Full Automation” is a Slippery Metric

ozziegooen 11 Jun 2024 19:56 UTC
30 points
1 comment · 1 min read · LW link

AI takeoff and nuclear war

owencb 11 Jun 2024 19:36 UTC
76 points
6 comments · 11 min read · LW link
(strangecities.substack.com)

[Question] What do people think about the polymarket Eth Etf resolution?

edge_retainer 11 Jun 2024 18:34 UTC
1 point
0 comments · 1 min read · LW link

Let’s Design A School, Part 3.1: Bringing it all together with the Sieve Model

Sable 11 Jun 2024 17:03 UTC
13 points
2 comments · 7 min read · LW link
(affablyevil.substack.com)

How to eliminate cut?

jessicata 11 Jun 2024 15:54 UTC
21 points
0 comments · 14 min read · LW link
(unstableontology.com)

my favourite Scott Sumner blog posts

DMMF 11 Jun 2024 14:40 UTC
26 points
0 comments · 3 min read · LW link
(danfrank.ca)

[Question] Is anyone developing optimisation-robust interpretability methods?

Jono 11 Jun 2024 13:14 UTC
6 points
0 comments · 1 min read · LW link

Keep the Grass Guessing

JackOfAllTrades 11 Jun 2024 7:29 UTC
4 points
0 comments · 2 min read · LW link

“Metastrategic Brainstorming”, a core building-block skill

Raemon 11 Jun 2024 4:27 UTC
57 points
5 comments · 6 min read · LW link

AI Debate Stability: Addressing Self-Defeating Responses

Annie Sorkin 11 Jun 2024 3:03 UTC
9 points
0 comments · 3 min read · LW link

Corrigibility could make things worse

ThomasCederborg 11 Jun 2024 0:55 UTC
9 points
6 comments · 6 min read · LW link

Emotional issues often have an immediate payoff

Chipmonk 10 Jun 2024 23:39 UTC
26 points
2 comments · 4 min read · LW link
(chrislakin.blog)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

tailcalled 10 Jun 2024 21:20 UTC
29 points
13 comments · 2 min read · LW link

Plop! Goes the Concept

Jonathan Moregård 10 Jun 2024 19:23 UTC
6 points
0 comments · 8 min read · LW link
(honestliving.substack.com)

What can we learn from orcas?

Jonasb 10 Jun 2024 18:01 UTC
1 point
0 comments · 8 min read · LW link
(www.denominations.io)

How to build a data center, by Construction Physics

TheManxLoiner 10 Jun 2024 17:38 UTC
2 points
0 comments · 1 min read · LW link
(www.construction-physics.com)