Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · 5 Oct 2023 21:01 UTC
288 points
22 comments · 2 min read · LW link · 1 review
(transformer-circuits.pub)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_Davis · 21 Oct 2023 15:22 UTC
247 points
51 comments · 13 min read · LW link · 1 review

Book Review: Going Infinite

Zvi · 24 Oct 2023 15:00 UTC
242 points
113 comments · 97 min read · LW link · 1 review
(thezvi.wordpress.com)

Announcing MIRI’s new CEO and leadership team

Gretta Duleba · 10 Oct 2023 19:22 UTC
222 points
52 comments · 3 min read · LW link

Thoughts on responsible scaling policies and regulation

paulfchristiano · 24 Oct 2023 22:21 UTC
220 points
33 comments · 6 min read · LW link

We’re Not Ready: thoughts on “pausing” and responsible scaling policies

HoldenKarnofsky · 27 Oct 2023 15:19 UTC
200 points
33 comments · 8 min read · LW link

Labs should be explicit about why they are building AGI

peterbarnett · 17 Oct 2023 21:09 UTC
196 points
17 comments · 1 min read · LW link

Announcing Timaeus

22 Oct 2023 11:59 UTC
187 points
15 comments · 4 min read · LW link

AI as a science, and three obstacles to alignment strategies

So8res · 25 Oct 2023 21:00 UTC
185 points
80 comments · 11 min read · LW link

Architects of Our Own Demise: We Should Stop Developing AI Carelessly

Roko · 26 Oct 2023 0:36 UTC
176 points
75 comments · 3 min read · LW link

Evaluating the historical value misspecification argument

Matthew Barnett · 5 Oct 2023 18:34 UTC
173 points
151 comments · 7 min read · LW link · 2 reviews

Thomas Kwa’s MIRI research experience

2 Oct 2023 16:42 UTC
172 points
53 comments · 1 min read · LW link

President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence

Tristan Williams · 30 Oct 2023 11:15 UTC
171 points
39 comments · 1 min read · LW link
(www.whitehouse.gov)

RSPs are pauses done right

evhub · 14 Oct 2023 4:06 UTC
164 points
70 comments · 7 min read · LW link

Holly Elmore and Rob Miles dialogue on AI Safety Advocacy

20 Oct 2023 21:04 UTC
162 points
30 comments · 27 min read · LW link

Announcing Dialogues

Ben Pace · 7 Oct 2023 2:57 UTC
155 points
52 comments · 4 min read · LW link

Will no one rid me of this turbulent pest?

Metacelsus · 14 Oct 2023 15:27 UTC
154 points
23 comments · 10 min read · LW link
(denovo.substack.com)

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

sudo · 29 Oct 2023 23:09 UTC
154 points
22 comments · 10 min read · LW link
(nitter.net)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
151 points
29 comments · 14 min read · LW link

At 87, Pearl is still able to change his mind

rotatingpaguro · 18 Oct 2023 4:46 UTC
148 points
15 comments · 5 min read · LW link

Graphical tensor notation for interpretability

Jordan Taylor · 4 Oct 2023 8:04 UTC
140 points
11 comments · 19 min read · LW link

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · 7 Oct 2023 23:30 UTC
137 points
8 comments · 4 min read · LW link

The 99% principle for personal problems

Kaj_Sotala · 2 Oct 2023 8:20 UTC
135 points
20 comments · 2 min read · LW link
(kajsotala.fi)

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · 7 Oct 2023 0:35 UTC
134 points
9 comments · 4 min read · LW link

Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn

Zvi · 5 Oct 2023 11:39 UTC
128 points
29 comments · 9 min read · LW link

Goodhart’s Law in Reinforcement Learning

16 Oct 2023 0:54 UTC
126 points
22 comments · 7 min read · LW link

Responsible Scaling Policies Are Risk Management Done Wrong

simeon_c · 25 Oct 2023 23:46 UTC
122 points
35 comments · 22 min read · LW link · 1 review
(www.navigatingrisks.ai)

Stampy’s AI Safety Info soft launch

5 Oct 2023 22:13 UTC
120 points
9 comments · 2 min read · LW link

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines

307th · 20 Oct 2023 16:37 UTC
119 points
33 comments · 9 min read · LW link

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · 20 Oct 2023 7:32 UTC
119 points
15 comments · 22 min read · LW link

unRLHF—Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

The Witching Hour

Richard_Ngo · 10 Oct 2023 0:19 UTC
113 points
1 comment · 9 min read · LW link
(www.narrativeark.xyz)

A new intro to Quantum Physics, with the math fixed

titotal · 29 Oct 2023 15:11 UTC
113 points
23 comments · 17 min read · LW link
(titotal.substack.com)

Charbel-Raphaël and Lucius discuss interpretability

30 Oct 2023 5:50 UTC
107 points
7 comments · 21 min read · LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
107 points
3 comments · 8 min read · LW link

TOMORROW: the largest AI Safety protest ever!

Holly_Elmore · 20 Oct 2023 18:15 UTC
105 points
26 comments · 2 min read · LW link

Apply for MATS Winter 2023-24!

21 Oct 2023 2:27 UTC
104 points
6 comments · 5 min read · LW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · 27 Oct 2023 19:06 UTC
102 points
48 comments · 13 min read · LW link

What’s up with “Responsible Scaling Policies”?

29 Oct 2023 4:17 UTC
99 points
8 comments · 20 min read · LW link

Sam Altman’s sister, Annie Altman, claims Sam has severely abused her

pythagoras5015 · 7 Oct 2023 21:06 UTC
98 points
107 comments · 192 min read · LW link

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · 30 Oct 2023 14:51 UTC
98 points
5 comments · 20 min read · LW link

Truthseeking when your disagreements lie in moral philosophy

10 Oct 2023 0:00 UTC
98 points
4 comments · 4 min read · LW link
(acesounderglass.com)

What’s Hard About The Shutdown Problem

johnswentworth · 20 Oct 2023 21:13 UTC
98 points
33 comments · 4 min read · LW link

I don’t find the lie detection results that surprising (by an author of the paper)

JanB · 4 Oct 2023 17:10 UTC
97 points
8 comments · 3 min read · LW link

[Question] Lying to chess players for alignment

Zane · 25 Oct 2023 17:47 UTC
96 points
54 comments · 1 min read · LW link

Investigating the learning coefficient of modular addition: hackathon project

17 Oct 2023 19:51 UTC
94 points
5 comments · 12 min read · LW link

Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth · 26 Oct 2023 19:49 UTC
94 points
44 comments · 6 min read · LW link

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · 23 Oct 2023 22:38 UTC
93 points
12 comments · 9 min read · LW link

Trying to understand John Wentworth’s research agenda

20 Oct 2023 0:05 UTC
92 points
13 comments · 12 min read · LW link

Linkpost: They Studied Dishonesty. Was Their Work a Lie?

Linch · 2 Oct 2023 8:10 UTC
91 points
12 comments · 2 min read · LW link
(www.newyorker.com)