Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Zac Hatfield-Dodds · Oct 5, 2023, 9:01 PM
288 points
22 comments · 2 min read · 1 review
(transformer-circuits.pub)

Alignment Implications of LLM Successes: a Debate in One Act

Zack_M_Davis · Oct 21, 2023, 3:22 PM
265 points
56 comments · 13 min read · 2 reviews

Book Review: Going Infinite

Zvi · Oct 24, 2023, 3:00 PM
242 points
113 comments · 97 min read · 1 review
(thezvi.wordpress.com)

Announcing MIRI’s new CEO and leadership team

Gretta Duleba · Oct 10, 2023, 7:22 PM
222 points
52 comments · 3 min read

Thoughts on responsible scaling policies and regulation

paulfchristiano · Oct 24, 2023, 10:21 PM
221 points
33 comments · 6 min read

Labs should be explicit about why they are building AGI

peterbarnett · Oct 17, 2023, 9:09 PM
210 points
18 comments · 1 min read · 1 review

We’re Not Ready: thoughts on “pausing” and responsible scaling policies

HoldenKarnofsky · Oct 27, 2023, 3:19 PM
200 points
33 comments · 8 min read

Comp Sci in 2027 (Short story by Eliezer Yudkowsky)

sudo · Oct 29, 2023, 11:09 PM
196 points
24 comments · 10 min read · 1 review
(nitter.net)

Evaluating the historical value misspecification argument

Matthew Barnett · Oct 5, 2023, 6:34 PM
190 points
162 comments · 7 min read · 3 reviews

Announcing Timaeus

Oct 22, 2023, 11:59 AM
188 points
15 comments · 4 min read

AI as a science, and three obstacles to alignment strategies

So8res · Oct 25, 2023, 9:00 PM
187 points
80 comments · 11 min read

Thomas Kwa’s MIRI research experience

Oct 2, 2023, 4:42 PM
173 points
53 comments · 1 min read

President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence

Tristan Williams · Oct 30, 2023, 11:15 AM
171 points
39 comments
(www.whitehouse.gov)

Architects of Our Own Demise: We Should Stop Developing AI Carelessly

Roko · Oct 26, 2023, 12:36 AM
170 points
75 comments · 3 min read

RSPs are pauses done right

evhub · Oct 14, 2023, 4:06 AM
164 points
73 comments · 7 min read · 1 review

Holly Elmore and Rob Miles dialogue on AI Safety Advocacy

Oct 20, 2023, 9:04 PM
162 points
30 comments · 27 min read

Announcing Dialogues

Ben Pace · Oct 7, 2023, 2:57 AM
155 points
59 comments · 4 min read

Will no one rid me of this turbulent pest?

Metacelsus · Oct 14, 2023, 3:27 PM
154 points
23 comments · 10 min read
(denovo.substack.com)

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Oct 12, 2023, 7:58 PM
151 points
29 comments · 14 min read

At 87, Pearl is still able to change his mind

rotatingpaguro · Oct 18, 2023, 4:46 AM
148 points
15 comments · 5 min read

Graphical tensor notation for interpretability

Jordan Taylor · Oct 4, 2023, 8:04 AM
141 points
11 comments · 19 min read

The 99% principle for personal problems

Kaj_Sotala · Oct 2, 2023, 8:20 AM
139 points
20 comments · 2 min read
(kajsotala.fi)

Comparing Anthropic’s Dictionary Learning to Ours

Robert_AIZI · Oct 7, 2023, 11:30 PM
137 points
8 comments · 4 min read

Don’t Dismiss Simple Alignment Approaches

Chris_Leong · Oct 7, 2023, 12:35 AM
137 points
9 comments · 4 min read

Response to Quintin Pope’s Evolution Provides No Evidence For the Sharp Left Turn

Zvi · Oct 5, 2023, 11:39 AM
129 points
29 comments · 9 min read

Goodhart’s Law in Reinforcement Learning

Oct 16, 2023, 12:54 AM
126 points
22 comments · 7 min read

Responsible Scaling Policies Are Risk Management Done Wrong

simeon_c · Oct 25, 2023, 11:46 PM
123 points
35 comments · 22 min read · 1 review
(www.navigatingrisks.ai)

I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines

307th · Oct 20, 2023, 4:37 PM
122 points
33 comments · 9 min read

Stampy’s AI Safety Info soft launch

Oct 5, 2023, 10:13 PM
120 points
9 comments · 2 min read

Revealing Intentionality In Language Models Through AdaVAE Guided Sampling

jdp · Oct 20, 2023, 7:32 AM
119 points
15 comments · 22 min read

unRLHF—Efficiently undoing LLM safeguards

Oct 12, 2023, 7:58 PM
117 points
15 comments · 20 min read

A new intro to Quantum Physics, with the math fixed

titotal · Oct 29, 2023, 3:11 PM
113 points
24 comments · 17 min read
(titotal.substack.com)

The Witching Hour

Richard_Ngo · Oct 10, 2023, 12:19 AM
113 points
1 comment · 9 min read
(www.narrativeark.xyz)

Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth · Oct 26, 2023, 7:49 PM
112 points
49 comments · 6 min read · 1 review

Improving the Welfare of AIs: A Nearcasted Proposal

ryan_greenblatt · Oct 30, 2023, 2:51 PM
112 points
7 comments · 20 min read · 1 review

Charbel-Raphaël and Lucius discuss interpretability

Oct 30, 2023, 5:50 AM
111 points
7 comments · 21 min read

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Oct 23, 2023, 4:37 PM
107 points
3 comments · 8 min read

TOMORROW: the largest AI Safety protest ever!

Holly_Elmore · Oct 20, 2023, 6:15 PM
105 points
26 comments · 2 min read

Apply for MATS Winter 2023-24!

Oct 21, 2023, 2:27 AM
104 points
6 comments · 5 min read

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · Oct 27, 2023, 7:06 PM
102 points
49 comments · 13 min read

What’s up with “Responsible Scaling Policies”?

Oct 29, 2023, 4:17 AM
99 points
9 comments · 20 min read · 1 review

Truthseeking when your disagreements lie in moral philosophy

Oct 10, 2023, 12:00 AM
99 points
4 comments · 4 min read
(acesounderglass.com)

What’s Hard About The Shutdown Problem

johnswentworth · Oct 20, 2023, 9:13 PM
98 points
33 comments · 4 min read

[Question] Lying to chess players for alignment

Zane · Oct 25, 2023, 5:47 PM
97 points
54 comments · 1 min read

I don’t find the lie detection results that surprising (by an author of the paper)

JanB · Oct 4, 2023, 5:10 PM
97 points
8 comments · 3 min read

Investigating the learning coefficient of modular addition: hackathon project

Oct 17, 2023, 7:51 PM
94 points
5 comments · 12 min read

Sam Altman’s sister claims Sam sexually abused her—Part 1: Introduction, outline, author’s notes

pythagoras5015 · Oct 7, 2023, 9:06 PM
94 points
108 comments · 8 min read

Trying to understand John Wentworth’s research agenda

Oct 20, 2023, 12:05 AM
93 points
13 comments · 12 min read

You’re Measuring Model Complexity Wrong

Oct 11, 2023, 11:46 AM
93 points
17 comments · 13 min read

Open Source Replication & Commentary on Anthropic’s Dictionary Learning Paper

Neel Nanda · Oct 23, 2023, 10:38 PM
93 points
12 comments · 9 min read