align your latent spaces

bhauth · 24 Dec 2023 16:30 UTC
27 points
8 comments · 2 min read · LW link
(www.bhauth.com)

Viral Guessing Game

jefftk · 24 Dec 2023 13:10 UTC
19 points
0 comments · 1 min read · LW link
(www.jefftk.com)

The Sugar Alignment Problem

Adam Zerner · 24 Dec 2023 1:35 UTC
5 points
3 comments · 7 min read · LW link

A Crisper Explanation of Simulacrum Levels

Thane Ruthenis · 23 Dec 2023 22:13 UTC
89 points
13 comments · 13 min read · LW link

Hyperbolic Discounting and Pascal’s Mugging

Andrew Keenan Richardson · 23 Dec 2023 21:55 UTC
9 points
0 comments · 7 min read · LW link

AISN #28: Center for AI Safety 2023 Year in Review

23 Dec 2023 21:31 UTC
30 points
1 comment · 5 min read · LW link
(newsletter.safe.ai)

“Inftoxicity” and other new words to describe malicious information and communication thereof

Jáchym Fibír · 23 Dec 2023 18:15 UTC
−1 points
6 comments · 3 min read · LW link

AI’s impact on biology research: Part I, today

octopocta · 23 Dec 2023 16:29 UTC
31 points
6 comments · 2 min read · LW link

AI Girlfriends Won’t Matter Much

Maxwell Tabarrok · 23 Dec 2023 15:58 UTC
42 points
22 comments · 2 min read · LW link
(maximumprogress.substack.com)

The Next Right Token

jefftk · 23 Dec 2023 3:20 UTC
14 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments · 4 min read · LW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments · 9 min read · LW link

Fact Finding: Trying to Mechanistically Understanding Early MLPs (Post 3)

23 Dec 2023 2:46 UTC
10 points
0 comments · 16 min read · LW link

Fact Finding: Simplifying the Circuit (Post 2)

23 Dec 2023 2:45 UTC
25 points
3 comments · 14 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
108 points
9 comments · 22 min read · LW link · 1 review

Measurement tampering detection as a special case of weak-to-strong generalization

23 Dec 2023 0:05 UTC
57 points
10 comments · 4 min read · LW link

How does a toy 2 digit subtraction transformer predict the difference?

Evan Anders · 22 Dec 2023 21:17 UTC
12 points
0 comments · 10 min read · LW link
(evanhanders.blog)

Thoughts on Max Tegmark’s AI verification

Johannes C. Mayer · 22 Dec 2023 20:38 UTC
10 points
0 comments · 3 min read · LW link

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis · 22 Dec 2023 20:19 UTC
74 points
14 comments · 6 min read · LW link

AI safety advocates should consider providing gentle pushback following the events at OpenAI

civilsociety · 22 Dec 2023 18:55 UTC
16 points
5 comments · 3 min read · LW link

“Destroy humanity” as an immediate subgoal

Seth Ahrenbach · 22 Dec 2023 18:52 UTC
3 points
13 comments · 3 min read · LW link

Synthetic Restrictions

nano_brasca · 22 Dec 2023 18:50 UTC
10 points
0 comments · 4 min read · LW link

Review Report of Davidson on Takeoff Speeds (2023)

Trent Kannegieter · 22 Dec 2023 18:48 UTC
37 points
11 comments · 38 min read · LW link

The problems with the concept of an infohazard as used by the LW community [Linkpost]

Noosphere89 · 22 Dec 2023 16:13 UTC
75 points
43 comments · 3 min read · LW link
(www.beren.io)

Employee Incentives Make AGI Lab Pauses More Costly

Nikola Jurkovic · 22 Dec 2023 5:04 UTC
28 points
12 comments · 3 min read · LW link

The LessWrong 2022 Review: Review Phase

RobertM · 22 Dec 2023 3:23 UTC
58 points
7 comments · 2 min read · LW link

The absence of self-rejection is self-acceptance

Chipmonk · 21 Dec 2023 21:54 UTC
24 points
1 comment · 1 min read · LW link
(chipmonk.substack.com)

A Decision Theory Can Be Rational or Computable, but Not Both

StrivingForLegibility · 21 Dec 2023 21:02 UTC
9 points
4 comments · 1 min read · LW link

Most People Don’t Realize We Have No Idea How Our AIs Work

Thane Ruthenis · 21 Dec 2023 20:02 UTC
159 points
42 comments · 1 min read · LW link

Pseudonymity and Accusations

jefftk · 21 Dec 2023 19:20 UTC
52 points
20 comments · 3 min read · LW link
(www.jefftk.com)

Attention on AI X-Risk Likely Hasn’t Distracted from Current Harms from AI

Erich_Grunewald · 21 Dec 2023 17:24 UTC
26 points
2 comments · 17 min read · LW link
(www.erichgrunewald.com)

“Alignment” is one of six words of the year in the Harvard Gazette

Nikola Jurkovic · 21 Dec 2023 15:54 UTC
14 points
1 comment · 1 min read · LW link
(news.harvard.edu)

AI #43: Functional Discoveries

Zvi · 21 Dec 2023 15:50 UTC
52 points
26 comments · 49 min read · LW link
(thezvi.wordpress.com)

Rating my AI Predictions

Robert_AIZI · 21 Dec 2023 14:07 UTC
22 points
5 comments · 2 min read · LW link
(aizi.substack.com)

AI Safety Chatbot

21 Dec 2023 14:06 UTC
61 points
11 comments · 4 min read · LW link

On OpenAI’s Preparedness Framework

Zvi · 21 Dec 2023 14:00 UTC
51 points
4 comments · 21 min read · LW link
(thezvi.wordpress.com)

Prediction Markets aren’t Magic

SimonM · 21 Dec 2023 12:54 UTC
90 points
29 comments · 3 min read · LW link

[Question] Why is capnometry biofeedback not more widely known?

riceissa · 21 Dec 2023 2:42 UTC
20 points
22 comments · 4 min read · LW link

My best guess at the important tricks for training 1L SAEs

Arthur Conmy · 21 Dec 2023 1:59 UTC
37 points
4 comments · 3 min read · LW link

Seattle Winter Solstice

a7x · 20 Dec 2023 20:30 UTC
6 points
1 comment · 1 min read · LW link

How Would an Utopia-Maximizer Look Like?

Thane Ruthenis · 20 Dec 2023 20:01 UTC
31 points
23 comments · 10 min read · LW link

Succession

Richard_Ngo · 20 Dec 2023 19:25 UTC
159 points
48 comments · 11 min read · LW link
(www.narrativeark.xyz)

Metaculus Introduces Multiple Choice Questions

ChristianWilliams · 20 Dec 2023 19:00 UTC
4 points
0 comments · 1 min read · LW link
(www.metaculus.com)

Brighter Than Today Versions

jefftk · 20 Dec 2023 18:20 UTC
16 points
2 comments · 2 min read · LW link
(www.jefftk.com)

Gaia Network: a practical, incremental pathway to Open Agency Architecture

20 Dec 2023 17:11 UTC
22 points
8 comments · 16 min read · LW link

On the future of language models

owencb · 20 Dec 2023 16:58 UTC
105 points
17 comments · 1 min read · LW link

[Valence series] Appendix A: Hedonic tone / (dis)pleasure / (dis)liking

Steven Byrnes · 20 Dec 2023 15:54 UTC
18 points
0 comments · 13 min read · LW link

Matrix completion prize results

paulfchristiano · 20 Dec 2023 15:40 UTC
41 points
0 comments · 2 min read · LW link
(www.alignment.org)

[Question] What’s the minimal additive constant for Kolmogorov Complexity that a programming language can achieve?

Noosphere89 · 20 Dec 2023 15:36 UTC
11 points
15 comments · 1 min read · LW link

Legalize butanol?

bhauth · 20 Dec 2023 14:24 UTC
39 points
20 comments · 5 min read · LW link
(www.bhauth.com)