Short Notes on Research Process

Shoshannah Tekofsky · 22 Feb 2023 23:41 UTC
21 points
0 comments · 2 min read · LW link

Video/animation: Neel Nanda explains what mechanistic interpretability is

DanielFilan · 22 Feb 2023 22:42 UTC
24 points
7 comments · 1 min read · LW link
(youtu.be)

A Telepathic Exam about AI and Consequentialism

alkexr · 22 Feb 2023 21:00 UTC
4 points
4 comments · 4 min read · LW link

[Question] Injecting noise to GPT to get multiple answers

bipolo · 22 Feb 2023 20:02 UTC
1 point
1 comment · 1 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

Building and Entertaining Couples

Jacob Falkovich · 22 Feb 2023 19:02 UTC
85 points
11 comments · 4 min read · LW link

Can submarines swim?

jasoncrawford · 22 Feb 2023 18:48 UTC
18 points
14 comments · 13 min read · LW link
(rootsofprogress.org)

Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities?

Christopher King · 22 Feb 2023 16:49 UTC
1 point
7 comments · 1 min read · LW link

The male AI alignment solution

TekhneMakre · 22 Feb 2023 16:34 UTC
−25 points
24 comments · 1 min read · LW link

Progress links and tweets, 2023-02-22

jasoncrawford · 22 Feb 2023 16:23 UTC
13 points
0 comments · 1 min read · LW link
(rootsofprogress.org)

Cyborg Periods: There will be multiple AI transitions

22 Feb 2023 16:09 UTC
108 points
9 comments · 6 min read · LW link

The Open Agency Model

Eric Drexler · 22 Feb 2023 10:35 UTC
114 points
18 comments · 4 min read · LW link

Intervening in the Residual Stream

MadHatter · 22 Feb 2023 6:29 UTC
30 points
1 comment · 9 min read · LW link

What do language models know about fictional characters?

skybrian · 22 Feb 2023 5:58 UTC
6 points
0 comments · 4 min read · LW link

Power-Seeking = Minimising free energy

Jonas Hallgren · 22 Feb 2023 4:28 UTC
21 points
10 comments · 7 min read · LW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland · 22 Feb 2023 4:16 UTC
34 points
11 comments · 3 min read · LW link
(www.jessehoogland.com)

Candyland is Terrible

jefftk · 22 Feb 2023 1:50 UTC
16 points
2 comments · 1 min read · LW link
(www.jefftk.com)

A proof of inner Löb’s theorem

James Payor · 21 Feb 2023 21:11 UTC
13 points
0 comments · 2 min read · LW link

Fighting For Our Lives—What Ordinary People Can Do

TinkerBird · 21 Feb 2023 20:36 UTC
12 points
18 comments · 4 min read · LW link

The Emotional Type of a Decision

moridinamael · 21 Feb 2023 20:35 UTC
13 points
0 comments · 4 min read · LW link

What is it like doing AI safety work?

KatWoods · 21 Feb 2023 20:12 UTC
57 points
2 comments · 1 min read · LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
134 points
19 comments · 11 min read · LW link

A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation)

Joe Carlsmith · 21 Feb 2023 17:26 UTC
38 points
16 comments · 1 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
4 comments · 3 min read · LW link

No Room for Political Philosophy

Arturo Macias · 21 Feb 2023 16:11 UTC
0 points
7 comments · 3 min read · LW link

Deceptive Alignment is <1% Likely by Default

DavidW · 21 Feb 2023 15:09 UTC
90 points
29 comments · 14 min read · LW link

AI #1: Sydney and Bing

Zvi · 21 Feb 2023 14:00 UTC
171 points
45 comments · 61 min read · LW link · 1 review
(thezvi.wordpress.com)

You’re not a simulation, ’cause you’re hallucinating

Stuart_Armstrong · 21 Feb 2023 12:12 UTC
25 points
6 comments · 1 min read · LW link

Basic facts about language models during training

beren · 21 Feb 2023 11:46 UTC
97 points
15 comments · 18 min read · LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio · 21 Feb 2023 11:44 UTC
12 points
0 comments · 1 min read · LW link
(arxiv.org)

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley · 21 Feb 2023 9:05 UTC
10 points
1 comment · 23 min read · LW link

Medlife Crisis: “Why Do People Keep Falling For Things That Don’t Work?”

RomanHauksson · 21 Feb 2023 6:22 UTC
12 points
5 comments · 1 min read · LW link
(www.youtube.com)

A foundation model approach to value inference

sen · 21 Feb 2023 5:09 UTC
6 points
0 comments · 3 min read · LW link

Instrumentality makes agents agenty

porby · 21 Feb 2023 4:28 UTC
20 points
4 comments · 6 min read · LW link

Gamified narrow reverse imitation learning

TekhneMakre · 21 Feb 2023 4:26 UTC
8 points
0 comments · 2 min read · LW link

Feelings are Good, Actually

Gordon Seidoh Worley · 21 Feb 2023 2:38 UTC
18 points
1 comment · 4 min read · LW link

AI alignment researchers don’t (seem to) stack

So8res · 21 Feb 2023 0:48 UTC
191 points
40 comments · 3 min read · LW link

EA & LW Forum Weekly Summary (6th − 19th Feb 2023)

Zoe Williams · 21 Feb 2023 0:26 UTC
8 points
0 comments · 1 min read · LW link

What to think when a language model tells you it’s sentient

Robbo · 21 Feb 2023 0:01 UTC
9 points
6 comments · 6 min read · LW link

On second thought, prompt injections are probably examples of misalignment

lc · 20 Feb 2023 23:56 UTC
22 points
5 comments · 1 min read · LW link

Nothing Is Ever Taught Correctly

LVSN · 20 Feb 2023 22:31 UTC
5 points
3 comments · 1 min read · LW link

Behavioral and mechanistic definitions (often confuse AI alignment discussions)

LawrenceC · 20 Feb 2023 21:33 UTC
33 points
5 comments · 6 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · 20 Feb 2023 21:32 UTC
14 points
1 comment · 4 min read · LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
145 points
127 comments · 19 min read · LW link · 1 review

[Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience?

Malleable_shape · 20 Feb 2023 19:57 UTC
7 points
2 comments · 1 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
96 points
8 comments · 21 min read · LW link

Sydney the Bingenator Can’t Think, But It Still Threatens People

Valentin Baltadzhiev · 20 Feb 2023 18:37 UTC
−3 points
2 comments · 8 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
8 comments · 8 min read · LW link

What AI companies can do today to help with the most important century

HoldenKarnofsky · 20 Feb 2023 17:00 UTC
38 points
3 comments · 9 min read · LW link
(www.cold-takes.com)

Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky

bayesed · 20 Feb 2023 16:42 UTC
83 points
54 comments · 1 min read · LW link
(www.youtube.com)