Freedom and Privacy of Thought Architectures

JohnBuridan 20 Jul 2024 21:43 UTC
5 points
2 comments 1 min read LW link

Introduction to Modern Dating: Strategic Dating Advice for beginners

Jesper Lindholm 20 Jul 2024 15:45 UTC
5 points
6 comments 13 min read LW link

Why Georgism Lost Its Popularity

Zero Contradictions 20 Jul 2024 15:08 UTC
43 points
50 comments 1 min read LW link
(zerocontradictions.net)

Only Fools Avoid Hindsight Bias

Kevin Dorst 20 Jul 2024 13:42 UTC
−11 points
5 comments 6 min read LW link
(kevindorst.substack.com)

A more systematic case for inner misalignment

Richard_Ngo 20 Jul 2024 5:03 UTC
31 points
4 comments 5 min read LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
52 points
0 comments 4 min read LW link

Krona Compare

jefftk 20 Jul 2024 1:10 UTC
10 points
0 comments 2 min read LW link
(www.jefftk.com)

(Approximately) Deterministic Natural Latents

19 Jul 2024 23:02 UTC
41 points
0 comments 4 min read LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments 16 min read LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
48 points
10 comments 1 min read LW link
(storage.googleapis.com)

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger 19 Jul 2024 14:07 UTC
24 points
3 comments 2 min read LW link
(arxiv.org)

Sustainability of Digital Life Form Societies

Hiroshi Yamakawa 19 Jul 2024 13:59 UTC
19 points
1 comment 20 min read LW link

Romae Industriae

Maxwell Tabarrok 19 Jul 2024 13:03 UTC
34 points
2 comments 7 min read LW link
(www.maximum-progress.com)

[Question] Have people given up on iterated distillation and amplification?

Chris_Leong 19 Jul 2024 12:23 UTC
20 points
1 comment 1 min read LW link

How do we know that “good research” is good? (aka “direct evaluation” vs “eigen-evaluation”)

Ruby 19 Jul 2024 0:31 UTC
47 points
21 comments 6 min read LW link

Linkpost: Surely you can be serious

kave 18 Jul 2024 22:18 UTC
59 points
8 comments 1 min read LW link
(www.experimental-history.com)

My experience applying to MATS 6.0

mic 18 Jul 2024 19:02 UTC
16 points
3 comments 5 min read LW link

[Question] What are the actual arguments in favor of computationalism as a theory of identity?

sunwillrise 18 Jul 2024 18:44 UTC
12 points
24 comments 5 min read LW link

Yet Another Critique of “Luxury Beliefs”

ymeskhout 18 Jul 2024 18:37 UTC
6 points
10 comments 9 min read LW link
(www.ymeskhout.com)

[Interim research report] Evaluating the Goal-Directedness of Language Models

18 Jul 2024 18:19 UTC
39 points
4 comments 11 min read LW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

18 Jul 2024 17:02 UTC
9 points
0 comments 1 min read LW link
(arxiv.org)

Activation Engineering Theories of Impact

kubanetics 18 Jul 2024 16:44 UTC
6 points
1 comment 2 min read LW link

[Question] Me & My Clone

SimonBaars 18 Jul 2024 16:25 UTC
27 points
22 comments 1 min read LW link

AI #73: Openly Evil AI

Zvi 18 Jul 2024 14:40 UTC
89 points
20 comments 52 min read LW link
(thezvi.wordpress.com)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
117 points
18 comments 18 min read LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
65 points
0 comments 10 min read LW link

[Question] Should we exclude alignment research from LLM training datasets?

Ben Millwood 18 Jul 2024 10:27 UTC
1 point
1 comment 1 min read LW link

Keeping content out of LLM training datasets

Ben Millwood 18 Jul 2024 10:27 UTC
3 points
0 comments 5 min read LW link

The Assassination of Trump’s Ear is Evidence for Time-Travel

elv 18 Jul 2024 7:01 UTC
−9 points
5 comments 5 min read LW link

Friendship is transactional, unconditional friendship is insurance

Ruby 17 Jul 2024 22:52 UTC
66 points
24 comments 2 min read LW link

D&D.Sci: Whom Shall You Call? [Evaluation and Ruleset]

abstractapplic 17 Jul 2024 22:34 UTC
17 points
5 comments 5 min read LW link

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon 17 Jul 2024 22:14 UTC
193 points
46 comments 7 min read LW link

Baking vs Patissing vs Cooking, the HPS explanation

adamShimi 17 Jul 2024 20:29 UTC
30 points
16 comments 3 min read LW link
(epistemologicalfascinations.substack.com)

Launching the Respiratory Outlook 2024/25 Forecasting Series

ChristianWilliams 17 Jul 2024 19:51 UTC
5 points
0 comments 1 min read LW link
(www.metaculus.com)

What are you getting paid in?

Austin Chen 17 Jul 2024 19:23 UTC
84 points
14 comments 4 min read LW link
(www.approachwithalacrity.com)

Individually incentivized safe Pareto improvements in open-source bargaining

17 Jul 2024 18:26 UTC
39 points
2 comments 17 min read LW link

Profit and Value

kwang 17 Jul 2024 18:06 UTC
22 points
3 comments 6 min read LW link
(open.substack.com)

So You’ve Learned To Teleport by Tom Scott

landscape_kiwi 17 Jul 2024 18:04 UTC
4 points
0 comments 1 min read LW link
(www.youtube.com)

How does generalized accessibility compare to targeted accessibility?

ErioirE 17 Jul 2024 17:07 UTC
3 points
0 comments 2 min read LW link

Housing Roundup #9: Restricting Supply

Zvi 17 Jul 2024 12:50 UTC
25 points
8 comments 44 min read LW link
(thezvi.wordpress.com)

We ran an AI safety conference in Tokyo. It went really well. Come next year!

Blaine 17 Jul 2024 6:55 UTC
45 points
1 comment 6 min read LW link

Agency in Politics

Martin Sustrik 17 Jul 2024 5:30 UTC
35 points
2 comments 3 min read LW link
(250bpm.substack.com)

Ar­rakis—A toolkit to con­duct, track and vi­su­al­ize mechanis­tic in­ter­pretabil­ity ex­per­i­ments.

Yash Srivastava 17 Jul 2024 2:02 UTC
2 points
2 comments 5 min read LW link

Announcing Open Philanthropy’s AI governance and policy RFP

Julian Hazell 17 Jul 2024 2:02 UTC
25 points
0 comments 1 min read LW link
(www.openphilanthropy.org)

Turning Your Back On Traffic

jefftk 17 Jul 2024 1:00 UTC
37 points
7 comments 1 min read LW link
(www.jefftk.com)

[Question] Opinions on Eureka Labs

jmh 17 Jul 2024 0:16 UTC
6 points
2 comments 1 min read LW link

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson 16 Jul 2024 22:44 UTC
44 points
27 comments 5 min read LW link

Multiplex Gene Editing: Where Are We Now?

sarahconstantin 16 Jul 2024 20:50 UTC
69 points
6 comments 7 min read LW link
(sarahconstantin.substack.com)

Recursion in AI is scary. But let’s talk solutions.

Oleg Trott 16 Jul 2024 20:34 UTC
3 points
10 comments 2 min read LW link

How to wash your hands precisely and thoroughly

dkl9 16 Jul 2024 18:29 UTC
12 points
0 comments 1 min read LW link
(dkl9.net)