
David Scott Krueger (formerly: capybaralet)

Karma: 2,176

I’m more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge’s Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Jan 30, 2025, 5:03 PM
159 points
52 comments · 2 min read · LW link
(gradual-disempowerment.ai)

A Sober Look at Steering Vectors for LLMs

Nov 23, 2024, 5:30 PM
38 points
0 comments · 5 min read · LW link

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet) · Sep 4, 2024, 12:40 PM
19 points
7 comments · 1 min read · LW link

An ML paper on data stealing provides a construction for “gradient hacking”

David Scott Krueger (formerly: capybaralet) · Jul 30, 2024, 9:44 PM
21 points
1 comment · 1 min read · LW link
(arxiv.org)

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet) · Jun 6, 2024, 6:55 PM
70 points
2 comments · 6 min read · LW link
(llm-safety-challenges.github.io)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.

David Scott Krueger (formerly: capybaralet) · Nov 24, 2023, 11:35 PM
25 points
2 comments · 2 min read · LW link

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet) · Mar 18, 2023, 7:01 PM
112 points
49 comments · 1 min read · LW link · 1 review

[Question] What organizations other than Conjecture have (esp. public) info-hazard policies?

David Scott Krueger (formerly: capybaralet) · Mar 16, 2023, 2:49 PM
20 points
1 comment · 1 min read · LW link

A (EtA: quick) note on terminology: AI Alignment != AI x-safety

David Scott Krueger (formerly: capybaralet) · Feb 8, 2023, 10:33 PM
46 points
20 comments · 1 min read · LW link

Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)

David Scott Krueger (formerly: capybaralet) · Jan 30, 2023, 6:50 PM
34 points
41 comments · 2 min read · LW link

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet) · Jan 25, 2023, 12:55 PM
27 points
9 comments · 2 min read · LW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet) · Nov 3, 2022, 11:19 PM
28 points
3 comments · 1 min read · LW link

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet) · Oct 31, 2022, 9:26 PM
48 points
25 comments · 2 min read · LW link

[Question] I’m planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on?

David Scott Krueger (formerly: capybaralet) · Sep 24, 2022, 12:38 PM
9 points
10 comments · 1 min read · LW link

[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

David Scott Krueger (formerly: capybaralet) · Sep 8, 2022, 10:28 PM
47 points
1 comment · 5 min read · LW link

An Update on Academia vs. Industry (one year into my faculty job)

David Scott Krueger (formerly: capybaralet) · Sep 3, 2022, 8:43 PM
122 points
18 comments · 4 min read · LW link

Causal confusion as an argument against the scaling hypothesis

Jun 20, 2022, 10:54 AM
86 points
30 comments · 15 min read · LW link

[Question] Do FDT (or similar) recommend reparations?

David Scott Krueger (formerly: capybaralet) · Apr 29, 2022, 5:34 PM
13 points
3 comments · 1 min read · LW link

[Question] What’s a good probability distribution family (e.g. “log-normal”) to use for AGI timelines?

David Scott Krueger (formerly: capybaralet) · Apr 13, 2022, 4:45 AM
9 points
11 comments · 1 min read · LW link

[Question] Is “gears-level” just a synonym for “mechanistic”?

David Scott Krueger (formerly: capybaralet) · Dec 13, 2021, 4:11 AM
48 points
29 comments · 1 min read · LW link