
David Scott Krueger (formerly: capybaralet)

Karma: 1,959

I’m more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge’s Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

A Sober Look at Steering Vectors for LLMs

23 Nov 2024 17:30 UTC
9 points
0 comments · 5 min read · LW link

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet)4 Sep 2024 12:40 UTC
17 points
7 comments · 1 min read · LW link

An ML paper on data stealing provides a construction for “gradient hacking”

David Scott Krueger (formerly: capybaralet)30 Jul 2024 21:44 UTC
21 points
1 comment · 1 min read · LW link
(arxiv.org)

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet)6 Jun 2024 18:55 UTC
70 points
2 comments · 6 min read · LW link
(llm-safety-challenges.github.io)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.

David Scott Krueger (formerly: capybaralet)24 Nov 2023 23:35 UTC
25 points
2 comments · 2 min read · LW link

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet)18 Mar 2023 19:01 UTC
99 points
48 comments · 1 min read · LW link

[Question] What organizations other than Conjecture have (esp. public) info-hazard policies?

David Scott Krueger (formerly: capybaralet)16 Mar 2023 14:49 UTC
20 points
1 comment · 1 min read · LW link

A (EtA: quick) note on terminology: AI Alignment != AI x-safety

David Scott Krueger (formerly: capybaralet)8 Feb 2023 22:33 UTC
46 points
20 comments · 1 min read · LW link

Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)

David Scott Krueger (formerly: capybaralet)30 Jan 2023 18:50 UTC
32 points
41 comments · 2 min read · LW link

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet)25 Jan 2023 12:55 UTC
27 points
9 comments · 2 min read · LW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet)3 Nov 2022 23:19 UTC
28 points
3 comments · 1 min read · LW link

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet)31 Oct 2022 21:26 UTC
48 points
25 comments · 2 min read · LW link

[Question] I’m planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on?

David Scott Krueger (formerly: capybaralet)24 Sep 2022 12:38 UTC
9 points
10 comments · 1 min read · LW link

[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

David Scott Krueger (formerly: capybaralet)8 Sep 2022 22:28 UTC
47 points
1 comment · 5 min read · LW link

An Update on Academia vs. Industry (one year into my faculty job)

David Scott Krueger (formerly: capybaralet)3 Sep 2022 20:43 UTC
122 points
18 comments · 4 min read · LW link

Causal confusion as an argument against the scaling hypothesis

20 Jun 2022 10:54 UTC
86 points
30 comments · 15 min read · LW link

[Question] Do FDT (or similar) recommend reparations?

David Scott Krueger (formerly: capybaralet)29 Apr 2022 17:34 UTC
13 points
3 comments · 1 min read · LW link

[Question] What’s a good probability distribution family (e.g. “log-normal”) to use for AGI timelines?

David Scott Krueger (formerly: capybaralet)13 Apr 2022 4:45 UTC
9 points
11 comments · 1 min read · LW link

[Question] Is “gears-level” just a synonym for “mechanistic”?

David Scott Krueger (formerly: capybaralet)13 Dec 2021 4:11 UTC
48 points
29 comments · 1 min read · LW link

[Question] Is there a name for the theory that “There will be fast takeoff in real-world capabilities because almost everything is AGI-complete”?

David Scott Krueger (formerly: capybaralet)2 Sep 2021 23:00 UTC
31 points
8 comments · 1 min read · LW link