
David Scott Krueger (formerly: capybaralet)

Karma: 2,176

I’m more active on Twitter than LW/AF these days: https://twitter.com/DavidSKrueger

Bio from https://www.davidscottkrueger.com/:
I am an Assistant Professor at the University of Cambridge and a member of Cambridge’s Computational and Biological Learning lab (CBL). My research group focuses on Deep Learning, AI Alignment, and AI safety. I’m broadly interested in work (including in areas outside of Machine Learning, e.g. AI governance) that could reduce the risk of human extinction (“x-risk”) resulting from out-of-control AI systems. Particular interests include:

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Jan 30, 2025, 5:03 PM
159 points
52 comments · 2 min read · LW link
(gradual-disempowerment.ai)

A Sober Look at Steering Vectors for LLMs

Nov 23, 2024, 5:30 PM
38 points
0 comments · 5 min read · LW link

[Question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?

David Scott Krueger (formerly: capybaralet) · Sep 4, 2024, 12:40 PM
19 points
7 comments · 1 min read · LW link

An ML paper on data stealing provides a construction for “gradient hacking”

David Scott Krueger (formerly: capybaralet) · Jul 30, 2024, 9:44 PM
21 points
1 comment · 1 min read · LW link
(arxiv.org)

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet) · Jun 6, 2024, 6:55 PM
70 points
2 comments · 6 min read · LW link
(llm-safety-challenges.github.io)

Testing for consequence-blindness in LLMs using the HI-ADS unit test.

David Scott Krueger (formerly: capybaralet) · Nov 24, 2023, 11:35 PM
25 points
2 comments · 2 min read · LW link

“Publish or Perish” (a quick note on why you should try to make your work legible to existing academic communities)

David Scott Krueger (formerly: capybaralet) · Mar 18, 2023, 7:01 PM
112 points
49 comments · 1 min read · LW link · 1 review

[Question] What organizations other than Conjecture have (esp. public) info-hazard policies?

David Scott Krueger (formerly: capybaralet) · Mar 16, 2023, 2:49 PM
20 points
1 comment · 1 min read · LW link

A (EtA: quick) note on terminology: AI Alignment != AI x-safety

David Scott Krueger (formerly: capybaralet) · Feb 8, 2023, 10:33 PM
46 points
20 comments · 1 min read · LW link

Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)

David Scott Krueger (formerly: capybaralet) · Jan 30, 2023, 6:50 PM
34 points
41 comments · 2 min read · LW link

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet) · Jan 25, 2023, 12:55 PM
27 points
9 comments · 2 min read · LW link

Mechanistic Interpretability as Reverse Engineering (follow-up to “cars and elephants”)

David Scott Krueger (formerly: capybaralet) · Nov 3, 2022, 11:19 PM
28 points
3 comments · 1 min read · LW link

“Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

David Scott Krueger (formerly: capybaralet) · Oct 31, 2022, 9:26 PM
48 points
25 comments · 2 min read · LW link

[Question] I’m planning to start creating more write-ups summarizing my thoughts on various issues, mostly related to AI existential safety. What do you want to hear my nuanced takes on?

David Scott Krueger (formerly: capybaralet) · Sep 24, 2022, 12:38 PM
9 points
10 comments · 1 min read · LW link

[An email with a bunch of links I sent an experienced ML researcher interested in learning about Alignment / x-safety.]

David Scott Krueger (formerly: capybaralet) · Sep 8, 2022, 10:28 PM
47 points
1 comment · 5 min read · LW link

An Update on Academia vs. Industry (one year into my faculty job)

David Scott Krueger (formerly: capybaralet) · Sep 3, 2022, 8:43 PM
122 points
18 comments · 4 min read · LW link

Causal confusion as an argument against the scaling hypothesis

Jun 20, 2022, 10:54 AM
86 points
30 comments · 15 min read · LW link

[Question] Do FDT (or similar) recommend reparations?

David Scott Krueger (formerly: capybaralet) · Apr 29, 2022, 5:34 PM
13 points
3 comments · 1 min read · LW link

[Question] What’s a good probability distribution family (e.g. “log-normal”) to use for AGI timelines?

David Scott Krueger (formerly: capybaralet) · Apr 13, 2022, 4:45 AM
9 points
11 comments · 1 min read · LW link

[Question] Is “gears-level” just a synonym for “mechanistic”?

David Scott Krueger (formerly: capybaralet) · Dec 13, 2021, 4:11 AM
48 points
29 comments · 1 min read · LW link