Gradient Hacking

Tag. Last edit: Aug 27, 2022, 6:12 PM by Multicore

Gradient hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, typically in order to preserve its own mesa-objective in future iterations of the AI.

Gradient hacking

evhub

Oct 16, 2019, 12:53 AM

107 points

39 comments, 3 min read, 2 reviews

Challenge: construct a Gradient Hacker

Thomas Larsen and Thomas Kwa

Mar 9, 2023, 2:38 AM

39 points

10 comments, 1 min read

Some real examples of gradient hacking

Oliver Sourbut

Nov 22, 2021, 12:11 AM

15 points

8 comments, 2 min read

Gradient Filtering

Jozdien and janus

Jan 18, 2023, 8:09 PM

56 points

16 comments, 13 min read

Gradient hacking is extremely difficult

beren

Jan 24, 2023, 3:45 PM

164 points

22 comments, 5 min read

Approaches to gradient hacking

adamShimi

Aug 14, 2021, 3:16 PM

16 points

8 comments, 8 min read

Gradient hacking: definitions and examples

Richard_Ngo

Jun 29, 2022, 9:35 PM

38 points

2 comments, 5 min read

Gradient Hacker Design Principles From Biology

johnswentworth

Sep 1, 2022, 7:03 PM

60 points

13 comments, 3 min read

Towards Deconfusing Gradient Hacking

leogao

Oct 24, 2021, 12:43 AM

39 points

3 comments, 12 min read

[Question] How does Gradient Descent Interact with Goodhart?

Scott Garrabrant

Feb 2, 2019, 12:14 AM

68 points

19 comments, 4 min read

Understanding Gradient Hacking

peterbarnett

Dec 10, 2021, 3:58 PM

41 points

5 comments, 30 min read

Thoughts on gradient hacking

Richard_Ngo

Sep 3, 2021, 1:02 PM

33 points

11 comments, 4 min read

Some motivations to gradient hack

peterbarnett

Dec 17, 2021, 3:06 AM

8 points

0 comments, 6 min read

Gradient Hacking via Schelling Goals

Adam Scherlis

Dec 28, 2021, 8:38 PM

33 points

4 comments, 4 min read

Is Fisherian Runaway Gradient Hacking?

Ryan Kidd

Apr 10, 2022, 1:47 PM

15 points

6 comments, 4 min read

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensen

Dec 20, 2022, 2:35 PM

17 points

0 comments, 6 min read

[ASoT] Simulators show us behavioural properties by default

Jozdien

Jan 13, 2023, 6:42 PM

36 points

3 comments, 3 min read

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley

Jan 9, 2024, 8:42 PM

47 points

8 comments, 36 min read

AI Can be “Gradient Aware” Without Doing Gradient hacking.

Sodium

Oct 20, 2024, 9:02 PM

20 points

0 comments, 2 min read

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

Fabien Roger and Buck

Oct 23, 2023, 4:37 PM

107 points

3 comments, 8 min read

Gradient hacking via actual hacking

Max H

May 10, 2023, 1:57 AM

12 points

7 comments, 3 min read

Eliciting Credit Hacking Behaviours in LLMs

omegastick

Sep 14, 2023, 3:07 PM

3 points

2 comments, 7 min read (github.com)

Meta learning to gradient hack

Quintin Pope

Oct 1, 2021, 7:25 PM

55 points

11 comments, 3 min read

Interpreting the Learning of Deceit

RogerDearnaley

Dec 18, 2023, 8:12 AM

30 points

14 comments, 9 min read

Obstacles to gradient hacking

leogao

Sep 5, 2021, 10:42 PM

28 points

11 comments, 4 min read
