RSS

Gra­di­ent Hacking

TagLast edit: Aug 27, 2022, 6:12 PM by Multicore

Gradient Hacking describes a scenario where a mesa-optimizer in an AI system acts in a way that intentionally manipulates the way that gradient descent updates it, likely to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment

Gra­di­ent hacking

evhubOct 16, 2019, 12:53 AM
107 points
39 comments3 min readLW link2 reviews

Challenge: con­struct a Gra­di­ent Hacker

Mar 9, 2023, 2:38 AM
39 points
10 comments1 min readLW link

Some real ex­am­ples of gra­di­ent hacking

Oliver SourbutNov 22, 2021, 12:11 AM
15 points
8 comments2 min readLW link

Gra­di­ent Filtering

Jan 18, 2023, 8:09 PM
56 points
16 comments13 min readLW link

Gra­di­ent hack­ing is ex­tremely difficult

berenJan 24, 2023, 3:45 PM
164 points
22 comments5 min readLW link

Ap­proaches to gra­di­ent hacking

adamShimiAug 14, 2021, 3:16 PM
16 points
8 comments8 min readLW link

Gra­di­ent hack­ing: defi­ni­tions and examples

Richard_NgoJun 29, 2022, 9:35 PM
38 points
2 comments5 min readLW link

Gra­di­ent Hacker De­sign Prin­ci­ples From Biology

johnswentworthSep 1, 2022, 7:03 PM
60 points
13 comments3 min readLW link

Towards De­con­fus­ing Gra­di­ent Hacking

leogaoOct 24, 2021, 12:43 AM
39 points
3 comments12 min readLW link

[Question] How does Gra­di­ent Des­cent In­ter­act with Good­hart?

Scott GarrabrantFeb 2, 2019, 12:14 AM
68 points
19 comments4 min readLW link

Un­der­stand­ing Gra­di­ent Hacking

peterbarnettDec 10, 2021, 3:58 PM
41 points
5 comments30 min readLW link

Thoughts on gra­di­ent hacking

Richard_NgoSep 3, 2021, 1:02 PM
33 points
11 comments4 min readLW link

Some mo­ti­va­tions to gra­di­ent hack

peterbarnettDec 17, 2021, 3:06 AM
8 points
0 comments6 min readLW link

Gra­di­ent Hack­ing via Schel­ling Goals

Adam ScherlisDec 28, 2021, 8:38 PM
33 points
4 comments4 min readLW link

Is Fish­e­rian Ru­n­away Gra­di­ent Hack­ing?

Ryan KiddApr 10, 2022, 1:47 PM
15 points
6 comments4 min readLW link

(Ex­tremely) Naive Gra­di­ent Hack­ing Doesn’t Work

ojorgensenDec 20, 2022, 2:35 PM
17 points
0 comments6 min readLW link

[ASoT] Si­mu­la­tors show us be­havi­oural prop­er­ties by default

JozdienJan 13, 2023, 6:42 PM
36 points
3 comments3 min readLW link

Good­bye, Shog­goth: The Stage, its An­i­ma­tron­ics, & the Pup­peteer – a New Metaphor

RogerDearnaleyJan 9, 2024, 8:42 PM
47 points
8 comments36 min readLW link

AI Can be “Gra­di­ent Aware” Without Do­ing Gra­di­ent hack­ing.

SodiumOct 20, 2024, 9:02 PM
20 points
0 comments2 min readLW link

Pro­gram­matic back­doors: DNNs can use SGD to run ar­bi­trary state­ful computation

Oct 23, 2023, 4:37 PM
107 points
3 comments8 min readLW link

Gra­di­ent hack­ing via ac­tual hacking

Max HMay 10, 2023, 1:57 AM
12 points
7 comments3 min readLW link

Elic­it­ing Credit Hack­ing Be­havi­ours in LLMs

omegastickSep 14, 2023, 3:07 PM
3 points
2 comments7 min readLW link
(github.com)

Meta learn­ing to gra­di­ent hack

Quintin PopeOct 1, 2021, 7:25 PM
55 points
11 comments3 min readLW link

In­ter­pret­ing the Learn­ing of Deceit

RogerDearnaleyDec 18, 2023, 8:12 AM
30 points
14 comments9 min readLW link

Ob­sta­cles to gra­di­ent hacking

leogaoSep 5, 2021, 10:42 PM
28 points
11 comments4 min readLW link
No comments.