Shard Theory

Shard Theory is an alignment research program concerned with the relationship between training variables and the values learned by trained Reinforcement Learning (RL) agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of learned algorithms in ML generally.

Shard theory’s basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard’s behavior garners reinforcement, the circuits implementing that shard are strengthened, making the shard more likely to trigger again in the future when given similar cognitive inputs.
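
To make that dynamic concrete, here is a minimal toy sketch, assuming a deliberately simplified setup in which each shard is reduced to a single scalar "strength". All names below (Shard, act, reinforce, the juice example) are hypothetical illustrations, not an implementation from the shard theory literature.

```python
# Toy model of shards as contextually activated, behavior-steering
# computations; everything here is an invented illustration.

class Shard:
    def __init__(self, name, trigger_contexts, action):
        self.name = name
        self.trigger_contexts = trigger_contexts  # cognitive inputs that activate the shard
        self.action = action                      # the behavior it steers toward
        self.strength = 1.0                       # stand-in for the strength of its circuits

    def activation(self, context):
        # Contextual activation: the shard bids on behavior only
        # when its triggering cognitive inputs are present.
        return self.strength if context in self.trigger_contexts else 0.0

def act(shards, context):
    # Behavior is steered by whichever active shard bids hardest.
    activation, winner = max(((s.activation(context), s) for s in shards),
                             key=lambda pair: pair[0])
    return winner if activation > 0 else None

def reinforce(shard, reward, lr=0.1):
    # Reinforcement strengthens the circuits behind the shard whose
    # behavior garnered reward, so it triggers more readily later.
    shard.strength += lr * reward

# Illustrative example: a "juice shard" forms because drinking juice
# is repeatedly reinforced in juice-containing contexts.
juice = Shard("juice-shard", trigger_contexts={"sees_juice"}, action="drink")
wander = Shard("wander-shard", trigger_contexts={"sees_juice", "empty_room"}, action="wander")

for _ in range(20):
    winner = act([juice, wander], "sees_juice")
    reward = 1.0 if winner is not None and winner.action == "drink" else 0.0
    if winner is not None:
        reinforce(winner, reward)

print(juice.strength)  # > 1.0: the juice shard now dominates this context
```

The only point of the sketch is the feedback loop: a shard that steers behavior into reinforcement gets stronger, and therefore wins its triggering contexts more reliably next time.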

Since an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with one another, which (potentially) explains human psychological phenomena like akrasia and the value changes that follow moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.
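
Along the same lines, here is a hedged sketch of the negotiation-game framing, caricaturing inter-shard bargaining as Nash bargaining over a single decision; the shard names, utilities, and disagreement points are invented for illustration.

```python
# Caricature of inter-shard negotiation as a Nash bargaining problem
# over one decision; all values below are invented for illustration.

actions = ["work", "browse", "split_time"]

# Each shard's utility over the candidate actions.
utility = {
    "career-shard":  {"work": 1.0, "browse": 0.0, "split_time": 0.6},
    "comfort-shard": {"work": 0.1, "browse": 1.0, "split_time": 0.7},
}
# Payoff each shard gets if negotiation breaks down.
disagreement = {"career-shard": 0.0, "comfort-shard": 0.0}

def nash_bargain(actions, utility, disagreement):
    # Choose the action maximizing the product of each shard's gain
    # over its disagreement payoff (the Nash bargaining solution).
    def product_of_gains(action):
        product = 1.0
        for shard, payoffs in utility.items():
            product *= max(payoffs[action] - disagreement[shard], 0.0)
        return product
    return max(actions, key=product_of_gains)

print(nash_bargain(actions, utility, disagreement))  # -> "split_time"
```

Neither shard gets its favorite action; the compromise "split_time" wins. This is a toy analogue of the kind of inter-shard compromise shard theory invokes to explain phenomena like akrasia.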

The shard theory of human values
Sep 4, 2022, 4:28 AM
255 points · 67 comments · 24 min read · LW link · 2 reviews

Shard Theory in Nine Theses: a Distillation and Critical Appraisal
LawrenceC · Dec 19, 2022, 10:52 PM
150 points · 30 comments · 18 min read · LW link

Understanding and avoiding value drift
TurnTrout · Sep 9, 2022, 4:16 AM
48 points · 14 comments · 6 min read · LW link

Contra shard theory, in the context of the diamond maximizer problem
So8res · Oct 13, 2022, 11:51 PM
105 points · 19 comments · 2 min read · LW link · 1 review

The heritability of human values: A behavior genetic critique of Shard Theory
geoffreymiller · Oct 20, 2022, 3:51 PM
82 points · 63 comments · 21 min read · LW link

Shard Theory - is it true for humans?
Rishika · Jun 14, 2024, 7:21 PM
71 points · 7 comments · 15 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems
TurnTrout · Dec 2, 2022, 2:43 AM
148 points · 22 comments · 47 min read · LW link · 3 reviews

Understanding and controlling a maze-solving policy network
Mar 11, 2023, 6:59 PM
332 points · 28 comments · 23 min read · LW link

Shard Theory: An Overview
David Udell · Aug 11, 2022, 5:44 AM
166 points · 34 comments · 10 min read · LW link

Reward is not the optimization target
TurnTrout · Jul 25, 2022, 12:03 AM
375 points · 123 comments · 10 min read · LW link · 3 reviews

A shot at the diamond-alignment problem
TurnTrout · Oct 6, 2022, 6:29 PM
95 points · 67 comments · 15 min read · LW link

Some Thoughts on Virtue Ethics for AIs
peligrietzer · May 2, 2023, 5:46 AM
77 points · 8 comments · 4 min read · LW link

AXRP Episode 22 - Shard Theory with Quintin Pope
DanielFilan · Jun 15, 2023, 7:00 PM
52 points · 11 comments · 93 min read · LW link

Paper: Understanding and Controlling a Maze-Solving Policy Network
Oct 13, 2023, 1:38 AM
70 points · 0 comments · 1 min read · LW link (arxiv.org)

General alignment properties
TurnTrout · Aug 8, 2022, 11:40 PM
50 points · 2 comments · 1 min read · LW link

Team Shard Status Report
David Udell · Aug 9, 2022, 5:33 AM
38 points · 8 comments · 3 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Dec 6, 2024, 10:19 PM
161 points · 12 comments · 11 min read · LW link (arxiv.org)

Why I’m bearish on mechanistic interpretability: the shards are not in the network
tailcalled · Sep 13, 2024, 5:09 PM
22 points · 40 comments · 1 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake
TurnTrout · Nov 19, 2024, 6:36 PM
40 points · 5 comments · 1 min read · LW link (turntrout.com)

Reward Bases: A simple mechanism for adaptive acquisition of multiple reward types
Bogdan Ionut Cirstea · Nov 23, 2024, 12:45 PM
11 points · 0 comments · 1 min read · LW link

[April Fools’] Definitive confirmation of shard theory
TurnTrout · Apr 1, 2023, 7:27 AM
169 points · 8 comments · 2 min read · LW link

Behavioural statistics for a maze-solving agent
Apr 20, 2023, 10:26 PM
46 points · 11 comments · 10 min read · LW link

Self-dialogue: Do behaviorist rewards make scheming AGIs?
Steven Byrnes · Feb 13, 2025, 6:39 PM
42 points · 0 comments · 46 min read · LW link

Research agenda: Supervising AIs improving AIs
Apr 29, 2023, 5:09 PM
76 points · 5 comments · 19 min read · LW link

The Shard Theory Alignment Scheme
David Udell · Aug 25, 2022, 4:52 AM
47 points · 32 comments · 2 min read · LW link

Human values & biases are inaccessible to the genome
TurnTrout · Jul 7, 2022, 5:29 PM
94 points · 54 comments · 6 min read · LW link · 1 review

A framework and open questions for game theoretic shard modeling
Garrett Baker · Oct 21, 2022, 9:40 PM
11 points · 4 comments · 4 min read · LW link

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading
TurnTrout · Nov 29, 2022, 6:23 AM
62 points · 41 comments · 15 min read · LW link

Disentangling Shard Theory into Atomic Claims
Leon Lang · Jan 13, 2023, 4:23 AM
86 points · 6 comments · 18 min read · LW link

Positive values seem more robust and lasting than prohibitions
TurnTrout · Dec 17, 2022, 9:43 PM
52 points · 13 comments · 2 min read · LW link

An ML interpretation of Shard Theory
beren · Jan 3, 2023, 8:30 PM
39 points · 5 comments · 4 min read · LW link

Shard theory alignment has important, often-overlooked free parameters.
Charlie Steiner · Jan 20, 2023, 9:30 AM
36 points · 10 comments · 3 min read · LW link

Review of AI Alignment Progress
PeterMcCluskey · Feb 7, 2023, 6:57 PM
72 points · 32 comments · 7 min read · LW link (bayesianinvestor.com)

Predictions for shard theory mechanistic interpretability results
Mar 1, 2023, 5:16 AM
105 points · 10 comments · 5 min read · LW link

Contra “Strong Coherence”
DragonGod · Mar 4, 2023, 8:05 PM
39 points · 24 comments · 1 min read · LW link

[Question] Is “Strong Coherence” Anti-Natural?
DragonGod · Apr 11, 2023, 6:22 AM
23 points · 25 comments · 2 min read · LW link

Clippy, the friendly paperclipper
Seth Herd · Mar 2, 2023, 12:02 AM
3 points · 11 comments · 2 min read · LW link

Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight
Jacy Reese Anthis · Nov 16, 2022, 1:54 PM
31 points · 9 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions
Nov 19, 2022, 9:04 PM
45 points · 0 comments · 3 min read · LW link

If Wentworth is right about natural abstractions, it would be bad for alignment
Wuschel Schulz · Dec 8, 2022, 3:19 PM
29 points · 5 comments · 4 min read · LW link

AGI will have learnt utility functions
beren · Jan 25, 2023, 7:42 PM
36 points · 4 comments · 13 min read · LW link

Adaptation-Executers, not Fitness-Maximizers
Eliezer Yudkowsky · Nov 11, 2007, 6:39 AM
165 points · 33 comments · 3 min read · LW link

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents
Alejandro Aristizabal · Sep 29, 2024, 12:32 AM
6 points · 0 comments · 15 min read · LW link

Humans provide an untapped wealth of evidence about alignment
Jul 14, 2022, 2:31 AM
211 points · 94 comments · 9 min read · LW link · 1 review

Broad Picture of Human Values
Thane Ruthenis · Aug 20, 2022, 7:42 PM
42 points · 6 comments · 10 min read · LW link

In Defense of Wrapper-Minds
Thane Ruthenis · Dec 28, 2022, 6:28 PM
24 points · 38 comments · 3 min read · LW link

Evolution is a bad analogy for AGI: inner alignment
Quintin Pope · Aug 13, 2022, 10:15 PM
79 points · 15 comments · 8 min read · LW link

Failure modes in a shard theory alignment plan
Thomas Kwa · Sep 27, 2022, 10:34 PM
26 points · 2 comments · 7 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability
Leon Lang · Jan 16, 2023, 10:46 PM
31 points · 7 comments · 17 min read · LW link (docs.google.com)

Steering GPT-2-XL by adding an activation vector
May 13, 2023, 6:42 PM
437 points · 98 comments · 50 min read · LW link · 1 review

The alignment stability problem
Seth Herd · Mar 26, 2023, 2:10 AM
35 points · 15 comments · 4 min read · LW link

Pessimistic Shard Theory
Garrett Baker · Jan 25, 2023, 12:59 AM
72 points · 13 comments · 3 min read · LW link