Shard Theory

TagLast edit: Dec 30, 2024, 10:03 AM by Dakara

Shard Theory is an alignment research program, about the relationship between training variables and learned values in trained Reinforcement Learning (RL) agents. It is thus an approach to progressively fleshing out a mechanistic account of human values, learned values in RL agents, and (to a lesser extent) the learned algorithms in ML generally.

Shard theory’s basic ontology of RL holds that shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). The circuits that implement a shard that garners reinforcement are reinforced, meaning that that shard will be more likely to trigger again in the future, when given similar cognitive inputs.

As an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, (potentially) explaining human psychological phenomena like akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

The shard theory of human values

Quintin Pope and TurnTrout

Sep 4, 2022, 4:28 AM

255 points

67 comments24 min readLW link 2 reviews

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceCDec 19, 2022, 10:52 PM

150 points

30 comments18 min readLW link

Understanding and avoiding value drift

TurnTroutSep 9, 2022, 4:16 AM

48 points

14 comments6 min readLW link

Contra shard theory, in the context of the diamond maximizer problem

So8resOct 13, 2022, 11:51 PM

105 points

19 comments2 min readLW link 1 review

The heritability of human values: A behavior genetic critique of Shard Theory

geoffreymillerOct 20, 2022, 3:51 PM

82 points

63 comments21 min readLW link

Shard Theory—is it true for humans?

RishikaJun 14, 2024, 7:21 PM

71 points

7 comments15 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTroutDec 2, 2022, 2:43 AM

148 points

22 comments47 min readLW link 3 reviews

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

Mar 11, 2023, 6:59 PM

332 points

28 comments23 min readLW link

Shard Theory: An Overview

David UdellAug 11, 2022, 5:44 AM

166 points

34 comments10 min readLW link

Reward is not the optimization target

TurnTroutJul 25, 2022, 12:03 AM

375 points

123 comments10 min readLW link 3 reviews

A shot at the diamond-alignment problem

TurnTroutOct 6, 2022, 6:29 PM

95 points

67 comments15 min readLW link

Some Thoughts on Virtue Ethics for AIs

peligrietzerMay 2, 2023, 5:46 AM

77 points

8 comments4 min readLW link

AXRP Episode 22 - Shard Theory with Quintin Pope

DanielFilanJun 15, 2023, 7:00 PM

52 points

11 comments93 min readLW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M and lisathiergart

Oct 13, 2023, 1:38 AM

70 points

0 comments1 min readLW link

(arxiv.org)

General alignment properties

TurnTroutAug 8, 2022, 11:40 PM

50 points

2 comments1 min readLW link

Team Shard Status Report

David UdellAug 9, 2022, 5:33 AM

38 points

8 comments3 min readLW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

cloud, Jacob G-W, Evzen, Joseph Miller and TurnTrout

Dec 6, 2024, 10:19 PM

165 points

12 comments11 min readLW link

(arxiv.org)

Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalledSep 13, 2024, 5:09 PM

22 points

40 comments1 min readLW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTroutNov 19, 2024, 6:36 PM

40 points

5 comments1 min readLW link

(turntrout.com)

Reward Bases: A simple mechanism for adaptive acquisition of multiple reward type

Bogdan Ionut CirsteaNov 23, 2024, 12:45 PM

11 points

0 comments1 min readLW link

[April Fools’] Definitive confirmation of shard theory

TurnTroutApr 1, 2023, 7:27 AM

169 points

8 comments2 min readLW link

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

Apr 20, 2023, 10:26 PM

46 points

11 comments10 min readLW link

Self-dialogue: Do behaviorist rewards make scheming AGIs?

Steven ByrnesFeb 13, 2025, 6:39 PM

43 points

0 comments46 min readLW link

Research agenda: Supervising AIs improving AIs

Quintin Pope, Owen D, Roman Engeler and jacquesthibs

Apr 29, 2023, 5:09 PM

76 points

5 comments19 min readLW link

The Shard Theory Alignment Scheme

David UdellAug 25, 2022, 4:52 AM

47 points

32 comments2 min readLW link

Human values & biases are inaccessible to the genome

TurnTroutJul 7, 2022, 5:29 PM

94 points

54 comments6 min readLW link 1 review

A framework and open questions for game theoretic shard modeling

Garrett BakerOct 21, 2022, 9:40 PM

11 points

4 comments4 min readLW link

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTroutNov 29, 2022, 6:23 AM

62 points

41 comments15 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon LangJan 13, 2023, 4:23 AM

86 points

6 comments18 min readLW link

Positive values seem more robust and lasting than prohibitions

TurnTroutDec 17, 2022, 9:43 PM

52 points

13 comments2 min readLW link

An ML interpretation of Shard Theory

berenJan 3, 2023, 8:30 PM

39 points

5 comments4 min readLW link

Shard theory alignment has important, often-overlooked free parameters.

Charlie SteinerJan 20, 2023, 9:30 AM

36 points

10 comments3 min readLW link

Review of AI Alignment Progress

PeterMcCluskeyFeb 7, 2023, 6:57 PM

72 points

32 comments7 min readLW link

(bayesianinvestor.com)

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

Mar 1, 2023, 5:16 AM

105 points

10 comments5 min readLW link

Contra “Strong Coherence”

DragonGodMar 4, 2023, 8:05 PM

39 points

24 comments1 min readLW link

[Question] Is “Strong Coherence” Anti-Natural?

DragonGodApr 11, 2023, 6:22 AM

23 points

25 comments2 min readLW link

Clippy, the friendly paperclipper

Seth HerdMar 2, 2023, 12:02 AM

3 points

11 comments2 min readLW link

Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight

Jacy Reese AnthisNov 16, 2022, 1:54 PM

31 points

9 comments2 min readLW link

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

Nov 19, 2022, 9:04 PM

45 points

0 comments3 min readLW link

If Wentworth is right about natural abstractions, it would be bad for alignment

Wuschel SchulzDec 8, 2022, 3:19 PM

29 points

5 comments4 min readLW link

AGI will have learnt utility functions

berenJan 25, 2023, 7:42 PM

36 points

4 comments13 min readLW link

Adaptation-Executers, not Fitness-Maximizers

Eliezer YudkowskyNov 11, 2007, 6:39 AM

165 points

33 comments3 min readLW link

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents

Alejandro AristizabalSep 29, 2024, 12:32 AM

6 points

0 comments15 min readLW link

Humans provide an untapped wealth of evidence about alignment

TurnTrout and Quintin Pope

Jul 14, 2022, 2:31 AM

211 points

94 comments9 min readLW link 1 review

Broad Picture of Human Values

Thane RuthenisAug 20, 2022, 7:42 PM

42 points

6 comments10 min readLW link

In Defense of Wrapper-Minds

Thane RuthenisDec 28, 2022, 6:28 PM

24 points

38 comments3 min readLW link

Evolution is a bad analogy for AGI: inner alignment

Quintin PopeAug 13, 2022, 10:15 PM

79 points

15 comments8 min readLW link

Failure modes in a shard theory alignment plan

Thomas KwaSep 27, 2022, 10:34 PM

26 points

2 comments7 min readLW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon LangJan 16, 2023, 10:46 PM

31 points

7 comments17 min readLW link

(docs.google.com)

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

May 13, 2023, 6:42 PM

437 points

98 comments50 min readLW link 1 review

The alignment stability problem

Seth HerdMar 26, 2023, 2:10 AM

35 points

15 comments4 min readLW link

Pessimistic Shard Theory

Garrett BakerJan 25, 2023, 12:59 AM

72 points

13 comments3 min readLW link

raccoon Feb 15, 2023, 3:19 AM
3 points
0
Changed first instance of “RL” to “Reinforcement Learning (RL)” because if I didn’t immediately realize what it meant, someone who is learning this for the first time won’t think of it either.
- Raemon Feb 15, 2023, 5:58 AM
  2 points
  0
  Parent
  Yeah I think this is good practice.