Shard Theory

Tag

Last edit: 26 Oct 2024 0:18 UTC by Noosphere89

Shard theory is an alignment research program that studies the relationship between training variables and learned values in Reinforcement Learning (RL) agents. It aims to progressively flesh out a mechanistic account of human values, of learned values in RL agents, and (to a lesser extent) of the learned algorithms in ML generally.

In shard theory’s basic ontology of RL, shards are contextually activated, behavior-steering computations in neural networks (biological and artificial). When a shard garners reinforcement, the circuits that implement it are strengthened, so that shard becomes more likely to trigger again in the future when given similar cognitive inputs.
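The reinforcement dynamic described above can be illustrated with a deliberately simplified toy model. This is only a sketch under stated assumptions, not an implementation from the shard theory literature: the `Shard` class, the string-valued context features, the additive bidding rule, and the learning rate `lr` are all hypothetical choices made for illustration. Real shards would be circuits inside a trained network, not discrete objects.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical stand-in for a contextually activated computation."""
    name: str
    trigger: str           # context feature that activates this shard
    action: str            # behavior this shard bids for when active
    strength: float = 1.0  # how strongly the shard steers behavior


def active_shards(shards, context):
    """Return the shards whose trigger is present in the current context."""
    return [s for s in shards if s.trigger in context]


def choose_action(shards, context):
    """Pick the action with the largest total bid from active shards."""
    bids = {}
    for s in active_shards(shards, context):
        bids[s.action] = bids.get(s.action, 0.0) + s.strength
    return max(bids, key=bids.get) if bids else None


def reinforce(shards, context, action, reward, lr=0.5):
    """Strengthen the active shards that bid for the rewarded action."""
    for s in active_shards(shards, context):
        if s.action == action:
            s.strength += lr * reward


if __name__ == "__main__":
    shards = [
        Shard("juice", "juice_visible", "approach"),
        Shard("rest", "tired", "stay", strength=1.2),
    ]
    mixed = {"juice_visible", "tired"}
    print(choose_action(shards, mixed))  # the stronger "rest" shard wins
    # The juice shard fires alone, its action is rewarded, and it is reinforced.
    reinforce(shards, {"juice_visible"}, "approach", reward=1.0)
    print(choose_action(shards, mixed))  # the reinforced "juice" shard now wins
```

In this sketch, reward only strengthens shards that were active and bid for the action taken, which mirrors the claim that a reinforced shard is more likely to trigger again given similar inputs.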

Because an appreciable fraction of a neural network is composed of shards, large neural nets can possess quite intelligent constituent shards. These shards can be sophisticated enough to be well-modeled as playing negotiation games with each other, which would (potentially) explain human psychological phenomena like akrasia and value changes from moral reflection. Shard theory also suggests an approach to explaining the shape of human values, and a scheme for RL alignment.

The shard theory of human values

Quintin Pope and TurnTrout

4 Sep 2022 4:28 UTC

248 points

67 comments · 24 min read · LW link · 2 reviews

Shard Theory in Nine Theses: a Distillation and Critical Appraisal

LawrenceC

19 Dec 2022 22:52 UTC

143 points

30 comments · 18 min read · LW link

Understanding and avoiding value drift

TurnTrout

9 Sep 2022 4:16 UTC

48 points

11 comments · 6 min read · LW link

Contra shard theory, in the context of the diamond maximizer problem

So8res

13 Oct 2022 23:51 UTC

102 points

19 comments · 2 min read · LW link · 1 review

The heritability of human values: A behavior genetic critique of Shard Theory

geoffreymiller

20 Oct 2022 15:51 UTC

80 points

59 comments · 21 min read · LW link

Shard Theory—is it true for humans?

Rishika

14 Jun 2024 19:21 UTC

68 points

7 comments · 15 min read · LW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

328 points

27 comments · 23 min read · LW link

Reward is not the optimization target

TurnTrout

25 Jul 2022 0:03 UTC

376 points

123 comments · 10 min read · LW link · 3 reviews

Shard Theory: An Overview

David Udell

11 Aug 2022 5:44 UTC

165 points

34 comments · 10 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout

2 Dec 2022 2:43 UTC

146 points

22 comments · 47 min read · LW link · 3 reviews

A shot at the diamond-alignment problem

TurnTrout

6 Oct 2022 18:29 UTC

95 points

59 comments · 15 min read · LW link

Paper: Understanding and Controlling a Maze-Solving Policy Network

TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M and lisathiergart

13 Oct 2023 1:38 UTC

70 points

0 comments · 1 min read · LW link

(arxiv.org)

General alignment properties

TurnTrout

8 Aug 2022 23:40 UTC

50 points

2 comments · 1 min read · LW link

A framework and open questions for game theoretic shard modeling

Garrett Baker

21 Oct 2022 21:40 UTC

11 points

4 comments · 4 min read · LW link

Why I’m bearish on mechanistic interpretability: the shards are not in the network

tailcalled

13 Sep 2024 17:09 UTC

19 points

40 comments · 1 min read · LW link

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout

19 Nov 2024 18:36 UTC

37 points

4 comments · 1 min read · LW link

(turntrout.com)

[April Fools’] Definitive confirmation of shard theory

TurnTrout

1 Apr 2023 7:27 UTC

168 points

8 comments · 2 min read · LW link

Behavioural statistics for a maze-solving agent

peligrietzer and TurnTrout

20 Apr 2023 22:26 UTC

46 points

11 comments · 10 min read · LW link

Research agenda: Supervising AIs improving AIs

Quintin Pope, Owen D, Roman Engeler and jacquesthibs

29 Apr 2023 17:09 UTC

76 points

5 comments · 19 min read · LW link

Some Thoughts on Virtue Ethics for AIs

peligrietzer

2 May 2023 5:46 UTC

76 points

8 comments · 4 min read · LW link

AXRP Episode 22 - Shard Theory with Quintin Pope

DanielFilan

15 Jun 2023 19:00 UTC

52 points

11 comments · 93 min read · LW link

The Shard Theory Alignment Scheme

David Udell

25 Aug 2022 4:52 UTC

47 points

32 comments · 2 min read · LW link

Team Shard Status Report

David Udell

9 Aug 2022 5:33 UTC

38 points

8 comments · 3 min read · LW link

Human values & biases are inaccessible to the genome

TurnTrout

7 Jul 2022 17:29 UTC

94 points

54 comments · 6 min read · LW link · 1 review

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout

29 Nov 2022 6:23 UTC

60 points

42 comments · 15 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang

13 Jan 2023 4:23 UTC

86 points

6 comments · 18 min read · LW link

Positive values seem more robust and lasting than prohibitions

TurnTrout

17 Dec 2022 21:43 UTC

52 points

13 comments · 2 min read · LW link

An ML interpretation of Shard Theory

beren

3 Jan 2023 20:30 UTC

39 points

5 comments · 4 min read · LW link

Shard theory alignment has important, often-overlooked free parameters.

Charlie Steiner

20 Jan 2023 9:30 UTC

36 points

10 comments · 3 min read · LW link

Review of AI Alignment Progress

PeterMcCluskey

7 Feb 2023 18:57 UTC

72 points

32 comments · 7 min read · LW link

(bayesianinvestor.com)

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

1 Mar 2023 5:16 UTC

105 points

10 comments · 5 min read · LW link

Contra “Strong Coherence”

DragonGod

4 Mar 2023 20:05 UTC

39 points

24 comments · 1 min read · LW link

[Question] Is “Strong Coherence” Anti-Natural?

DragonGod

11 Apr 2023 6:22 UTC

23 points

25 comments · 2 min read · LW link

Clippy, the friendly paperclipper

Seth Herd

2 Mar 2023 0:02 UTC

3 points

11 comments · 2 min read · LW link

Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight

Jacy Reese Anthis

16 Nov 2022 13:54 UTC

31 points

9 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

19 Nov 2022 21:04 UTC

45 points

0 comments · 3 min read · LW link

If Wentworth is right about natural abstractions, it would be bad for alignment

Wuschel Schulz

8 Dec 2022 15:19 UTC

29 points

5 comments · 4 min read · LW link

AGI will have learnt utility functions

beren

25 Jan 2023 19:42 UTC

36 points

3 comments · 13 min read · LW link

Adaptation-Executers, not Fitness-Maximizers

Eliezer Yudkowsky

11 Nov 2007 6:39 UTC

156 points

33 comments · 3 min read · LW link

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents

Alejandro Aristizabal

29 Sep 2024 0:32 UTC

6 points

0 comments · 15 min read · LW link

Humans provide an untapped wealth of evidence about alignment

TurnTrout and Quintin Pope

14 Jul 2022 2:31 UTC

210 points

94 comments · 9 min read · LW link · 1 review

Broad Picture of Human Values

Thane Ruthenis

20 Aug 2022 19:42 UTC

42 points

6 comments · 10 min read · LW link

In Defense of Wrapper-Minds

Thane Ruthenis

28 Dec 2022 18:28 UTC

24 points

38 comments · 3 min read · LW link

Evolution is a bad analogy for AGI: inner alignment

Quintin Pope

13 Aug 2022 22:15 UTC

78 points

15 comments · 8 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang

16 Jan 2023 22:46 UTC

31 points

7 comments · 17 min read · LW link

(docs.google.com)

Failure modes in a shard theory alignment plan

Thomas Kwa

27 Sep 2022 22:34 UTC

26 points

2 comments · 7 min read · LW link

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

13 May 2023 18:42 UTC

436 points

97 comments · 50 min read · LW link

The alignment stability problem

Seth Herd

26 Mar 2023 2:10 UTC

35 points

15 comments · 4 min read · LW link

Pessimistic Shard Theory

Garrett Baker

25 Jan 2023 0:59 UTC

72 points

13 comments · 3 min read · LW link

raccoon 15 Feb 2023 3:19 UTC
3 points
0
Changed first instance of “RL” to “Reinforcement Learning (RL)” because if I didn’t immediately realize what it meant, someone who is learning this for the first time won’t think of it either.
- Raemon 15 Feb 2023 5:58 UTC
  2 points
  0
  Yeah I think this is good practice.