In “The Shard Theory of Human Values”, Pope and Turner write: “‘Shard theory’ also has been used to refer to insights gained by considering the shard theory of human values and by operating the shard frame on alignment. … We don’t like this ambiguous usage. We would instead say something like ‘insights from shard theory.’” I take that to mean that they do not count anything about AI alignment itself as part of shard theory. I think this will confuse many people, given how central AI alignment is to the shard theory project.
Hm, I didn’t mean to imply that. Point (2) of that decomposition was:
The shard paradigm/theory/frame of AI alignment analyzes the value formation processes which will occur in deep learning, and tries to figure out their properties.
This definitely includes AI alignment insights as part of shard theory_{AI}, but not as part of shard theory_{human values}. What I was trying to gesture at is that, e.g., “reward is not the optimization target” is not necessarily making predictions about modular contextual influences within policy networks trained via e.g. PPO. Instead, it offers a few general insights into the nature of alignment in the deep learning regime.
Interesting! I’m not sure what you’re saying here. Which of those two things (shard theory_{human values} and shard theory_{AI}) is shard theory (written without a subscript)? If the former, then the OP seems accurate. If the latter, or if shard theory without a subscript includes both, then I misread your view and will edit the post to note that this comment supersedes (my reading of) your previous statement.
Descriptively, the meaning of “shard theory” is contextual (unfortunately). If I say “I’m working on shard theory today”, I think my colleagues would correctly infer the AI case. If I say “the shard theory posts”, I’m talking about posts which discuss both the human and AI cases. If I say “shard theory predicts that humans do X”, I would be referring to shard theory_{human values}.
(FWIW I find my above disambiguation to remain somewhat ambiguous, but it’s what I have to say right now.)