I have now finally read LawrenceC’s Shard Theory in Nine Theses: a Distillation and Critical Appraisal. I think it is worth reading for many people, even if they have already read my own distillation. Here is a summary of points it emphasizes that are missing from (or less emphasized in) mine. (Note: Lawrence doesn’t necessarily believe these claims and, for some of them, lists his disagreements.)
A nice picture of how the different parts of an agent composed of shards interact. This includes a planner, which I’ve not mentioned in my post.
A comparison of shard theory with the subagent model and rational agent model, and also the learning/steering model
A greater emphasis that shards care about world model concepts instead of sensory experiences
The following quote: “As far as I can tell, shard theory does not make specific claims about what form these bids take, how a planner works, how much weight each shard has, or how the bids are aggregated together into an action.”
This part is a bit more conservative than my speculations about how shards influence the log probabilities of different actions, which are more specific and falsifiable.
In my description, it may implicitly sound as if all the shards are agents. Lawrence adds more nuance:
“While not all shards are agents, shard theory claims that relatively agentic shards exist and will eventually end up “in control” of the agent’s actions.”
A mechanism for how agentic shards make less agentic shards lose influence over time (namely, by steering behavior such that the less agentic shards are not reinforced)
A greater emphasis that learned values are path-dependent (and in a way that can be steered by our choices to bring about the values we want)
The sometimes-made claim that learned values are relatively architecture-independent
A discussion of how inner misalignment may not be a problem when viewed in the shard frame.
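
To make the log-probability speculation mentioned above concrete, here is a toy sketch. It assumes, purely for illustration, that each shard adds a bid to the log probability (logit) of each action, and that the bids are summed and then normalized with a softmax. As the quote above notes, shard theory itself does not commit to any such aggregation rule, and the shard names, actions, and numbers below are hypothetical.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(shard_bids, actions):
    """Sum every shard's additive bid per action, then normalize.

    ASSUMPTION: additive bids on logits are an illustrative choice,
    not a claim made by shard theory.
    """
    logits = [
        sum(bids.get(action, 0.0) for bids in shard_bids.values())
        for action in actions
    ]
    return dict(zip(actions, softmax(logits)))


# Hypothetical shards and bids, for illustration only.
actions = ["eat_candy", "study", "rest"]
shard_bids = {
    "sugar_shard": {"eat_candy": 2.0},
    "achievement_shard": {"study": 1.5, "eat_candy": -0.5},
}

probs = action_probabilities(shard_bids, actions)
for action, p in probs.items():
    print(f"{action}: {p:.3f}")
```

In this toy setup, a shard that bids against an action lowers its probability without vetoing it, which is one way to read "influence" in a falsifiable manner. Other aggregation rules (for example, a planner that picks among shard-proposed plans) would be equally compatible with the conservative reading in the quote.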