Nice overview, David! You’ve made lots of good points and clarifications. I worry that this overview goes a little too fast for new readers. For example,
Shard theory thus predicts that dumb RL agents internalize lots of representable-by-them in-distribution proxies for reinforcement as shards, as a straightforward consequence of reinforcement events being gated behind a complex conditional distribution of task blends.
I can read this if I think carefully, but it’s a little difficult. I presently view this article as more of “motivating shard theory” and “explaining some shard theory 201” than “explaining some shard theory 101.”
Here’s another point I want to make: Shard theory is anticipation-constraining. You can’t just say “a shard made me do it” for absolutely anything. Shards activate in the kinds of contexts that previous credit-assignment invocations would have pinged. Experiments show that people get coerced into obeying cruel orders not by showing them bright blue paintings or a dog, but by putting them in front of an authoritative man in a lab coat. I model this as the situation strongly activating deference- and social-maintenance shards, which are uniquely influential here because those shards were most strongly reinforced in similar situations in the past. Shard theory would be surprised by normal people becoming particularly deferential around chihuahuas.
You might go “duh”, but this is actually a phenomenon which has to be explained! And if people just have “weird values”, there has to be a within-lifetime learning explanation for how those values got to be weird in that particular way: for why a person isn’t extra deferent around small dogs but is extra deferent around apparent doctors.
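If it helps, here’s a deliberately dumb toy sketch of the picture I have in mind (mine, not anything from your post; every feature name and number is made up): each shard carries context weights accumulated from past reinforcement, and its influence on the current decision is just how strongly the present context overlaps those weights.

    # Toy model: shards as context-gated influences, with context weights
    # accumulated from past credit assignment. Purely illustrative.
    from collections import defaultdict

    class Shard:
        def __init__(self, name):
            self.name = name
            # How strongly each context feature cues this shard, built up
            # from the contexts it was reinforced in.
            self.context_weights = defaultdict(float)

        def reinforce(self, context_features, strength=1.0):
            """Credit assignment: strengthen this shard's tie to the
            features present when it got reinforced."""
            for f in context_features:
                self.context_weights[f] += strength

        def activation(self, context_features):
            """Contextual activation: overlap between the current context
            and the contexts this shard was historically reinforced in."""
            return sum(self.context_weights[f] for f in context_features)

    deference = Shard("deference")
    # Hypothetical reinforcement history: deference got reinforced around
    # authority cues, never around small dogs.
    deference.reinforce({"lab_coat", "authoritative_tone"}, strength=3.0)

    print(deference.activation({"lab_coat", "clipboard"}))  # 3.0: strongly cued
    print(deference.activation({"chihuahua", "park"}))      # 0.0: no influence

The lab-coat feature cues the deference shard because lab coats co-occurred with past reinforcement of deference; chihuahuas never did, so they don’t. That’s the anticipation-constraining part.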
Comments on other parts:
Your “utility function” (an ordering over possible worlds, subject to some consistency conditions) is far too big for your brain to represent.
I think that committing to an ordering over possible worlds is committing to way too much. Coherence theorems tell you to be coherent over outcome lotteries, but they don’t prescribe what outcomes are. Are outcomes universe-histories? World states? Something else? This is a common way I perceive people to be misled by utility theory.
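Concretely (this is my gloss of the VNM setup, not anything from your post): coherence over lotteries gets you a utility function u over some outcome set O, with preferences between lotteries going by expected utility,

    L \succeq L' \;\iff\; \sum_i p_i \, u(o_i) \;\ge\; \sum_j q_j \, u(o_j'), \qquad u : \mathcal{O} \to \mathbb{R},

where L assigns probability p_i to outcome o_i and L′ assigns q_j to o_j′. The theorem is silent about what populates O; treating outcomes as universe-histories, world states, or something coarser is a modeling choice you make before any coherence pressure applies.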
But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input.
I feel confused by the role of “perceptual input” here. Can you give an example of a situation where the utility function gets chunked in this way?
shard theory in general doesn’t have a good account of credit assignment improving in-lifetime.
Yeah, I fervently wish I knew what was happening here. I think that sophisticated credit assignment is probably convergently bootstrapped from some genetically hard-coded dumb credit assignment.
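To gesture at what “dumb” could mean here (an analogy I’m reaching for, not a settled account): something like a hard-coded eligibility trace, where reinforcement indiscriminately strengthens whatever was recently active, with no model of what actually caused the reward.

    # Toy "dumb" credit assignment: a decaying eligibility trace over
    # recently active features; reward reinforces whatever the trace still
    # covers, causal or not. Illustrative only.
    from collections import defaultdict

    DECAY = 0.7
    trace = defaultdict(float)    # recency-weighted record of active features
    weights = defaultdict(float)  # learned feature values ("proto-shards")

    def step(active_features, reward):
        for f in list(trace):
            trace[f] *= DECAY           # old activity fades
        for f in active_features:
            trace[f] = 1.0              # current activity is fully eligible
        if reward:
            for f, eligibility in trace.items():
                weights[f] += reward * eligibility  # credit goes to everything recent

    step({"lollipop_in_view", "blue_wallpaper"}, reward=0.0)
    step({"reaching", "lollipop_in_view"}, reward=1.0)
    print(dict(weights))  # reaching and lollipop get full credit; wallpaper gets some

Something that crude seems hard-codable, and I can imagine a learned world-model later sharpening it into credit assignment that tracks actual causes, but I don’t know how that bootstrapping goes.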
But a utility function can be lossily projected down into a bounded computational object by factoring it into a few shards, each representing a term in the utility function, each term conceptually chunked out of perceptual input.
I feel confused by the role of “perceptual input” here. Can you give an example of a situation where the utility function gets chunked in this way?
I had meant to suggest that your shards interface with a messy perceptual world of incoming retinal activations and the like, but are trained to nonetheless chunk out latent variables like “human flourishing” or “lollipops” in the input stream. That is, I was suggesting a rough shape for the link between the outside world as you observe it and the ontology your shards express their ends in.
If you formalized utility functions as orderings over possible worlds (or over other equivalent objects!), and your perception simply looked over the set of all possible worlds, then there wouldn’t be anything interesting to explain about perception and the ontology your values are framed in. For agents that can’t run impossibly large computations like that, though, I think you do have something to explain here.
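To sketch the shape I mean (invented names and numbers, nobody’s actual proposal): the shard doesn’t score retinal activations directly, and it certainly doesn’t score whole possible worlds; it scores the output of a learned detector that chunks a “lollipop” latent out of the raw input.

    # Toy: a shard whose ends are expressed over a latent variable chunked
    # out of raw perceptual input, rather than over pixels or possible
    # worlds. Everything here is illustrative.

    def lollipop_detector(retinal_activations):
        """Stand-in for a learned perceptual classifier: maps raw input to
        a latent 'lollipop present' score in [0, 1]."""
        cues = ("round_shape", "stick_below", "bright_color")
        return min(1.0, sum(retinal_activations.get(c, 0.0) for c in cues) / len(cues))

    def lollipop_shard_bid(retinal_activations):
        """The shard's influence is a function of the latent variable, so
        its ontology is the learned chunking, not pixel space."""
        return 2.0 * lollipop_detector(retinal_activations)

    print(lollipop_shard_bid({"round_shape": 1.0, "stick_below": 1.0, "bright_color": 0.9}))
    print(lollipop_shard_bid({"square_shape": 1.0}))  # no lollipop cues, no bid

The interesting work, on this picture, is in how the detector gets learned and how the shard’s valuation latches onto that latent rather than onto the raw stream.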