Hm. I find I’m very scared of giving dogs chocolate and grapes, because it was emphasized in my childhood that this is a common failure mode, and so I will upweight actions which get rid of the chocolate in my hands when I’m around dogs. I expect the results of this experiment to be unclear, since a capable shard composition would want to get rid of the chocolate so it doesn’t accidentally give it to the dog, but this is also what the consequentialist would do: by (say) placing the chocolate in their pants, they can more easily use their hands for anticipated hand-related tasks (like petting the dog) without expending computational resources keeping track of the dog’s relation to the chocolate.
More generally, it seems hard to separate shard-theoretic hypotheses from results-focused (consequentialist) reasoning hypotheses without a much better understanding of the thought processes or values going into each; I think this is mostly because both theories are still in their infancy.
Here’s how I think about it: Capable agents will be able to do consequentialist reasoning, but the shard-theory-inspired hypothesis is that running the consequences through your world-model is harder / less accessible / less likely than just letting your shards vote on it. If you’ve been specifically taught that chocolate is bad for dogs, maybe this is a bad example.
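To make the contrast concrete, here is a toy sketch of the two action-selection styles in the chocolate-and-dog scenario. Everything here (the shard names, vote weights, world model, and utility function) is made up purely for illustration, not a claim about how shards are actually implemented. The point is that shard voting picks an action with no lookahead, while the consequentialist rolls each action through a world model; in this example they converge on the same action, which is exactly why the experiment's results would be ambiguous.

```python
# Toy sketch (all names and numbers hypothetical): shard-style voting
# vs. consequentialist rollout, selecting among the same actions.

ACTIONS = ["give_chocolate", "pocket_chocolate", "pet_dog"]

# Shard-style: contextually activated shards cast weighted votes per
# action; the "dog-safety" shard is strongly upweighted by childhood
# reinforcement against feeding dogs chocolate.
SHARDS = {
    "dog_safety": {"give_chocolate": -5.0, "pocket_chocolate": 3.0, "pet_dog": 1.0},
    "affection":  {"give_chocolate": 1.0,  "pocket_chocolate": 0.0, "pet_dog": 2.0},
}

def shard_vote(shards):
    """Sum each shard's votes per action and take the argmax (no lookahead)."""
    totals = {a: sum(s.get(a, 0.0) for s in shards.values()) for a in ACTIONS}
    return max(totals, key=totals.get)

# Consequentialist: run each action through a (toy) world model and
# score the predicted outcome with an explicit utility function.
WORLD_MODEL = {
    "give_chocolate":   {"dog_poisoned": True,  "hands_free": True},
    "pocket_chocolate": {"dog_poisoned": False, "hands_free": True},
    "pet_dog":          {"dog_poisoned": False, "hands_free": False},
}

def utility(outcome):
    # Poisoning the dog is catastrophic; free hands are mildly useful.
    return (-100.0 if outcome["dog_poisoned"] else 0.0) + (1.0 if outcome["hands_free"] else 0.0)

def consequentialist(world_model):
    """Pick the action whose simulated outcome scores highest."""
    return max(ACTIONS, key=lambda a: utility(world_model[a]))

print(shard_vote(SHARDS))          # → pocket_chocolate
print(consequentialist(WORLD_MODEL))  # → pocket_chocolate
```

Both policies output the same behavior here, so observing "the agent got rid of the chocolate" doesn't discriminate between the hypotheses; you'd need settings where the two mechanisms diverge.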
I also wasn’t trying to think about whether shards are subagents; this came out of a discussion on finding the simplest possible shard theory hypotheses and applying them to gridworlds.