Thomas Kwa suggested that consequentialist agents seem to have less superficial (observation, belief state) → action mappings. E.g. a shard agent might have:
An “it’s good to give your friends chocolate” subshard
A “give dogs treats” subshard
→ An impulse to give dogs chocolate, even though the shard agent knows what the result would be
But a consequentialist would just reason about what happens, rather than relying on those heuristics. (Of course, consequentialism would be a matter of degree.)
In this way, changing a small set of decision-relevant features (e.g. “brown dog treat” → “brown ball of chocolate”) changes the consequentialist’s action logits a lot, way more than it changes the shard agent’s logits. In a squinty, informal way, the (belief state → logits) function has a higher Lipschitz constant / is less smooth for the consequentialist agent than for the shard agent.
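To put one possible formal gloss on that (my notation, purely illustrative, not anything from the original suggestion): write f for the map from belief states to action logits.

```latex
% Rough formalization of the sensitivity claim (illustrative notation only).
% For a small, decision-relevant perturbation b -> b'
% (e.g. "dog treat" -> "chocolate"):
\[
  \|f_{\mathrm{cons}}(b) - f_{\mathrm{cons}}(b')\|
  \;\gg\;
  \|f_{\mathrm{shard}}(b) - f_{\mathrm{shard}}(b')\|
  \quad \text{even though } \|b - b'\| \text{ is small.}
\]
% Equivalently, the smallest $L$ with $\|f(b) - f(b')\| \le L\,\|b - b'\|$
% (the Lipschitz constant of $f$) is larger for the consequentialist agent.
```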
So maybe one (pre-deception) test for consequentialist reasoning is to test sensitivity of decision-making to small perturbations in observation-space (e.g. dog treat → tiny chocolate) but large perturbations in action-consequence space (e.g. happy dog → sick dog). You could spin up two copies of the model to compare.
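Here’s a minimal sketch of what that comparison might look like, assuming some policy model that maps observation features to action logits; all names here (policy, obs_treat, obs_chocolate, the two model copies) are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def action_shift(policy, obs_a, obs_b):
    """How much the policy's action distribution moves under a small,
    decision-relevant observation perturbation (e.g. 'brown dog treat'
    vs. 'brown ball of chocolate'), measured as KL(p_a || p_b)."""
    with torch.no_grad():
        log_p_a = F.log_softmax(policy(obs_a), dim=-1)  # original observation
        log_p_b = F.log_softmax(policy(obs_b), dim=-1)  # perturbed observation
    return (log_p_a.exp() * (log_p_a - log_p_b)).sum().item()

# Hypothetical usage: spin up two copies/variants of the model and compare.
# The hypothesis predicts the more consequentialist copy shows the larger shift,
# since the perturbation is large in action-consequence space (happy dog -> sick dog).
# shift_1 = action_shift(model_copy_1, obs_treat, obs_chocolate)
# shift_2 = action_shift(model_copy_2, obs_treat, obs_chocolate)
```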
Hm. I find I’m very scared of giving dogs chocolate and grapes, because it was emphasized in my childhood that this is a common failure mode, and so I will upweight actions which get rid of the chocolate in my hands when I’m around dogs. I expect the results of this experiment to be unclear: a capable shard composition would want to get rid of the chocolate so it doesn’t accidentally give the chocolate to the dog, but this is also what the consequentialist would do, so that they can (say) more easily use their hands for anticipated hand-related tasks (like petting the dog) without needing to expend computational resources keeping track of the dog’s relation to the chocolate (if they place the chocolate in their pants).
More generally, it seems hard to separate shard-theoretic hypotheses from results-focused reasoning hypotheses without much understanding of the thought processes or values going into each, mostly I think because both theories are still in their infancy.
Here’s how I think about it: Capable agents will be able to do consequentialist reasoning, but the shard-theory-inspired hypothesis is that running the consequences through your world-model is harder / less accessible / less likely than just letting your shards vote on it. If you’ve been specifically taught that chocolate is bad for dogs, maybe this is a bad example.
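As a toy illustration of that distinction (entirely made-up code, not a claim about how either kind of agent is actually implemented; shard.vote, world_model.predict, and value_fn are hypothetical), the shard agent sums cheap, context-triggered votes over actions, while the consequentialist rolls each action through a world model and evaluates the predicted outcome:

```python
# Toy contrast between the two decision procedures (illustrative only).

def shard_agent_scores(observation, shards):
    """Score actions by summing each shard's cheap, context-triggered votes;
    no outcome is ever simulated."""
    scores = {}
    for shard in shards:
        for action, vote in shard.vote(observation).items():
            scores[action] = scores.get(action, 0.0) + vote
    return scores

def consequentialist_scores(observation, actions, world_model, value_fn):
    """Score actions by predicting their consequences with a world model and
    evaluating those predicted outcomes: costlier, but tracks what actually happens."""
    return {action: value_fn(world_model.predict(observation, action))
            for action in actions}
```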
I also wasn’t trying to think about whether shards are subagents; this came out of a discussion on finding the simplest possible shard theory hypotheses and applying them to gridworlds.