Thanks for the response, Martin. I’d like to try to get to the heart of what we disagree on. Do you agree that someone with a sufficiently different architecture—e.g. a human who had a dog’s brain implanted somehow—would grow to have different values in some respect? For example, you mention arguing persuasively. Argument is a pretty specific ability, but we can widen our field to language in general—the human brain has pretty specific circuitry for that. A dog’s brain that lacks the appropriate language centers would likely never learn to speak, let alone argue persuasively.
I want to point out again that the disagreement is just a matter of scale. I do think relatively similar values can be learnt through similar experiences for basic RL agents; I just want to caution that for most human and animal examples, architecture may matter more than you might think.
Do you agree that someone with a sufficiently different architecture—e.g. a human who had a dog’s brain implanted somehow—would grow to have different values in some respect?
Thanks for trying to get to the heart of it, and thank you for switching to a more coherent hypothetical. Yes, according to shard theory, they would, for two reasons: different reinforcement signals and different capabilities.
Reinforcement signals
Brain → Reinforcement Signals → Reinforcement Events → Value Formation
The main way that brain architecture influences values, according to Quintin Pope and Alex Turner, is via reinforcement signals:
Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain). In some[3] fashion, the brain reinforces thoughts and mental subroutines which have led to reward, so that they will be more likely to fire in similar contexts in the future. We suspect that the “base” reinforcement learning algorithm is relatively crude, but that people reliably bootstrap up to smarter credit assignment.
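To spell out how I read that assumption, here is a minimal toy sketch in code (my own illustration, not Pope and Turner’s actual model; the contexts, subroutines, and numbers are all invented): a hard-coded reward circuit stays fixed, and whichever mental subroutine fired before reward gets a higher propensity to fire again in that context.

```python
import random
from collections import defaultdict

# Toy reading of Assumption 3 (hypothetical names throughout):
# subroutines that fired just before reward get their propensity to
# fire in that context bumped up, so they fire more often in future.

propensity = defaultdict(lambda: defaultdict(float))  # context -> subroutine -> strength

def pick_subroutine(context, subroutines):
    """Favour subroutines previously reinforced in this context."""
    weights = [1.0 + propensity[context][s] for s in subroutines]
    return random.choices(subroutines, weights=weights)[0]

def reinforce(context, subroutine, reward, lr=0.5):
    """Crude credit assignment: whatever just fired gets credit for the reward."""
    propensity[context][subroutine] += lr * reward

def reward_signal(outcome):
    """Hard-coded reward circuit, fixed by the 'architecture', not learned."""
    return 1.0 if outcome == "sweet taste" else 0.0

for _ in range(100):
    context = "food in view"
    s = pick_subroutine(context, ["grab and eat", "ignore"])
    outcome = "sweet taste" if s == "grab and eat" else "nothing"
    reinforce(context, s, reward_signal(outcome))

print(dict(propensity["food in view"]))  # "grab and eat" ends up strongly reinforced
```

On this toy reading, the context-keyed tendencies that accumulate are, very roughly, the raw material shard theory calls value shards.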
Dogs and humans have different ideal diets. Dogs have a better sense of smell than humans. Therefore it is likely that dogs and humans have different hard-coded reinforcement signals around the taste and smell of food. This is part of why dog treats smell and taste different to human treats. Human children often develop strong values around certain foods. Our hypothetical dog-human child would likely also develop strong values around food, but predictably different foods.
As human children grow up, they slowly develop values around healthy eating. Our hypothetical dog-human child would do the same. Because they have a human body, these healthy eating values would be more human-like.
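To make the step from different reinforcement signals to different food values concrete, here is a toy comparison (my own sketch, not from the shard theory posts; the foods and numbers are invented): two learners share the same learning rule and the same stream of food encounters, but their hard-coded reward signals differ, so their learned food values diverge.

```python
# Same experiences, same learning rule, different hard-coded reward signals
# -> different learned food "values". All numbers are hypothetical.

FOODS = ["chocolate", "cooked vegetables", "raw meat", "rawhide chew"]

# Hard-coded (architecture-level) reward signals, not learned.
human_reward = {"chocolate": 1.0, "cooked vegetables": 0.3, "raw meat": 0.1, "rawhide chew": 0.0}
dog_reward   = {"chocolate": 0.4, "cooked vegetables": 0.1, "raw meat": 1.0, "rawhide chew": 0.8}

def learn_food_values(reward_fn, experiences, lr=0.2):
    """Running estimate of how rewarding each food has been."""
    value = {food: 0.0 for food in FOODS}
    for food in experiences:
        value[food] += lr * (reward_fn[food] - value[food])
    return value

experiences = FOODS * 25  # identical stream of reinforcement events for both learners

print("human-brain child:", learn_food_values(human_reward, experiences))
print("dog-brain child:  ", learn_food_values(dog_reward, experiences))
```

The experiences are held fixed; only the reward circuit changes, and the learned rankings come out different.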
Capabilities
Brain → Capabilities → Reinforcement Events → Value Formation
Let’s extend the hypothetical further. Suppose that we have a hybrid dog-human with a human body and human reinforcement learning signals, but otherwise a dog brain. Does shard theory claim they will have the same values as a human?
No. While shard theory is based on the theory of “learning from scratch” in the brain, “Learning-from-scratch is NOT blank-slate”. So it’s reasonable to suppose, as you do, that the hybrid dog-human will have many cognitive weaknesses compared to humans, especially given its smaller cortex.
These cognitive weaknesses will naturally lead to different experiences, and hence different reinforcement events. Shard theory claims that “reinforcement events shape human value shards”. Accordingly, shard theory predicts different values. Perhaps the dog-human never learns to read, and so it does not value reading.
On the other hand, once you know an agent’s capabilities, those capabilities largely screen off its brain architecture. For example, to the extent that deafness influences values, the influence is the same whether the deafness is caused by a brain defect or by an ear defect. The dog-human would likely have similar values to a disabled human with similar cognitive abilities.
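Here is a toy sketch of that capabilities link (my own illustration, with invented activities and numbers): the brain only enters the picture through which experiences it makes possible, so an agent that lacks a capability never receives the corresponding reinforcement events, while two agents with the same capability set end up with the same learned values regardless of why they have that capability set.

```python
# Toy illustration of Brain -> Capabilities -> Reinforcement Events -> Value Formation.
# All names and numbers are hypothetical.

ACTIVITIES = ["reading", "listening to music", "eating", "playing fetch"]
REWARD = {"reading": 0.8, "listening to music": 0.9, "eating": 1.0, "playing fetch": 0.7}

def learned_values(capabilities, lr=0.2, episodes=50):
    """An activity only generates reinforcement events if the agent can do it."""
    value = {a: 0.0 for a in ACTIVITIES}
    for _ in range(episodes):
        for activity in ACTIVITIES:
            if activity in capabilities:  # the capability gates the experience
                value[activity] += lr * (REWARD[activity] - value[activity])
    return value

dog_human_caps  = {"listening to music", "eating", "playing fetch"}   # never learns to read
deaf_brain_caps = {"reading", "eating", "playing fetch"}              # deafness from a brain defect
deaf_ear_caps   = {"reading", "eating", "playing fetch"}              # deafness from an ear defect

print(learned_values(dog_human_caps))  # "reading" stays at 0.0: no reinforcement events, no value around reading
# Identical capabilities -> identical reinforcement events -> identical learned
# values; the cause of the deafness is screened off.
assert learned_values(deaf_brain_caps) == learned_values(deaf_ear_caps)
```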
Where is the disagreement?
My current guess is that you agree with most of the above, but disagree with the distilled claim that “value formation is very path dependent and architecture independent”.
I think the problem is that this claim is very “distilled”. Pope and Turner claim that some human values are convergent: “We think that many biases are convergently produced artifacts of the human learning process & environment”. In other words, these values end up being path independent.
I argued above that “architecture independent” is only true if you hold capabilities and reinforcement signals constant. The example given in the distillation is “a sufficiently large transformer and a sufficiently large conv net, given the same training data presented in the same order”.
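The distillation’s transformer/conv-net comparison is hard to reproduce in a few lines, but here is a much smaller analogue under my own assumptions: two different value-learning “architectures”, given the same hard-coded reward signal and the same stream of events in the same order, end up with nearly the same learned values.

```python
import numpy as np

# Two different "architectures" estimating the value of each of 4 foods,
# trained on the same reinforcement events in the same order, with the same
# hard-coded reward signal. Names and numbers are hypothetical.

rng = np.random.default_rng(0)
N_FOODS = 4
true_reward = np.array([1.0, 0.3, 0.1, 0.0])    # shared, hard-coded reward signal

events = rng.integers(0, N_FOODS, size=2000)     # shared stream of reinforcement events
rewards = true_reward[events] + rng.normal(0, 0.1, size=events.size)

# Architecture A: tabular running average.
table = np.zeros(N_FOODS)
counts = np.zeros(N_FOODS)
for food, r in zip(events, rewards):
    counts[food] += 1
    table[food] += (r - table[food]) / counts[food]

# Architecture B: linear model on one-hot features, trained by SGD.
weights = np.zeros(N_FOODS)
for food, r in zip(events, rewards):
    x = np.eye(N_FOODS)[food]
    weights += 0.05 * (r - weights @ x) * x

print("tabular:", table.round(2))
print("linear :", weights.round(2))   # both end up close to the shared reward signal
```

The agreement here depends entirely on holding the reward signal and the event stream fixed, which is exactly the part that does not carry over to brains with different hard-wired reward circuitry.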
I’m realizing that this distillation is potentially misleading when moved from the AI context to the context of natural intelligence.