Good post. I left a comment on the original shard theory post echoing more or less these same objections. TurnTrout left a reply which I think may address some of your critique:
This seems wrong to me. Twin studies, GCTA estimates, and actual genetic predictors all predict that a portion of the variance in human biases is “hardcoded” in the genome.
I’d also imagine that mathematical skill is heritable. [Finds an article on Google Scholar] The abstract of https://doi.org/10.1037/a0015115 seems to agree. Yet due to information inaccessibility and the lack of ancestral selection pressure, I infer math ability probably isn’t hardcoded.
There are a range of possible explanations which reconcile these two observations, like “better genetically specified learning hyperparameters in brain regions which convergently get allocated to math” or “tweaks to the connectivity initialization procedure[1] involving that brain region (how neurons get ~randomly wired up at the local level).”
I expect similar explanations for heritability of biases.
I guess the relevant questions here are:
Can an AI learn a set of these evolutionarily optimized hyperparameters close to those possessed by humans?
How much can we mess up while learning those and still get an agent that generally shares human values?
Shard theory still seems quite useful to me even if it turns out that tweaking these hyperparameters is essential to reproducing human-like values in AI, if only because it provides a very good description of how values are actually encoded (contextually activated “shards”). Before shard theory, I hadn’t heard a good explanation of how humans were able to avoid wireheading.
(Elucidating the genetic architecture of human values would be enormously useful for AI alignment, in my opinion.)
What exactly do you mean by this?
Thus, I think that Shard Theory has some good insights and some promise as a research paradigm, but I think it needs some updating in terms of its model of human evolution, genetics, development, neuroscience, psychology, and values.
Agreed.
We should expect that advanced AI systems will typically have access to the complete genomes of the people they interact with most often – and will be able to use polygenic scores to translate those genomes into predicted value profiles.
I’m not sure that this will actually matter as much as people assume it will. Possessing a few megabytes of compressed data from someone’s social media profiles might tell you even more about them than their genome, because it captures both genetically and environmentally caused variance.
I would also guess that there may be quite a few laws limiting how advanced AI can use people’s genetic information.
Overall I thought this post was good, though a bit long for the relatively simple message it was trying to convey. I very much appreciated many of the links you included, especially in the “human values are heritable” section.
My current impression is that shard theory has a lot of new and valuable insights to offer, but in order to be a complete description of human values it needs to incorporate a mechanistic description of the ways in which genetics influence value formation.
Typo feedback, ignore unless you are the author
For example, Quintin Trout argues
Typo. I’m guessing this was supposed to say “Turn Trout”?
GeneSmith—thanks for your comment. I’ll need to think about some of your questions a bit more before replying.
But one idea popped out to me: the idea that shard theory offers ‘a good explanation of how humans were able to avoid wireheading.’
I don’t understand this claim on two levels:
I may be missing something about shard theory, but I don’t actually see how it could prevent humans, at a general level, from hacking their reward systems in many ways
As an empirical matter, humans do, in fact, hack our reward systems in thousands of ways that distract us from the traditional goals of survival and reproduction (i.e. in ways that represent catastrophic ‘alignment failures’ with our genetic interests). My book ‘Spent’ (2008), about the evolutionary psychology of consumer behavior, detailed many examples. Billions of people spend many hours a day on social media, watching fictional TV shows, and playing video games—rather than doing anything their Pleistocene ancestors would have recognized as reproductively relevant real-world behaviors. We are the world champions at wire-heading, so I don’t see how a theory like Shard Theory that predicts the impossibility of wire-heading could be accepted as empirically accurate.
When I say “wireheading” I’m referring to something more specific than simply “humans not employing strategies that maximize reproductive fitness.”
I’m talking about how people know that you could experience the greatest possible pleasure by hooking up wires directly to the reward centers in your brain, yet they don’t pursue that strategy.
If you model humans as reward-maximizing agents, that behavior doesn’t really make sense.
GeneSmith—when people in AI alignment or LessWrong talk about ‘wireheading’, I understood that not to refer to people literally asking neurosurgeons to stick wires into their brains, but rather to a somewhat larger class of ways to hack one’s own reward systems through the usual perceptual input channels.
I agree that humans are not ‘reward-maximizing agents’, whatever that is supposed to mean in reference to actual evolved organisms with diverse, heterogeneous, & domain-specific motivational systems.
I don’t think I explained my thinking clearly enough.
If we use “wireheading” to refer to the broader class of actions that increase reward at the cost of reproductive fitness, then I agree humans in general do wirehead to some degree. But even if we count taking recreational drugs or masturbation as wireheading, I still don’t believe any other theory of values explains the relative rarity of these behaviors as well as shard theory does.
If humans were truly reward maximizers, it’s difficult to imagine how they would manage to avoid wireheading as well as they do. Perhaps the “thousand genetic hacks” theory could explain it, if evolution were clever enough? There’s certainly some evidence that when humans are exposed to new sources of reward that do nothing to benefit reproductive fitness, it’s often a disaster. See the numerous cases of hunter-gatherer peoples being exposed to alcohol for the first time.
But again… think about the actual wireheading example. There must be millions of humans who know about wireheading, yet so far as I know there are zero examples of people doing it recreationally. There was nothing similar to wireheading in the ancestral environment. Yet nearly everyone seems averse to the idea of literally wireheading themselves.
Why? Humans can anticipate how incredibly rewarding it would be to wirehead. And many humans could afford full-time caretakers to ensure they would be able to experience the rewards for years or decades. So why are people averse to the idea?
My interpretation is that humans develop their initial shards of value during childhood, in an environment that usually contains limited opportunities to wirehead. As the world model fills in, most sensory environments come to activate at least one shard whose “values” are not aligned with wireheading.
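To make that intuition concrete, here’s a toy sketch of my own (not code from the post or from shard theory’s authors; the shard names, triggers, and numbers are all made up): model shards as contextually activated value functions, where the agent picks whatever action the currently active shards most endorse. A “wirehead” action that no shard happens to value never gets chosen, even if it would maximize raw reward.

```python
# Toy illustration (my own, hypothetical): "shards" as contextually
# activated value functions. A shard only fires in contexts matching its
# triggers, and the agent picks the action its active shards collectively
# endorse most. An action no shard endorses (like "wirehead") is never
# chosen, even if it would maximize raw reward.

from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    triggers: set       # contexts in which this shard activates
    endorsements: dict  # action -> how strongly the shard values it

    def score(self, context: str, action: str) -> float:
        if context not in self.triggers:
            return 0.0  # shard is dormant outside its contexts
        return self.endorsements.get(action, 0.0)


def choose_action(shards, context, actions):
    # Pick the action with the highest summed endorsement from active shards.
    return max(actions, key=lambda a: sum(s.score(context, a) for s in shards))


shards = [
    Shard("sugar", {"kitchen"}, {"eat_cake": 2.0}),
    Shard("social", {"kitchen", "party"}, {"talk_to_friend": 3.0}),
]
# "wirehead" would yield maximal raw reward, but no shard endorses it.
actions = ["eat_cake", "talk_to_friend", "wirehead"]

print(choose_action(shards, "kitchen", actions))  # -> talk_to_friend
print(choose_action(shards, "party", actions))    # -> talk_to_friend
```

Obviously real shards would be learned circuits, not lookup tables; the point is just that an agent steered by context-gated values, rather than by predicted reward, has no component that votes for wireheading in the first place.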
I think shard theory has a better explanation of this relative lack of wireheading than alternative models. But it’s obviously incomplete without a description of the free parameters that can be tweaked by genetics to produce the distribution of behaviors we see in the human population.
This is why I am hopeful others will start working on establishing the mathematics of shard theory so that we can see if shards really do form in neural networks, and if so how they behave.
GeneSmith—I guess I’m still puzzled about how Shard Theory prevents wireheading (broadly construed); I just don’t see it as a magic bullet that can keep agents focused on their ultimate goals. I must be missing something.
And, insofar as Shard Theory is supposed to be an empirically accurate description of human agents, it would need to explain why some people become fentanyl addicts who might eventually overdose, and others don’t. Or why some people pursue credentials and careers at the cost of staying childless… while others settle down young, have six kids, and don’t worry as much about status-seeking. Or why some people take up free solo mountain climbing, for the rush, and fall to their deaths by age 30, whereas others are more risk-averse.
Modern consumerist capitalism offers thousands of ways to ‘wirehead’ our reward systems that don’t require experimental neurosurgery—and billions of people get caught up in those reward-hacks. If Shard Theory is serious about describing actual human behavior, it needs some way to describe both our taste for many kinds of reward-hacking, and our resistance to it.