(Partly transcribed from a correspondence on Eleuther.)
I disagree about concepts in the human world model being inaccessible in theory to the genome. I think lots of concepts could be accessed, and that (2) is true in the trilemma.
Consider: As a dumb example that I don’t expect to actually be the case but which gives useful intuition, suppose the genome really wants to wire something up to the tree neuron. The genome could encode a handful of images of trees, and then, once the brain is fully formed, go through and search for whichever neuron activates the hardest on those images. (Of course it wouldn’t actually use literal images, but I expect compressing them down to not actually be that hard.) The more general idea is that we can specify concepts in the world model extensionally, by specifying constraints that the concept has to satisfy (for instance, it should activate on these particular data points, or it should have this particular temporal consistency, etc.). Keep in mind this means the genome just has to vaguely gesture at the concept, not define the decision boundary exactly.
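To make the intuition concrete, here is a toy sketch of that selection rule. Everything here is illustrative, not a claim about how brains or genomes actually work: the "world model" is a random linear feature bank, the "genome's exemplars" are noisy samples near one unit's feature direction, and the rule picks whichever unit fires hardest on the exemplars.

```python
import numpy as np

# Toy "extensional" concept location: the genome stores a few exemplar stimuli,
# and after the world model is learned, it selects whichever unit responds most
# strongly to those exemplars. All names and numbers here are illustrative.

rng = np.random.default_rng(0)

n_units, n_features = 1000, 64
# Stand-in for a learned world model: each unit is a linear feature detector.
unit_weights = rng.normal(size=(n_units, n_features))

# The "tree unit" the brain happened to learn (its index is unknown to the genome).
tree_unit = 137
tree_direction = unit_weights[tree_unit]

# The genome's hardcoded exemplars: noisy samples near the tree direction.
exemplars = tree_direction + 0.3 * rng.normal(size=(10, n_features))

# Selection rule: pick the unit with the highest mean activation on the exemplars.
activations = exemplars @ unit_weights.T          # shape (10, n_units)
selected = int(np.argmax(activations.mean(axis=0)))

print(selected == tree_unit)
```

The point is that the exemplars only gesture at the concept: they don't define the unit's decision boundary, they just pick it out from among its neighbors, which is robust to fairly large exemplar noise here.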
If this sounds familiar, that’s because this basically corresponds to the naivest ELK solution where you hope the reporter generalizes correctly. This probably even works for lots of current NNs. The fact that this works in humans and possibly current NNs, though, is not really surprising to me, and doesn’t necessarily imply that ELK continues to work in superintelligence. In fact, to me, the vast majority of the hardness of ELK is making sure it continues to work up to superintelligence/arbitrarily weird ontologies. One can argue for natural abstractions, but that would be an orthogonal argument to the one made in this post. This is why I think (2) is true, though I think the statement would be more obvious if stated as “the solution in humans doesn’t scale” rather than “can’t be replicated”.
Note: I don’t expect very many things like this to be hard coded; I expect only a few things to be hard coded and a lot of things to result as emergent interactions of those things. But this post is claiming that the hard coded things can’t reference concepts in the world model at all.
As for more abstract concepts: I think encoding the concept of, say, death is actually extremely doable extensionally. There are a bunch of ways we can point at the concept of death relative to other anticipated experiences/concepts (e.g., the thing that follows serious illness and pain, unconsciousness/the thing that’s like dreamless sleep, the thing that we observe happens to other beings that causes them to become disempowered, etc.). Anecdotally, people do seem to be afraid of death in large part because they’re afraid of losing consciousness, the pain that comes before it, the disempowerment of no longer being able to affect things, etc. Again, none of these things have to point exactly at death; they just serve to select out the neuron(s) that encode the concept of death. Further evidence for this theory: humans across many cultures and even many animals pretty reliably develop an understanding of death in their world models, so it seems plausible that evolution would have had time to wire things up; and it’s a fairly well-known phenomenon that very small children, who don’t yet have well-formed world models, tend to endanger themselves with seemingly no fear of death. This all also seems consistent with the fact that lots of things we seem fairly hardwired to care about (e.g., death, happiness, etc.) splinter: we’re wired to care about things as specified by some set of points that were relevant in the ancestral environment, and the splintering happens because those points don’t actually define a sharp decision boundary.
As for why I think more powerful AIs will have more alien abstractions: there are many situations where human abstractions are used because they are optimal for a mind with our constraints. In some situations, given more computing power, you ideally want to model things at a lower level of abstraction. If you can calculate how a coin will land by modelling the air currents and its rotational speed, you want to do that to predict the exact outcome, rather than abstracting it away as a Bernoulli process. Conversely, sometimes there are high levels of abstraction that carve reality at the joints but require fitting too much stuff in your mind at once, or involve regularities of the world that we haven’t discovered yet. Consider how an understanding of thermodynamics lets you predict macroscopic properties of a system, but only if you already know about thermodynamics and are capable of understanding it. Thus, it seems highly likely that a powerful AI would develop very weird abstractions from our perspective. To be clear, I still think the natural abstraction hypothesis is likely enough to be true that it’s worth elevating to a hypothesis under consideration, and a large part of my remaining optimism lies there, but I don’t think it’s automatically true at all.
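The coin example can be made concrete with a deliberately crude toy model. The physics here is invented for illustration only: the point is just that the outcome is a deterministic function of initial conditions, so a predictor with enough compute gets a definite answer where a bounded predictor keeps only a distribution.

```python
import math

def coin_outcome(omega: float, t_flight: float) -> str:
    """Low-level (toy) deterministic model: the total rotation angle
    before landing decides which face is up. Purely illustrative physics."""
    angle = omega * t_flight            # radians rotated during flight
    half_turns = int(angle / math.pi)   # each half-turn flips the visible face
    return "heads" if half_turns % 2 == 0 else "tails"

def bernoulli_abstraction() -> dict:
    """Bounded model: throw away the physics, keep only the outcome
    distribution. Cheap to compute, but can never say which way THIS flip lands."""
    return {"heads": 0.5, "tails": 0.5}

# The low-level model yields a definite outcome given the initial conditions;
# the abstraction only yields probabilities.
print(coin_outcome(omega=40.0, t_flight=0.5))
print(bernoulli_abstraction())
```

The trade-off is the one in the paragraph above: the deterministic model is strictly more informative, but only worth running if you can measure the inputs and afford the computation.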
Hm. Here’s another stab at isolating my disagreement (?) with you:
I agree that, in theory, there exist (possibly extremely complicated) genotypes which do specify extensive hardcoded circuitry which does in practice access certain abstract concepts like death.
(Because you can do a lot if you’re talking about “in theory”; it’s probably the case that a few complicated programs which don’t seem like they should work will work, even though most do fail.)
I think the more complicated indirect specifications (like associatively learning where the tree abstraction is learned) are “plausible” in the sense that a not-immediately-crisply-debunkable alignment idea seems “plausible”, but if you actually try that kind of idea in reality, it doesn’t work (with high probability).
But marginalizing over all such implausible “plausible” ideas and adding in evolution’s “multiple tries” advantage and adding in some unforeseen clever solutions I haven’t yet considered, I reach a credence of about 4-8% for such approaches actually explaining significant portions of human mental events.
So now I’m not sure where we disagree. I don’t think it’s literally impossible for the genome to access death, but it sure sounds sketchy to me, so I assign it low credence. I agree that (2) is possible, but I assign it low credence. You don’t think it’s impossible either, and you seem to agree that relatively few things are in fact hardcoded, yet you think (2) is the resolution to the trilemma. But wouldn’t that imply (3) instead, even if, perhaps for a select few concepts, (2) is the case?
Here are some miscellaneous comments:
> The fact that this works in humans and possibly current NNs
(Nitpick for clarity) “Fact”? Be careful not to condition on your own hypothesis! I don’t think you’re literally doing that, but for other readers, I want to flag this as importantly an inference on your part and not an observation. (LMK if I unintentionally do this elsewhere, of course.)
> Note: I don’t expect very many things like this to be hard coded; I expect only a few things to be hard coded and a lot of things to result as emergent interactions of those things.
Ah, interesting, maybe we disagree less than I thought. Do you have any sense of your numerical value of “a few”, or some percentage? I think a lot of the most important shard theory inferences only require that most of the important mental events/biases/values in humans are convergently downstream results of a relatively small set of hardcoded circuitry.
> even many animals pretty reliably develop an understanding of death in their world models
I buy that maybe chimps and a few other animals understand death. But I think “grieves” and “understands death-the-abstract-concept as we usually consider it” and “has a predictive abstraction around death (in the sense that people probably have predictive abstractions around edge detectors before they have a concept of ‘edge’)” are importantly distinct propositions.
> There are a bunch of ways we can point at the concept of death relative to other anticipated experiences/concepts (e.g., the thing that follows serious illness and pain, unconsciousness/the thing that’s like dreamless sleep, the thing that we observe happens to other beings that causes them to become disempowered, etc.)
FWIW I think that lots of these other concepts are also inaccessible and run into various implausibilities of their own.
I don’t think that defining things “extensionally” in this manner works for any even moderately abstract concepts. I think that human concepts are far too varied for this to work. E.g., different cultures can have very different notions of death. I also think that the evidence from children points in the other direction. Children often have to be told that death is bad, that it’s not just a long sleep, or that the dead person/entity hasn’t just gone away somewhere far off. I think that, if aversion to death were hard coded, we’d see children quickly gain an aversion to death as soon as they discovered the concept.
I also think you can fully explain the convergent aversion to death simply by the fact that death is obviously bad relative to your other values. E.g., I’d be quite averse to having my arm turn into a balloon animal, but that’s not because it was evolutionarily hard-coded into me. I can just roll out the consequences of that change and see that they’re bad.
I’d also note that human abstractions vary quite a lot, but having different abstractions doesn’t seem to particularly affect humans’ levels of morality / caring about each other. E.g., blind people don’t have any visual abstractions, but are not thereby morally deficient in any way. Note that blindness means that the entire visual cortex is no longer dedicated to vision, and can be repurposed for other tasks. This “additional hardware” seems like it should somewhat affect which distribution of abstractions are optimal (since the constraints on the non-visual tasks have changed). And yet, values seem quite unaffected by that.
Similarly, learning about quantum physics, evolution, neuroscience, and the like doesn’t then cause your morality to collapse. In fact, the abstractions that are most likely to affect a human’s morality, such as religion, political ideology and the like, do not seem very predictively performant.
The fact that different cultures have different concepts of death, or that it splinters away from the things it was needed for in the ancestral environment, doesn’t seem to contradict my claim. What matters is not that the ideas are entirely the same from person to person, but rather that the concept has the kinds of essential properties that mattered in the ancestral environment. For instance, as long as the concept of death you pick out can predict that killing a lion makes it no longer able to kill you, that dying means disempowerment, etc., it doesn’t matter if you also believe ghosts exist, as long as your ghost belief isn’t so strong that it makes you not mind being killed by a lion.
I think these core properties are conserved across cultures. Grab two people from extremely different cultures and they can agree that people eventually die, and if you die your ability to influence the world is sharply diminished. (Even people who believe in ghosts have to begrudgingly accept that ghosts have a much harder time filing their taxes.) I don’t think this splintering contradicts my theory at all. You’re selecting out the concept in the brain that best fits these constraints, and maybe in one brain that comes with ghosts and in another it doesn’t.
To be fully clear, I’m not positing the existence of some kind of globally universal concept of death or whatever that is shared by everyone, or that concepts in brains are stored at fixed “neural addresses”. The entire point of doing ELK/ontology identification is to pick out the thing that best corresponds to some particular concept in a wide variety of different minds. This also allows for splintering outside the region where the concept is well defined.
I concede that fear of death could be downstream of other fears rather than encoded. However, I still think it’s wrong to believe that this isn’t possible in principle, and I think these other fears/motivations (wanting to achieve values, fear of , etc) are still pretty abstract, and there’s a good chance of some of those things being anchored directly into the genome using a similar mechanism to what I described.
I don’t get how the case of morality existing in blind people relates. Sure, it could affect the distribution somewhat. That still shouldn’t break extensional specification. I’m worried that maybe your model of my beliefs looks like the genome encoding some kind of fixed neural address thing, or a perfectly death-shaped hole that accepts concepts that exactly fit the mold of Standardized Death Concept, and breaks whenever given a slightly misshapen death concept. That’s not at all what I’m pointing at.
I feel similarly about the quantum physics and neuroscience cases. My theory doesn’t predict that your morality collapses when you learn about quantum physics! Your morality is defined by extensional specification (possibly indirectly; the genome probably doesn’t directly encode many examples of what’s right and wrong), and within any new ontology you use your extensional specification to figure out which things are moral. Sometimes this is smooth, when you make small localized changes to your ontology. Sometimes you will experience an ontological crisis (empirically, it seems many people experience some kind of crisis of morality when concepts like free will get called into question by quantum mechanics, for instance), and then you inspect lots of examples of things you’re confident about and try to find something in the new ontology that stretches to cover all of those cases (which is extensional reasoning). None of this contradicts the idea that morality, or rather its many constituent heuristics built on high-level abstractions, can be defined extensionally in the genome.
I like the tree example, and I think it’s quite useful (and fun) to think of dumb and speculative ways for the genome to access world-model concepts. For instance, in response to “I infer that the genome cannot directly specify circuitry which detects whether you’re thinking about your family”, the genome could:
Hardcode a face detector, and store the face most seen during early childhood (for instance, to link it to the reward center).
Store the faces of people whose odor is similar to amniotic-fluid odor, or who have a weak odor (if you’re insensitive to your own smell and family members have more similar smells).
In these cases, I’m not sure whether it counts for you as the genome directly specifying circuitry, but it should quite robustly point to a real-world concept (which could be “gamed” in certain situations, like adoptive parents, but I think that’s actually what happens).
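The first mechanism above amounts to a very simple imprinting rule, which can be sketched in a few lines. This is a toy model only: the face detector is assumed to already emit discrete identity labels, which hides all the hard perceptual work.

```python
from collections import Counter

# Toy imprinting sketch: a hardcoded detector emits one identity label per
# observation during a critical period, and the most frequently seen identity
# is the one that gets linked to the reward center. Labels are illustrative.

def imprint_on_caregiver(face_observations: list[str]) -> str:
    """Return the identity seen most often during the critical period."""
    counts = Counter(face_observations)
    return counts.most_common(1)[0][0]

# Observations during "early childhood": mostly the caregiver.
observations = ["parent", "parent", "stranger", "parent", "sibling", "parent"]
print(imprint_on_caregiver(observations))
```

Note how the rule points at whoever is actually around during the critical period, not at genetic kin per se, which is exactly why it is gameable by situations like adoption.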
I totally buy that the genome can do those things, but I think that it will probably not be locating the “family” concept in your learned world model.