What do you mean by ‘enormous news for AI alignment’? That either of these would be surprising to people in the field? Or that resolving that dilemma would be useful to build from? Or something else?
More that it would be useful. I would be surprised by (1) and predict (3) is the case. I expect there are indirect handles that the genome uses to guide the human value formation process towards values that were adaptive in the ancestral environment, without directly having to solve information inaccessibility. I think we’ll be able to use similar methods to guide an AI’s value formation process towards our desired outcomes.
Alex and I both agree that the genome can influence learned behavior and concepts by exploiting its access to sensory ground truth. E.g., imprinting can work through simple circuitry that forms positive affect around certain clusters of visual features in an animal’s early environment. Typically, these feature clusters correspond to the animal’s parents, but note that the imprinting circuitry is so imprecise that even visual clusters very different from the species in question can still trigger imprinting (e.g., humans).
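To make the flavor of circuit I have in mind concrete, here’s a deliberately silly toy sketch (Python, purely illustrative; the feature vectors, thresholds, and species are all made up, and nothing here is a claim about real neural wiring). The point is just that a crude, loosely-tuned proxy over sensory features is enough to attach positive affect to whatever roughly-matching thing shows up early, which is exactly why it can latch onto a human caretaker:

```python
# Toy sketch, purely illustrative: a crude hard-coded proxy over visual features
# that tags whatever loosely matches an innate template with positive affect
# during an early critical period. All vectors/thresholds are made up; this is
# not a claim about real neural circuitry.

import numpy as np

def crude_proxy(features: np.ndarray, template: np.ndarray, threshold: float = 0.6) -> bool:
    """Fire if the observed feature cluster is loosely similar to a coarse innate template."""
    similarity = float(np.dot(features, template) /
                       (np.linalg.norm(features) * np.linalg.norm(template)))
    return similarity > threshold  # deliberately loose: many stimuli can pass

# Coarse innate template: roughly "largish moving blob with face-like contrast",
# not "my parent" specifically.
innate_template = np.array([1.0, 0.8, 0.6, 0.1])

positive_affect = {}  # learned associations built on top of the proxy

def early_experience(stimulus_name: str, features: np.ndarray) -> None:
    """During the critical period, whatever trips the proxy gets tagged with positive affect."""
    if crude_proxy(features, innate_template):
        positive_affect[stimulus_name] = positive_affect.get(stimulus_name, 0.0) + 1.0

# A gosling's actual parent and a human caretaker both trip the loose proxy; a rock does not.
early_experience("parent goose", np.array([1.0, 0.9, 0.5, 0.2]))
early_experience("human caretaker", np.array([0.9, 0.7, 0.7, 0.3]))
early_experience("rock", np.array([0.1, 0.0, 0.2, 0.9]))

print(positive_affect)  # {'parent goose': 1.0, 'human caretaker': 1.0}
```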
Sunk cost, framing, and goal conflation smell weird to me in this list—like they’re the wrong type?
I think the “type” of the things Alex was listing is “features of human cognition that have mechanistic causes beyond ‘some weird evolution thing’”. I expect that these biases have some deeper explanation grounded in how the human learning process works, one which is not simply “evolution did it” or “a weird consequence of compute limits”. E.g., there’s some explanation that specifically predicts sunk cost / framing / goal conflation as the convergent consequences of the human learning process.
other goals at various levels of hierarchy, strength, and temporal extent get installed as we go
I think most high level goals / values are learned, and emerge from the interaction between simple, hard-coded reward circuitry and our environments. I don’t think most are directly installed by evolution. Even something like sexual desires learned later in life seems like it’s mostly due to time-dependent changes in reward circuitry (and possibly some hard-coded attention biases).
ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
I think our abstraction-manipulating machinery is mostly meta-learned (witness how different such machinery is across people). I don’t think evolution did anything special to make us robust to ontological shifts. Such robustness seems likely to be strongly convergent across many types of learning processes. IMO, the key is that learning systems don’t develop a single ontology, but instead something more like a continuous distribution over the different ontologies that the learned intelligence can deploy in different situations. Thus, values “learn” to generalize across different ontologies well before you learn that people are made of cells, and you usually don’t model other people as giant piles of cells / atoms / quantum fields anyway, because modeling them like that is usually pointless.
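As a cartoon of what I mean by values generalizing across ontologies (toy Python, purely illustrative; the “views” and the value function are made up for the example), a valuation learned against a coarse view of a situation is simply untouched when a finer-grained view gets added to the repertoire:

```python
# Toy sketch, purely illustrative: a valuation learned against a coarse view of a
# situation is untouched when a finer-grained view gets added to the repertoire.
# The "views" and the value function are made up for the example.

situation = {"name": "friend waving at me"}

# Two "ontologies" = two ways of rendering the same situation into features.
def person_level_view(s):
    return {"is_person": True, "is_friendly_gesture": "waving" in s["name"]}

def cell_level_view(s):
    return {"approx_cell_count": 3.7e13, "mostly_water": True}

# A value function learned back when only the person-level view existed.
def learned_value(features):
    return 1.0 if features.get("is_person") and features.get("is_friendly_gesture") else 0.0

views = {"person_level": person_level_view}
print(learned_value(views["person_level"](situation)))  # 1.0

# Later, a new ontology is learned ("people are made of cells").
views["cell_level"] = cell_level_view

# The old valuation is unaffected: the agent keeps deploying the view the value was
# learned against, and only reaches for the cell-level view where it actually helps.
print(learned_value(views["person_level"](situation)))  # still 1.0
```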
This response is really helpful, thank you! I take various of the points as uncontroversial[1], so I’ll respond mainly to those where I think you seem surprisingly confident (vs my own current epistemic position).
Alex and I both agree that the genome can influence learned behavior and concepts by exploiting its access to sensory ground truth… the imprinting circuitry is… imprecise
It seems like there are two salient hypotheses that can come out of the imprinting phenomenon, though (they seem to sort of depend on which direction you draw the arrows between different bits of the brain?):
1. Hard-coded proxies fire for the thing in question. (Maybe this also encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) Corresponding abstractions are highly correlated with the proxies, and this strong signal helps with symbol grounding. (The now-grounded ‘symbols’ then feed into whatever other circuitry.) Maybe decision-making is, at least partially, defined relative to these ‘symbols’.
2. Hard-coded proxies fire for the thing in question. (Maybe this also encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) These proxies wire directly to reward circuits, and there is runtime reinforcement learning. The runtime reinforcement learner generates corresponding abstractions because these are useful ‘features’ for reinforced behaviour, and decision-making is the product of reinforced behaviour.
Both of these seem like useful things to happen from the POV of natural selection, so I don’t see how to rule out either (and I tentatively expect both to be true). I think you and Alex are exploring hypothesis 2?
FWIW, I tentatively wonder whether, to the extent that human and animal decision-making fits something like an actor-critic or propose-promote deliberation framing, the actor/proposer might be more 2-ish and the critic/promoter might be more 1-ish.
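To gesture at the distinction in code (a toy sketch only, in Python; the proxy, the ‘symbol’, the update rule, and the caregiver scenario are all invented for illustration, and the 1-ish/2-ish split is exactly the thing being hypothesised rather than anything established): the promote/critic side scores proposals against a symbol grounded by a hard-coded proxy, while the propose/actor side is just a bundle of reinforced action tendencies.

```python
# Toy sketch only: a propose/promote (actor/critic-ish) loop in which the "promote"
# side scores actions against a symbol grounded by a hard-coded proxy (1-ish), while
# the "propose" side is just a bundle of reinforced action tendencies (2-ish).
# The proxy, scenario, and update rule are invented for illustration.

import random

random.seed(0)

ACTIONS = ["approach_caregiver", "wander_off"]

# Hypothesis-1-ish: a hard-coded proxy grounds a 'symbol' the critic scores against.
def proxy_symbols(observation):
    return {"caregiver_nearby": observation["caregiver_distance"] < 2.0}

def critic(symbols, action):
    # Promotion is (partly) defined relative to the grounded symbol.
    return 1.0 if symbols["caregiver_nearby"] and action == "approach_caregiver" else 0.0

# Hypothesis-2-ish: the actor's preferences are just reinforced action tendencies.
actor_weights = {a: 0.0 for a in ACTIONS}

def actor_propose():
    if random.random() < 0.2:               # a little exploration
        return random.choice(ACTIONS)
    return max(actor_weights, key=actor_weights.get)

for step in range(50):
    observation = {"caregiver_distance": random.uniform(0.0, 4.0)}
    action = actor_propose()
    score = critic(proxy_symbols(observation), action)
    actor_weights[action] += 0.1 * (score - actor_weights[action])  # reinforce what got promoted

print(actor_weights)  # 'approach_caregiver' typically ends up with the larger weight
```

In this cartoon, the ‘values’ you would read off the actor are pure products of reinforcement (2-ish), but the thing doing the reinforcing is defined relative to a grounded symbol (1-ish).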
there’s some explanation that specifically predicts sunk cost / framing / goal conflation as the convergent consequences of the human learning process.
We could probably dig further into each of these, but for now I’ll say: I don’t think these have in common a material/mechanical cause much lower than ‘the brain’ and I don’t think they have in common a moving cause much lower than ‘evolution did it’. Framing, like anchoring, seems like a straightforward consequence of ‘sensible’ computational shortcuts to make world modelling tractable (on any computer, not just a human brain).
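For anchoring in particular, the sort of thing I have in mind is just this (toy Python, numbers made up): any estimator that starts from whatever value happens to be salient and only has the budget for a few adjustment steps ends up biased toward its starting point, regardless of the hardware it runs on.

```python
# Toy sketch, purely illustrative: anchoring as a compute-limited shortcut. Start
# from whatever number is salient and adjust a fixed, small number of times; the
# bias toward the anchor falls out of the budget, not out of anything specifically
# human. The numbers are made up.

def bounded_estimate(anchor: float, true_value: float, steps: int = 3, rate: float = 0.3) -> float:
    """Adjust from the anchor toward the evidence, but only for a fixed budget of steps."""
    estimate = anchor
    for _ in range(steps):
        estimate += rate * (true_value - estimate)  # partial correction each step
    return estimate

true_value = 100.0
print(bounded_estimate(anchor=10.0, true_value=true_value))   # ~69: dragged low by the anchor
print(bounded_estimate(anchor=500.0, true_value=true_value))  # ~237: dragged high by the anchor
```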
I think most high level goals / values are learned… don’t think most are directly installed by evolution
I basically can’t evaluate whether I agree with this because I don’t know what ‘high level’ and ‘most’ mean. This isn’t intended as a rebuttal; this topic is in general hard to discuss with precision. I also find it disconcertingly hard to talk/think about high and low level goals in humans without bumping into ‘consciousness’ one way or another, and I really wish that were less of a mystery. I basically agree that the vast majority of what seem to pass for goals, at almost any level, are instrumental and generated at runtime. But is this supposed to be a surprise? I don’t think it is.
learning systems don’t develop a single ontology…
values “learn” to generalize across different ontologies well before you learn that people are made of cells
Seems uncontroversial to me. I think we’re on the same page when I said
ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
I don’t see any reason for supplementary abstractions to interfere with values, terminal or otherwise, resting on existing ontologies. (They can interfere enormously with new instrumental things, for epistemic reasons, of course.)
I note that sometimes people do have something that looks passingly similar to an ontological crisis. I don’t know what to make of this, except to note that people’s ‘most salient active goals’ are often instrumental goals expressed in one or another folk ontology, and subject to the very conflation we’ve agreed exists. So I suppose that if newly-installed abstractions are sufficiently incompatible with the existing world model, they can dislodge a lot of aggregate weight from the active goalset. A ‘healthy’ recovery from this sort of thing usually looks like someone identifying the in-fact-more-fundamental goals (which might putatively be the ones, or closer to the ones, installed by evolution; I don’t know).
Thanks again for this clarifying response, and I’m looking forward to more stuff from you and Alex and/or others in this area.
By the way, I get a sense of ‘controversy signalling’ from some of this ‘shard theory’ stuff. I don’t have a good way to describe this, but it seems to make it harder for me to engage, because I’m not sure what’s supposed to be new and for some reason I can’t really tell what I agree with. Cf. Richard’s comment. Please take this as a friendly note, because I understand you’ve had a hard time getting some people to engage constructively (Alex told me something to the effect of ‘most people slide off this’). I’m afraid I don’t have positive textual/presentational advice here beyond this footnote.