1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),
2. The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or
3. The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)
In my opinion, either (1) or (3) would be enormous news for AI alignment.
What do you mean by ‘enormous news for AI alignment’? That either of these would be surprising to people in the field? Or that resolving that trilemma would be useful to build from? Or something else?
FWIW from my POV the trilemma isn’t really one for me, because I agree that (2) is obviously not the case in principle (subject to enough research time!). And I further think it’s reasonably clear that both (1) and (3) are true in some measure. Granted, you say ‘at least one’ must be true, but I think the framing as a trilemma suggests you want to dismiss (1) - is that right?
I’ll bite those bullets (in devil’s advocate style)...
I think about half of your bullets are probably (1), except via rough proxies (power, scamming, family, status, maybe cheating)
why? One clue is that people have quite specific physiological responses to some of these things. Another is that various of these are characterised by different behaviour in different species.
why proxies? It stands to reason, like you’re pointing out here, it’s hard and expensive to specify things exactly. Further, lots of animal research demonstrates hardwired proxies pointing to runtime-learned concepts
Sunk cost, framing, and goal conflation smell weird to me in this list—like they’re the wrong type? I’m not sure what it would mean for these to be ‘detected’ and then the bias ‘implemented’. Rather I think they emerge from failure of imagination due to bounded compute.
in the case of goals I think that’s just how we’re implemented (it’s parsimonious)
with the possible exception of ‘conscious self approval’ as a differently-typed and differently-implemented sole terminal goal
other goals at various levels of hierarchy, strength, and temporal extent get installed as we go
ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
tentatively, I expect cells and atoms probably have similar representation to ghosts and spirits and numbers and ecosystems and whatnot—they’re just abstractions and we have machinery which forms and manipulates them
admittedly this machinery is basically magic to me at this point
wireheading and reality/non-reality are unclear to me and I’m looking forward to seeing where you go with it
I suspect all imagined circumstances (‘real’ or non-real) go via basically the same circuitry, and that ‘non-real’ is just an abstraction like ‘far away’ or ‘unlikely’
after all, any imagined circumstance is non-real to some extent
What do you mean by ‘enormous news for AI alignment’? That either of these would be surprising to people in the field? Or that resolving that trilemma would be useful to build from? Or something else?
More that it would be useful. I would be surprised by (1) and predict (3) is the case. I expect there are indirect handles that the genome uses to guide the human value formation process towards values that were adaptive in the ancestral environment, without directly having to solve information inaccessibility. I think we’ll be able to use similar methods to guide an AI’s value formation process towards our desired outcomes.
Alex and I both agree that the genome can influence learned behavior and concepts by exploiting its access to sensory ground truth. E.g., imprinting can work through simple circuitry that forms positive affect around certain clusters of visual features in an animal’s early environment. Typically, these feature clusters correspond to the animal’s parents, but note that the imprinting circuitry is so imprecise that even visual clusters very different from the species in question can still trigger imprinting (e.g., humans).
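(Purely as an illustrative toy, not a claim about actual neural circuitry: the kind of simple, imprecise proxy circuit described above might look something like the following sketch, in which every feature name and threshold is invented for the example.)

```python
# Toy sketch only: a hard-coded, imprecise "imprinting" proxy that latches onto
# whatever salient visual feature cluster appears during an early critical
# period and thereafter produces positive affect for similar clusters.
# Every feature name and threshold here is invented for illustration.
from typing import Optional


class ImprintingCircuit:
    def __init__(self, critical_period_steps: int = 100):
        self.critical_period_steps = critical_period_steps
        self.imprinted: Optional[dict] = None
        self.step = 0

    def observe(self, features: dict) -> float:
        """Return a crude positive-affect signal for an observed feature cluster."""
        self.step += 1
        # During the critical period, latch onto the first "large moving thing" seen.
        if self.imprinted is None and self.step <= self.critical_period_steps:
            if features.get("large", 0) > 0.5 and features.get("moving", 0) > 0.5:
                self.imprinted = dict(features)
        if self.imprinted is None:
            return 0.0
        # Similarity to the imprinted cluster determines the affect signal.
        shared = set(features) & set(self.imprinted)
        similarity = sum(1 - abs(features[k] - self.imprinted[k]) for k in shared)
        return similarity / max(len(shared), 1)


# Because the proxy is so coarse, a cluster very unlike the parent (a human
# keeper, say) still scores highly -- which is the imprecision noted above.
circuit = ImprintingCircuit()
print(circuit.observe({"large": 0.9, "moving": 0.8, "feathered": 1.0}))  # ~1.0
print(circuit.observe({"large": 1.0, "moving": 0.9, "feathered": 0.0}))  # ~0.6
```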
Sunk cost, framing, and goal conflation smell weird to me in this list—like they’re the wrong type?
I think the “type” of the things Alex was listing is “features of human cognition that have mechanistic causes beyond ‘some weird evolution thing’”. I expect that these biases occur due to some deeper explanation grounded in how the human learning process works, which is not simply “evolution did it” or “weird consequence of compute limits”. E.g., there’s some explanation that specifically predicts sunk cost / framing / goal conflation as the convergent consequences of the human learning process.
other goals at various levels of hierarchy, strength, and temporal extent get installed as we go
I think most high level goals / values are learned, and emerge from the interaction between simple, hard-coded reward circuitry and our environments. I don’t think most are directly installed by evolution. Even something like sexual desires learned later in life seems like it’s mostly due to time-dependent changes in reward circuitry (and possibly some hard-coded attention biases).
ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
I think our abstraction-manipulating machinery is mostly meta-learned (witness how different such machinery is across people). I don’t think evolution did anything special to make us robust to ontological shifts. Such robustness seems likely to be strongly convergent across many types of learning processes. IMO, the key is that learning systems don’t develop a single ontology, but instead something more like a continuous distribution over the different ontologies that the learned intelligence can deploy in different situations. Thus, values “learn” to generalize across different ontologies well before you learn that people are made of cells, and you usually don’t model other people as giant piles of cells / atoms / quantum fields anyway, because modeling them like that is usually pointless.
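(As a toy illustration of that last point, with every name invented and no claim that brains work this way: if a value binds to a learned abstraction rather than to the raw representation underneath it, swapping in a finer-grained ontology need not disturb the value at all.)

```python
# Toy illustration: a value attached to a learned abstraction keeps working when
# the world model acquires a finer-grained ontology, because the abstraction
# layer absorbs the shift. All names are invented; this is not a brain model.

def is_dog_object_level(entity: dict) -> bool:
    # Earlier ontology: entities are opaque objects with a "kind" label.
    return entity.get("kind") == "dog"

def is_dog_cell_level(entity: dict) -> bool:
    # Later ontology: entities are described as collections of cells, but the
    # abstraction "dog" is still recoverable from the finer description.
    return entity.get("cells", {}).get("genome") == "canine"

def value(entity: dict, dog_abstraction) -> float:
    # The value is defined over the abstraction, not the raw representation.
    return 10.0 if dog_abstraction(entity) else 0.0

fido_as_object = {"kind": "dog"}
fido_as_cells = {"cells": {"genome": "canine", "count": 10**12}}

print(value(fido_as_object, is_dog_object_level))  # 10.0
print(value(fido_as_cells, is_dog_cell_level))     # 10.0 -- same verdict, new ontology
```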
This response is really helpful, thank you! I take various of the points as uncontroversial[1], so I’ll respond mainly to those where I think you seem surprisingly confident (vs my own current epistemic position).
Alex and I both agree that the genome can influence learned behavior and concepts by exploiting its access to sensory ground truth… the imprinting circuitry is… imprecise
It seems like there are two salient hypotheses that can come out of the imprinting phenomenon, though (they seem to sort of depend on what direction you draw the arrows between different bits of brain?):
1. Hard-coded proxies fire for the thing in question. (Maybe also this encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) Corresponding abstractions are highly correlated with the proxies, and this strong signal helps with symbol grounding. (And the now-grounded ‘symbols’ feed into whatever other circuitry.) Maybe decision-making is—at least partially—defined relative to these ‘symbols’.
2. Hard-coded proxies fire for the thing in question. (Maybe also this encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) These proxies directly wire to reward circuits. There is runtime reinforcement learning. The runtime reinforcement learner generates corresponding abstractions because these are useful ‘features’ for reinforced behaviour. Decision-making is the product of reinforced behaviour.
Both of these seem like useful things to happen from the POV of natural selection, so I don’t see how to rule out either (and tentatively expect both to be true). I think you and Alex are exploring hypothesis 2?
FWIW, I tentatively wonder whether, to the extent that human and animal decision-making fits something like an actor-critic or a propose-promote deliberation framing, the actor/propose might be more 2-ish and the critic/promote might be more 1-ish (see the rough sketch below).
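(For concreteness, here is a very rough sketch of the contrast between the two hypotheses, with structure, names, and update rules all invented purely for illustration and no claim that either matches real circuitry.)

```python
# Rough, purely illustrative sketch of the two hypotheses above.

def hardcoded_proxy(obs: dict) -> float:
    # Crude genomically-specified detector ("large moving thing nearby").
    return 1.0 if obs.get("large", 0) > 0.5 and obs.get("moving", 0) > 0.5 else 0.0

# Hypothesis 1: the proxy acts as a grounding signal. A learned concept is
# trained to agree with the proxy, and decision-making is then (at least
# partly) defined over that grounded concept.
def hypothesis_1_update(concept_weights: dict, obs: dict, lr: float = 0.1) -> dict:
    target = hardcoded_proxy(obs)                      # proxy supplies ground truth
    prediction = sum(concept_weights.get(k, 0.0) * v for k, v in obs.items())
    error = target - prediction
    for k, v in obs.items():                           # nudge the concept toward the proxy
        concept_weights[k] = concept_weights.get(k, 0.0) + lr * error * v
    return concept_weights

# Hypothesis 2: the proxy is wired straight into reward. A policy is shaped by
# reinforcement, and concepts emerge only as features useful for predicting reward.
def hypothesis_2_update(action_values: dict, obs: dict, action: str, lr: float = 0.1) -> dict:
    reward = hardcoded_proxy(obs)                      # proxy supplies reward
    old = action_values.get(action, 0.0)
    action_values[action] = old + lr * (reward - old)
    return action_values

obs = {"large": 1.0, "moving": 1.0}
print(hypothesis_1_update({}, obs))               # concept weights drift toward the proxy
print(hypothesis_2_update({}, obs, "approach"))   # 'approach' is reinforced by proxy reward
```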
there’s some explanation that specifically predicts sunk cost / framing / goal conflation as the convergent consequences of the human learning process.
We could probably dig further into each of these, but for now I’ll say: I don’t think these have in common a material/mechanical cause much lower than ‘the brain’ and I don’t think they have in common a moving cause much lower than ‘evolution did it’. Framing, like anchoring, seems like a straightforward consequence of ‘sensible’ computational shortcuts to make world modelling tractable (on any computer, not just a human brain).
I think most high level goals / values are learned… don’t think most are directly installed by evolution
I basically can’t evaluate whether I agree with this because I don’t know what ‘high level’ and ‘most’ mean. This isn’t intended as a rebuttal; this topic is in general hard to discuss with precision. I also find it disconcertingly hard to talk/think about high- and low-level goals in humans without bumping into ‘consciousness’ one way or another, and I really wish that was less of a mystery. I basically agree that the vast majority of what seem to pass for goals at almost any level are basically instrumental and generated at runtime. But is this supposed to be a surprise? I don’t think it is.
learning systems don’t develop a single ontology…
values “learn” to generalize across different ontologies well before you learn that people are made of cells
Seems uncontroversial to me. I think we’re on the same page when I said
ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions
I don’t see any reason for supplementary abstractions to interfere with values, terminal or otherwise, resting on existing ontologies. (They can interfere enormously with new instrumental things, for epistemic reasons, of course.)
I note that sometimes people do have what looks passingly similar to ontological crises. I don’t know what to make of this, except by noting that people’s ‘most salient active goals’ are often instrumental goals expressed in one or another folk ontology and subject to the very conflation we’ve agreed exists; so I suppose that if newly-installed abstractions are sufficiently incompatible with the world model, they can dislodge a lot of aggregate weight from the active goalset. A ‘healthy’ recovery from this sort of thing usually looks like someone identifying the in-fact-more-fundamental goals (which might putatively be the ones, or closer to the ones, installed by evolution; I don’t know).
Thanks again for this clarifying response, and I’m looking forward to more stuff from you and Alex and/or others in this area.
[1] By the way, I get a sense of ‘controversy signalling’ from some of this ‘shard theory’ stuff. I don’t have a good way to describe this, but it seems to make it harder for me to engage because I’m not sure what’s supposed to be new, and for some reason I can’t really tell what I agree with. Cf. Richard’s comment. Please take this as a friendly note, because I understand you’ve had a hard time getting some people to engage constructively (Alex told me something to the effect of ‘most people slide off this’). I’m afraid I don’t have positive textual/presentational advice here beyond this footnote.
That either of these would be surprising to people in the field? Or that resolving that trilemma would be useful to build from?
Both.
I think the framing as a trilemma suggests you want to dismiss (1) - is that right?
Yup!
I perceive many of your points as not really grappling with the key arguments in the post, so I’ll step through them. My remarks may come off as aggressive, and I do not mean them as such. I have not yet gained the skill of disagreeing frankly and bluntly without seeming chilly, so I will preface this comment with goodwill!
I think about half of your bullets are probably (1), except via rough proxies (power, scamming, family, status, maybe cheating)
I think that you’re saying “rough proxies” and then imagining it solved, somehow, but I don’t see that step?
Whenever I try to imagine a “proxy”, I get stuck. What, specifically, could the proxy be, such that it will actually reliably entangle itself with the target learned concept (e.g. “someone’s cheating me”), and such that the imagined proxy explains why people care so robustly about punishing cheaters? Whenever I generate candidate proxies (e.g. detecting physiological anger, or just scanning the brain somehow), the scheme seems pretty implausible to me.
Do you disagree?
One clue is that people have quite specific physiological responses to some of these things. Another is that various of these are characterised by different behaviour in different species.
I don’t presently see why “a physiological response is produced” is more likely to come out true in worlds where the genome solves information inaccessibility, than in worlds where it doesn’t.
why proxies? It stands to reason, like you’re pointing out here, it’s hard and expensive to specify things exactly. Further, lots of animal research demonstrates hardwired proxies pointing to runtime-learned concepts
Note that all of the imprinting examples rely on direct sensory observables. So this is not (1) (“information inaccessibility is solved by the genome”): these imprinting examples aren’t inaccessible to begin with.
(Except “limbic imprinting”, I can’t make heads or tails of that one. I couldn’t quickly understand what a concrete example would be after skimming a few resources.)
Rather I think they emerge from failure of imagination due to bounded compute.
My first pass is “I don’t feel less confused after reading this potential explanation.” In more detail: “bounded compute” a priori predicts many possible observations; AFAICT it does not concentrate probability onto specific observed biases (like sunk cost or the framing effect). Rather, “bounded compute” can, on its own, explain a vast range of behavior. Since AFAICT this explanation assigns relatively low probability to the observed data, it loses tons of probability mass compared to other hypotheses which more strongly predict the data.
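(A toy calculation of that probability-mass point, with every number invented just to show the shape of the argument.)

```python
# Toy numbers only: a vague hypothesis that is compatible with almost any of
# many possible bias patterns spreads its predictive mass thinly, so it loses
# posterior probability to a hypothesis that specifically predicts what we see.

n_possible_patterns = 100                       # distinct bias patterns we might have observed
p_data_given_vague = 1.0 / n_possible_patterns  # "bounded compute" predicts them all weakly
p_data_given_specific = 0.5                     # a specific mechanism concentrates its mass

prior_vague = prior_specific = 0.5              # equal priors, purely for illustration

evidence = prior_vague * p_data_given_vague + prior_specific * p_data_given_specific
posterior_vague = prior_vague * p_data_given_vague / evidence
posterior_specific = prior_specific * p_data_given_specific / evidence

print(round(posterior_vague, 3))      # ~0.02
print(round(posterior_specific, 3))   # ~0.98
```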
ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions… they’re just abstractions and we have machinery which forms and manipulates them
This machinery is also presently magic to me. But your quoted portion doesn’t (to my eyes) explain how ontological shifts get handled; this hypothesis seems (to me) to basically be “somehow it happens.” But it, of course, has to happen somehow, by some set of specific mechanisms, and I’m saying that the genome probably isn’t hardcoding those mechanisms (resolution (1)), that the genome is not specifying algorithms by which we can e.g. still love dogs after learning they are made of cells.
Not just because it sounds weird to me. I think it’s just really really hard to pull off, for the same reasons it seems hard to write a priori code which manages ontological shifts for big ML models trained online. Where would one begin? Why should code like that exist, in generality across possible models?