Human values & biases are inaccessible to the genome
Related to Steve Byrnes’ “Social instincts are tricky because of the ‘symbol grounding problem’.” I wouldn’t have had this insight without several great discussions with Quintin Pope.
TL;DR: It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, I infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode.
In order to understand the human alignment situation confronted by the human genome, consider the AI alignment situation confronted by human civilization. For example, we may want to train a smart AI which learns a sophisticated world model, and then motivate that AI according to its learned world model. Suppose we want to build an AI which intrinsically values trees. Perhaps we can just provide a utility function that queries the learned world model and counts how many trees the AI believes there are.
Suppose that the AI will learn a reasonably human-like concept for “tree.” However, before training has begun, the learned world model is inaccessible to us. Perhaps the learned world model will be buried deep within a recurrent policy network, and buried within the world model is the “trees” concept. But we have no idea what learned circuits will encode that concept, or how the information will be encoded. We probably can’t, in advance of training the AI, write an algorithm which will examine the policy network’s hidden state and reliably back out how many trees the AI thinks there are. The AI’s learned concept for “tree” is inaccessible information from our perspective.
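To make the obstacle concrete, here is a minimal hypothetical sketch in Python (the names estimated_tree_count, hidden_state, and probe_weights are all invented for illustration). The only way to get meaningful probe weights is to fit them after training, by correlating hidden states with labeled situations; nothing we could write down before training reliably decodes the tree count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained policy network's hidden state. We don't know which
# dimensions (if any) encode "tree", nor how a count would be represented
# (linearly, sparsely, distributed across many units, ...).
hidden_state = rng.normal(size=512)


def estimated_tree_count(hidden_state, probe_weights, probe_bias=0.0):
    """A linear probe that tries to read off 'how many trees does the AI
    believe there are?'. Useful probe_weights can only be fit *after*
    training, by correlating hidden states with labeled situations; they
    cannot be written down in advance of training."""
    return float(hidden_state @ probe_weights + probe_bias)


# Before training, the best we can do is guess the weights, which yields noise.
guessed_weights = rng.normal(size=512)
print(estimated_tree_count(hidden_state, guessed_weights))
```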
Likewise, the human world model is inaccessible to the human genome, because the world model is probably in the cortex and the cortex is probably randomly initialized.[1] Learned human concepts are therefore inaccessible to the genome, in the same way that the “tree” concept is a priori inaccessible to us. Even the broad area where language processing occurs varies from person to person, to say nothing of the encodings and addresses of particular learned concepts like “death.”
I’m going to say things like “the genome cannot specify circuitry which detects when a person is thinking about death.” This means that the genome cannot hardcode circuitry which e.g. fires when the person is thinking about death, and does not fire when the person is not thinking about death. The genome does help indirectly specify the whole adult brain and all its concepts, just like we indirectly specify the trained neural network via the training algorithm and the dataset. That doesn’t mean we can tell when the AI thinks about trees, and it doesn’t mean that the genome can “tell” when the human thinks about death.
When I’d previously thought about human biases (like the sunk cost fallacy) or values (like caring about other people), I had implicitly imagined that genetic influences could directly affect them (e.g. by detecting when I think about helping my friends, and then producing reward). However, given the inaccessibility obstacle, I infer that this can’t be the explanation. I infer that the genome cannot directly specify circuitry which:
Detects when you’re thinking about seeking power,
Detects when you’re thinking about cheating on your partner,
Detects whether you perceive a sunk cost,
Detects whether you think someone is scamming you and, if so, makes you want to punish them,
Detects whether a decision involves probabilities and, if so, implements the framing effect,
Detects whether you’re thinking about your family,
Detects whether you’re thinking about goals, and makes you conflate terminal and instrumental goals,
Detects and then navigates ontological shifts,
E.g. Suppose you learn that animals are made out of cells. I infer that the genome cannot detect that you are expanding your ontology, and then execute some genetically hard-coded algorithm which helps you do that successfully.
Detects when you’re thinking about wireheading yourself or manipulating your reward signals,
Detects when you’re thinking about reality versus non-reality (like a simulation or fictional world), or
Detects whether you think someone is higher-status than you.
Conversely, the genome can access direct sensory observables, because those observables involve a priori-fixed “neural addresses.” For example, the genome could hardwire a cute-face-detector which hooks up to retinal ganglion cells (which are at genome-predictable addresses), and then this circuit could produce physiological reactions (like the release of reward). Specifying this kind of circuit seems entirely feasible to me, because it only needs to read from fixed neural addresses.
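As a toy illustration of the contrast (all names, dimensions, and weights below are hypothetical), a detector wired to fixed sensory addresses can be written down before any learning occurs, while a detector for a learned concept cannot:

```python
import numpy as np

rng = np.random.default_rng(0)

SENSORY_DIM = 256
retina = rng.random(SENSORY_DIM)  # activations at genome-predictable addresses

# A hardwired detector only reads fixed input indices, so it can be specified
# before any learning happens (the particular indices and threshold are arbitrary).
CUTE_FACE_WEIGHTS = np.zeros(SENSORY_DIM)
CUTE_FACE_WEIGHTS[10:20] = 1.0  # "wire up to these particular retinal cells"


def hardwired_cute_face_detector(retina):
    return bool(retina @ CUTE_FACE_WEIGHTS > 5.0)


# By contrast, a learned concept like "death" lives somewhere inside a large
# learned state, at an address and in an encoding that differ across individuals,
# so no analogous fixed weight vector could be hardcoded in advance.
learned_cortical_state = rng.normal(size=10_000)

print(hardwired_cute_face_detector(retina))
```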
In total, information inaccessibility is strong evidence that the genome hardcodes relatively simple[2] cognitive machinery. This, in turn, implies that human values, biases, and other high-level cognitive observables are produced by relatively simple hardcoded circuitry which specifies e.g. the learning architecture, the broad reinforcement learning and self-supervised learning systems in the brain, and regional learning hyperparameters. Whereas before it seemed plausible to me that the genome hardcoded a lot of the above bullet points, I now think that’s pretty implausible.
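As a loose analogy for the “simple hardcoded machinery” claim (my framing, with invented keys and numbers), the genome is more like a short experiment configuration than like the trained weights that the configuration produces:

```python
# A short, human-writable "config", analogous to the genome specifying the
# learning architecture, the broad RL and self-supervised systems, regional
# hyperparameters, and a few hardwired sensory-level reflexes.
genome_like_config = {
    "architecture": {"recurrent": True, "hidden_size": 4096},
    "learning_systems": ["reinforcement", "self_supervised"],
    "regional_hyperparameters": {"V1": {"lr": 3e-4}, "PFC": {"lr": 1e-4}},
    "hardwired_reflexes": ["cute_face_reward", "orient_to_loud_sound"],
}

# The trained model, by contrast, has vastly more learned parameters, and the
# config says nothing about which of them will end up encoding "death" or "tree".
approx_learned_parameters = 24 * 4096 * 4096
print(len(str(genome_like_config)), "characters of config vs",
      approx_learned_parameters, "learned parameters")
```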
When I realized that the genome must also confront the information inaccessibility obstacle, this threw into question a lot of my beliefs about human values, about the complexity of human value formation, and about the structure of my own mind. I was left with a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t they want to wirehead, and why do they almost always care about other people, if the genome can’t even write circuitry that detects and rewards thoughts about people?
A fascinating mystery, no? More on that soon.
Thanks to Adam Shimi, Steve Byrnes, Quintin Pope, Charles Foster, Logan Smith, Scott Viteri, and Robert Mastragostino for feedback.
Appendix: The inaccessibility trilemma
The logical structure of this essay is that at least one of the following must be true:
Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),
The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or
The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)
In my opinion, either (1) or (3) would be enormous news for AI alignment. More on (3)’s importance in future essays.
Appendix: Did evolution have advantages in solving the information inaccessibility problem?
Yes, and no. In a sense, evolution had “a lot of tries” but is “dumb”, whereas we have very few tries at AGI but are ourselves able to do consequentialist planning.
In the AI alignment problem, we want to be able to back out an AGI’s concepts, but we cannot run lots of similar AGIs and select for AGIs with certain effects on the world. Given the natural abstractions hypothesis, maybe there’s a lattice of convergent abstractions—first learn edge detectors, then shape detectors, then people being visually detectable in part as compositions of shapes. And maybe, for example, people tend to convergently situate these abstractions in similar relative neural locations: The edge detectors go in V1, then the shape detectors are almost always in some other location, and then the person-concept circuitry is learned elsewhere in a convergently reliable relative position to the edge and shape detectors.
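Here is a toy sketch of what such a location-checking circuit would amount to (the offset and threshold are invented); this is the story that the next paragraph argues against:

```python
import numpy as np

# Per the convergent-relative-location story: edge detectors land in a fixed
# "V1-like" region, and the person-concept supposedly lands at a predictable
# offset from them (the offset and threshold here are invented).
PERSON_CONCEPT_OFFSET = 7_000


def hardcoded_person_thought_detector(brain_state):
    """Fires only if activity at the assumed person-concept address is high,
    i.e. only if the concept really did land at that address."""
    return bool(brain_state[PERSON_CONCEPT_OFFSET] > 0.5)


brain_state = np.random.default_rng(0).normal(size=10_000)
print(hardcoded_person_thought_detector(brain_state))
```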
But there’s a problem with this story. A congenitally blind person develops dramatically different functional areas, which suggests in particular that their person-concept will sit at a radically different relative position than the convergent person-concept location in sighted individuals. Therefore, any genetically hardcoded circuit which checks the relative address where the person-concept reliably sits in sighted people will look at the wrong address in congenitally blind people. If this story were true, congenitally blind people would therefore lose any important value-formation effects ensured by this location-checking circuit for detecting when they’re thinking about people. So, either the location-checking circuit wasn’t an important cause of the blind person caring about other people (in which case it hasn’t answered the question we wanted it to answer, namely how people come to care about other people), or there isn’t such a circuit to begin with. I think the latter is true, and the convergent-relative-location story is wrong.
But the location-checking circuit is only one way a person-concept detector could be implemented. There are other possibilities. Therefore, given enough selection and time, maybe evolution could evolve a circuit which checks whether you’re thinking about other people. Maybe. But it seems implausible to me. I’m going to prioritize explanations for “most people care about other people” which don’t require a fancy workaround.
EDIT: After talking with Richard Ngo, I now think there’s about an 8% chance that several interesting mental events are accessed by the genome; I updated upwards from 4%.
EDIT 8/29/22: Updating down to 3%, in part due to 1950s arguments on ethology:
How do we want to explain the origins of behavior? And [Lehrman’s] critique seems to echo some of the concerns with evolutionary psychology. His approach can be gleaned from his example on the pecking behavior of chicks. Lorenz attributed this behavior to innate forces: The chicks are born with the tendency to peck; it might require just a bit of maturation. Lehrman points out that research by Kuo provides an explanation based on the embryonic development of the chick. The pecking behavior can actually be traced back to movements that developed while the chick was still unhatched. Hardly innate! The main point Lehrman makes: If we claim that something is innate, we stop the scientific investigation without fully understanding the origin of the behavior. This leaves out important – and fascinating – parts of the explanation because we think we’ve answered the question. As he puts it: “the statement ‘It is innate’ adds nothing to an understanding of the developmental process involved.”
— Lehrman on Lorenz’s Theory of Instinctive Behavior, blog comment (emphasis added)
[1] Human values can still be inaccessible to the genome even if the cortex isn’t learned from scratch, but learning-from-scratch is a nice and clean sufficient condition which seems likely to me.
[2] I argue that the genome probably hardcodes neural circuitry which is simple relative to what hardcoded “high-status detector” circuitry would require. Similarly, the code for a machine learning experiment is simple relative to the neural network it trains.
The post is influential, but it makes multiple somewhat confused claims and has led many people to become confused.
The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing the cortex, and did the obvious thing to ‘align’ the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to ‘access’ a lot of evolutionarily relevant beliefs and mental models. The trick is that the higher-level models (those more distant from the genome) are learned in part to predict interoceptive inputs (which track the evolutionarily older reward circuitry), so they are bound by default, and there isn’t much independent content left to ‘bind’. Anyone can check this… just thinking about a dangerous-looking person with a weapon activates the older, body-based fear/fight chemical regulatory circuits ⇒ the active inference machinery learned this and plans actions to avoid these states.
Agreed. This post would have been strengthened by discussing this consideration more.
Do we… know this? Is this actually a known “fact”? I expect some of this to be happening, but I don’t necessarily know or believe that the genome can access “evolutionary relevant mental models.” That’s the whole thing I’m debating in this post.
It seems reasonable to suspect the genome has more access than supposed in this post, but I don’t know evidence by which one can be confident that it does have meaningful access to abstract concepts. Do you know of such evidence?