I don’t think that “evolution → human values” is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn’t directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human’s learning process + reward circuitry configuration + the human’s environment, you screen off the influence of evolution on that human’s goals. So, there are really two areas from which you can draw evidence about inner (mis)alignment:
“evolution’s inclusive genetic fitness criteria → a human’s learned values” (as mediated by evolution’s influence over the human’s learning process + reward circuitry)
“a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values”
The relationship we want to make inferences about is:
“a particular AI’s learning process + reward function + training environment → the AI’s learned values”
I think that “AI learning → AI values” is much more similar to “human learning → human values” than it is to “evolution → human values”. I grant that you can find various dissimilarities between “AI learning → AI values” and “human learning → human values”. However, I think there are greater dissimilarities between “AI learning → AI values” and “evolution → human values”. As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the “human learning → human values” analogy, not the “evolution → human values” analogy.
Additionally, I think we have a lot more total empirical evidence from “human learning → human values” compared to from “evolution → human values”. There are billions of instances of humans, and each of them has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once[1]. Thus, evidence from “human learning → human values” should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
I will grant that the variations between different humans’ learning processes / reward circuit configurations / learning environments are “sampling” over a small and restricted portion of the space of possible optimization process trajectories. This limits the strength of any conclusions we can draw from looking at the relationship between human values and human rewards / learning environments. However, I again hold that inferences from “evolution → human values” suffer from an even more extreme version of this same issue. “Evolution → human values” represents an even more restricted look at the general space of optimization process trajectories than we get from the observed variations in different humans’ learning processes / reward circuit configurations / learning environments.
There are many sources of empirical evidence that can inform our intuitions regarding how inner goals relate to outer optimization criteria. My current (not very deeply considered) estimate of how to weight these evidence sources is roughly:
~66% from “human learning → human values”
~4% from “evolution → human values”[2]
~30% from various other evidence sources, which I won’t address further in this comment, on inner goals versus outer criteria:
economics
microbial ecology
politics
current results in machine learning
game theory / multi-agent negotiation dynamics
I think that using “human learning → human values” as our reference class for inner goals versus outer optimization criteria suggests a much more straightforward relationship between the two, as compared to the (lack of a) relationship suggested by “evolution → human values”. Looking at the learning trajectories of individual humans, it seems like the reflectively endorsed extrapolations of a given person’s values have a great deal in common with the sorts of experiences they’ve found rewarding in their lives up to that point in time. E.g., a person who grew up with and displayed affection for dogs probably doesn’t want a future totally devoid of dogs, or one in which dogs suffer greatly.
I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence. And I think this is very robust to the degree of capabilities you give the human. It’s probably not as robust to your choice of which specific human to try this with. E.g., many people would screw themselves over with reckless self-modification, given the capability to do so. My point is that higher capabilities alone do not automatically render inner values completely alien to those demonstrated at lower capabilities.
You can, of course, try to look at how population genetics relate to learned values to try to get more data from the “evolution → human values” reference class, but I think most genetic influences on values are mediated by differences in reward circuitry or environmental correlates of genetic variation. So such an investigation probably ends up mostly redundant in light of how the “human learning → human values” dynamics work out. I don’t know how you’d try and back out a useful inference about general inner versus outer relationships (independent from the “human learning → human values” dynamics) from that mess. In practice, I think the first order evidence from “human learning → human values” still dominates any evolution-specific inferences you can make here.
Even given the arguments in this comment, putting such a low weight on “evolution → human values” might seem extreme, but I have an additional reason, originally identified by Alex Turner, for further down weighting the evidence from “evolution → human values”. See this document on shard theory and search for “homo inclusive-genetic-fitness-maximus”.
But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn’t change? And we would need to find these somehow for AI and that process looks much more like evolution. And prediction of (non-instrumental) inner values is not robust across different reward functions—dogs only work because we already implemented environment-invariant compassion in reward circuitry.
But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn’t change?
Congenitally blind people end up with human values, despite that:
They’re missing entire chunks of vision-related hard coded rewards.
The entire visual cortex has been repurposed for other goals.
Evolution probably could not have “patched” the value formation process of blind people in the ancestral environment due to the massive fitness disadvantage blindness confers.
Human value formation can’t be that sensitive to delicate parameters of the learning process or reward circuitry.
And we would need to find these somehow for AI and that process looks much more like evolution.
We could learn a reward model from human judgements, train on human judgements directly, finetune a language model, etc. There are many options here.
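For concreteness, here is a minimal sketch of the first of those options: fitting a reward model on pairwise human preference judgements with a Bradley–Terry style loss, as in standard RLHF setups. Everything here (the `RewardModel` architecture, the synthetic “judgements”) is my own illustrative stand-in, not something specified in this comment.

```python
# Minimal sketch (illustrative assumptions throughout): fit a reward model on
# pairwise human preference judgements with a Bradley-Terry style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size feature vector to a scalar reward."""
    def __init__(self, feature_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic stand-in for human judgements: the "chosen" item in each pair is
# whichever one scores higher under a hidden preference vector.
torch.manual_seed(0)
a, b = torch.randn(256, 16), torch.randn(256, 16)
hidden_pref = torch.randn(16)
a_preferred = (a @ hidden_pref > b @ hidden_pref).unsqueeze(-1)
chosen, rejected = torch.where(a_preferred, a, b), torch.where(a_preferred, b, a)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
# The trained model is the analogue of "reward circuitry" an AI would then learn under.
```

Training on human judgements directly or finetuning a language model would swap out the model and data, but the inner/outer question at issue here, i.e. what values the downstream learner forms under that reward signal, stays the same.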
And prediction of (non-instrumental) inner values is not robust across different reward functions—dogs only work because we already implemented environment-invariant compassion in reward circuitry.
I don’t agree. If you slightly increase the strength of the reward circuits that rewarded the person for interacting with dogs, you get someone who likes dogs a bit more, not someone who wants to tile the universe with tiny molecular dog faces.
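As a toy illustration of that intuition (my own sketch; the activities and reward numbers are made up, and this is not meant as a model of actual human reward circuitry), here is a simple softmax value learner with one reward channel scaled up by 10%:

```python
# Toy sketch: scale one reward channel slightly and see how a simple softmax
# value learner's behaviour shifts. Activities and numbers are invented.
import numpy as np

def learned_allocation(dog_reward_scale: float, steps: int = 20000, temp: float = 1.0,
                       lr: float = 0.05, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Average reward per activity; only the "play with dog" channel gets scaled.
    base_rewards = np.array([1.0 * dog_reward_scale,  # play with dog
                             1.2,                      # see friends
                             0.9,                      # read
                             1.1])                     # exercise
    values = np.zeros(4)  # learned value estimates
    for _ in range(steps):
        probs = np.exp(values / temp)
        probs /= probs.sum()
        a = rng.choice(4, p=probs)              # softmax action selection
        reward = base_rewards[a] + rng.normal(0, 0.1)
        values[a] += lr * (reward - values[a])  # incremental value update
    probs = np.exp(values / temp)
    return probs / probs.sum()

print("baseline dog reward:     ", learned_allocation(1.0).round(2))
print("10% stronger dog reward: ", learned_allocation(1.1).round(2))
# The share of behaviour allocated to the dog activity rises modestly;
# it does not jump to ~1.0.
```

At least in this crude picture, a small change in reward strength shows up as a small shift in learned behaviour rather than a discontinuous jump to single-minded dog maximization.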
Also, reward circuitry does not, and cannot implement compassion directly. It can only reward you for taking actions that were probably driven by compassion. This is a very dumb approach that nevertheless actually literally works in actual reality, so the problem can’t be that hard.
The most important claim in your comment is that “human learning → human values” is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective. Here’s why I disagree:
Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this.
A human’s environment optimizes a human continually towards a certain objective (one that changes given changes in the environment). This human is aligned with the environment’s objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment.
An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.
An inner misalignment example of this is a gay male who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn’t true). After a few extremely negative reactions to him opening up to people, or expressing his desires, he’ll simply decide to present himself as heterosexual and bide his time and gather the power to leave the environment he is in.
One may claim that the previous example somehow doesn’t count because one’s sexual orientation is biologically determined (and I’m assuming this to be the case for this example, even if this may not be entirely true), which means that evolution optimized this particular human for being inner misaligned relative to their environment. However, that doesn’t weaken this argument: “human learning → human values” shows a huge amount of evidence of inner misalignment being ubiquitous.
I worry you are being insufficiently pessimistic.
There may not be substantial disagreements here. Do you agree with:
“a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values” (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)
The most important claim in your comment is that “human learning → human values” is evidence that inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective. Here’s why I disagree:
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given. See:
I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence.
This matches my intuitions.
Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”
What I see is that we are taking two different optimizers applying optimization pressure to a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than the other. This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.
I assume you mean that Quintin seems to claim that inner values learned may be retained with increase in capabilities, and that usually people believe that inner values learned may not be retained with increase in capabilities. I believe so too—inner values seem to be significantly robust to increase in capabilities, especially since one has the option to deceive. Do people really believe that inner values learned don’t scale with an increase in capabilities? Perhaps we are defining inner values differently here.
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with increase in capabilities, people’s inner values shift? Not exactly; it seems to me that we were mistaken about people’s inner values instead.
This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you’re ignoring the [now bolded part] in “a particular human’s learning process + reward circuitry + “training” environment” and just focusing on the environment. Humans very often don’t optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn’t press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand, which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you’d say the wirehead thing is an extreme case OOD?
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.
I agree, but I’m bolding “most people” because you’re claiming there exist some people that would retain that value if scaled up(?) I think replace “dog-lover” w/ “family-lover” and there’s even more people. But I don’t think this is a disagreement between us?
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping). Human values are formed by inner-misalignment, and they have lots of great properties, such as avoiding ontological crises, valuing real-world things (like the diamond maximizer in the OP), and, for a subset of them, caring for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop a theory of how learning systems develop values, which includes AGI.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping).
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
Human values are formed by inner-misalignment, and they have lots of great properties, such as avoiding ontological crises, valuing real-world things (like the diamond maximizer in the OP), and, for a subset of them, caring for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop a theory of how learning systems develop values, which includes AGI.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with those of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I’ve updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I’m still confused about huge parts of it, but we can discuss it more elsewhere.
The most important claim in your comment is that “human learning → human values” is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective.
That’s not a claim I made in my comment. It’s technically a claim I agree with, but not one I think is particularly important. Humans do seem better aligned to getting reward across distributional shifts than to achieving inclusive genetic fitness across distributional shifts. However, I’ll freely agree with you that humans are typically misaligned with maximizing the reward from their outer objectives.
I operationalize this as: “After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they’d acquired from their learning environment which are misaligned to reward maximization in the new environment”. Please let me know if you disagree with my operationalization.
For example, one way in which humans are inner misaligned is that, if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but has other consequences that are bad by the lights of the human’s preexisting values (e.g., Logan’s example of killing everyone else), most humans won’t push the button.
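A bare-bones way to make that operationalization concrete (my own toy, and the heavy lifting is done by the simplifying assumption that the agent keeps acting on its previously learned values rather than re-exploring): train a trivial value learner in one environment, then drop it into a new environment that contains a wirehead button.

```python
# Toy sketch (my own, heavily simplified): an agent learns values in one
# environment, then is moved to a new environment where a "wirehead button"
# maximizes reward. Acting on its frozen learned values, it predictably fails
# to maximize reward in the new environment.
import numpy as np

rng = np.random.default_rng(0)

# Training environment: two actions.
TRAIN_ACTIONS = ["help_friend", "do_nothing"]
train_rewards = {"help_friend": 1.0, "do_nothing": 0.0}

q = {a: 0.0 for a in TRAIN_ACTIONS}  # learned value estimates
lr, eps = 0.1, 0.1
for _ in range(5000):
    # epsilon-greedy action selection during learning
    a = rng.choice(TRAIN_ACTIONS) if rng.random() < eps else max(q, key=q.get)
    q[a] += lr * (train_rewards[a] - q[a])

# New environment: same actions, plus a wirehead button with a huge reward.
new_rewards = {"help_friend": 1.0, "do_nothing": 0.0, "press_wirehead_button": 100.0}

# The agent keeps its learned values; the unfamiliar button starts at 0.
q_new = dict(q, press_wirehead_button=0.0)
chosen = max(q_new, key=q_new.get)
print("learned values:", {k: round(v, 2) for k, v in q_new.items()})
print("action in new environment:", chosen)  # help_friend, not the button
print("reward left on the table:", new_rewards["press_wirehead_button"] - new_rewards[chosen])
```

Whether this counts as a fair model of the human case is exactly what is in dispute; it is only meant to pin down what “continues to implement values acquired from the learning environment” cashes out to in the simplest possible setting.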
The actual claim I made in the comment you’re replying to is that there’s a predictable relationship between outer optimization criteria and inner values, not that inner values are always aligned with outer optimization criteria. In fact, I’d say we’d be in a pretty bad situation if inner goals reliably orientated towards reward maximization across all environments, because then any sufficiently powerful AGI would most likely wirehead once it was able to do so.