Hm? It’s as Nate says in the quote. It’s the same type of problem as humans inventing birth-control out of distribution. If you have an alternative proposal for how to build a diamond-maximizer, you can specify that for a response, but the commonly discussed idea of “train on examples of diamonds” will fail at inner-alignment: it will optimize for diamonds in a particular setting and then elsewhere do crazy other things that look like all kinds of white noise to you.
Also “expect this to fail” already seems to jump the gun. Who has a proposal for successfully building an AGI that can do this, other than saying gradient-descent will surprise us with one?
I don’t think that “evolution → human values” is the most useful reference class when trying to calibrate our expectations wrt how outer optimization criteria relate to inner objectives. Evolution didn’t directly optimize over our goals. It optimized over our learning process and reward circuitry. Once you condition on a particular human’s learning process + reward circuitry configuration + the human’s environment, you screen off the influence of evolution on that human’s goals. So, there are really two areas from which you can draw evidence about inner (mis)alignment:
“evolution’s inclusive genetic fitness criteria → a human’s learned values” (as mediated by evolution’s influence over the human’s learning process + reward circuitry)
“a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values”
The relationship we want to make inferences about is:
“a particular AI’s learning process + reward function + training environment → the AI’s learned values”
I think that “AI learning → AI values” is much more similar to “human learning → human values” than it is to “evolution → human values”. I grant that you can find various dissimilarities between “AI learning → AI values” and “human learning → human values”. However, I think there are greater dissimilarities between “AI learning → AI values” and “evolution → human values”. As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the “human learning → human values” analogy, not the “evolution → human values” analogy.
Additionally, I think we have a lot more total empirical evidence from “human learning → human values” compared to from “evolution → human values”. There are billions of instances of humans, and each of them has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once[1]. Thus, evidence from “human learning → human values” should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
I will grant that the variations between different humans’ learning processes / reward circuit configurations / learning environments are “sampling” over a small and restricted portion of the space of possible optimization process trajectories. This limits the strength of any conclusions we can draw from looking at the relationship between human values and human rewards / learning environments. However, I again hold that inferences from “evolution → human values” suffer from an even more extreme version of this same issue. “Evolution → human values” represents an even more restricted look at the general space of optimization process trajectories than we get from the observed variations in different humans’ learning processes / reward circuit configurations / learning environments.
There are many sources of empirical evidence that can inform our intuitions regarding how inner goals relate to outer optimization criteria. My current (not very deeply considered) estimate of how to weight these evidence sources is roughly:
~66% from “human learning → human values”
~4% from “evolution → human values”[2]
~30% from various other evidence sources on inner goals versus outer criteria, which I won’t address further in this comment:
economics
microbial ecology
politics
current results in machine learning
game theory / multi-agent negotiation dynamics
I think that using “human learning → human values” as our reference class for inner goals versus outer optimization criteria suggests a much more straightforward relationship between the two, as compared to the (lack of a) relationship suggested by “evolution → human values”. Looking at the learning trajectories of individual humans, it seems like the reflectively endorsed extrapolations of a given person’s values have a great deal in common with the sorts of experiences they’ve found rewarding in their lives up to that point in time. E.g., a person who grew up with and displayed affection for dogs probably doesn’t want a future totally devoid of dogs, or one in which dogs suffer greatly.
I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence. And I think this is very robust to the degree of capabilities you give the human. It’s probably not as robust to your choice of which specific human to try this with. E.g., many people would screw themselves over with reckless self-modification, given the capability to do so. My point is that higher capabilities alone do not automatically render inner values completely alien to those demonstrated at lower capabilities.
You can, of course, try to look at how population genetics relate to learned values to try to get more data from the “evolution → human values” reference class, but I think most genetic influences on values are mediated by differences in reward circuitry or environmental correlates of genetic variation. So such an investigation probably ends up mostly redundant in light of how the “human learning → human values” dynamics work out. I don’t know how you’d try and back out a useful inference about general inner versus outer relationships (independent from the “human learning → human values” dynamics) from that mess. In practice, I think the first order evidence from “human learning → human values” still dominates any evolution-specific inferences you can make here.
Even given the arguments in this comment, putting such a low weight on “evolution → human values” might seem extreme, but I have an additional reason, originally identified by Alex Turner, for further down weighting the evidence from “evolution → human values”. See this document on shard theory and search for “homo inclusive-genetic-fitness-maximus”.
But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn’t change? And we would need to find these somehow for AI and that process looks much more like evolution. And prediction of (non-instrumental) inner values is not robust across different reward functions—dogs only work because we already implemented environment-invariant compassion in reward circuitry.
But the main disanalogy in the “human learning → human values” case is that reward circuitry/brain architecture mostly doesn’t change?
Congenitally blind people end up with human values, despite that:
They’re missing entire chunks of vision-related hard coded rewards.
The entire visual cortex has been repurposed for other goals.
Evolution probably could not have “patched” the value formation process of blind people in the ancestral environment due to the massive fitness disadvantage blindness confers.
Human value formation can’t be that sensitive to delicate parameters of the learning process or reward circuitry.
And we would need to find these somehow for AI and that process looks much more like evolution.
We could learn a reward model from human judgements, train on human judgements directly, finetune a language model, etc. There are many options here.
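To make the first of those options concrete, here is a minimal, purely illustrative sketch of fitting a reward model to pairwise human judgements with a Bradley-Terry-style loss (the usual RLHF-style setup). The tiny network, the feature dimension, and the random stand-in data below are hypothetical placeholders, not a claim about how any particular system does it:

```python
# Minimal sketch: learn a scalar reward model from pairwise human judgements.
# The MLP, feature dimension, and synthetic data are hypothetical stand-ins.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per input

def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Model P(human prefers A over B) as sigmoid(r_A - r_B) and maximize the
    # log-likelihood of the human's actual choices (Bradley-Terry).
    return -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    # Stand-in for featurized (preferred, rejected) outcome pairs labelled by humans.
    preferred, rejected = torch.randn(32, 128), torch.randn(32, 128)
    loss = preference_loss(model(preferred), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```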
And prediction of (non-instrumental) inner values is not robust across different reward functions—dogs only work because we already implemented environment-invariant compassion in reward circuitry.
I don’t agree. If you slightly increase the strength of the reward circuits that rewarded the person for interacting with dogs, you get someone who likes dogs a bit more, not someone who wants to tile the universe with tiny molecular dog faces.
Also, reward circuitry does not, and cannot implement compassion directly. It can only reward you for taking actions that were probably driven by compassion. This is a very dumb approach that nevertheless actually literally works in actual reality, so the problem can’t be that hard.
The most important claim in your comment is that “human learning → human values” is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective. Here’s why I disagree:
Evolution optimized humans for an environment very different from what we see today. This implies that humans are operating out-of-distribution. We see evidence of misalignment. Birth control is a good example of this.
A human’s environment optimizes a human continually towards a certain objective (one that changes given changes in the environment). This human is aligned with the environment’s objective in that distribution. Outside that distribution, the human may not be aligned with the objective intended by the environment.
An outer misalignment example of this is a person brought up in a high-trust environment, and then thrown into a low-trust / high-conflict environment. Their habits and tendencies make them an easy mark for predators.
An inner misalignment example of this is a gay male who grows up in an environment hostile to his desires and his identity (but knows of environments where this isn’t true). After a few extremely negative reactions to him opening up to people, or expressing his desires, he’ll simply decide to present himself as heterosexual and bide his time and gather the power to leave the environment he is in.
One may claim that the previous example somehow doesn’t count because, since one’s sexual orientation is biologically determined (and I’m assuming this to be the case for this example, even if this may not be entirely true), evolution optimized this particular human for being inner misaligned relative to their environment. However, that doesn’t weaken this argument: “human learning → human values” shows a huge amount of evidence of inner misalignment being ubiquitous.
I worry you are being insufficiently pessimistic.
There may not be substantial disagreements here. Do you agree with:
“a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values” (e.g. Two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry which leads to very different values than people with typical reward circuitry even given similar experiences)
The most important claim in your comment is that “human learning → human values” is evidence that inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective. Here’s why I disagree:
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given. See:
I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence.
This matches my intuitions.
Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”
What I see is that we are taking two different optimizers applying optimization pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than another. This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
One implication I read was that inner values learned (ie the inner-misaligned values) may scale, which is the opposite prediction usually given.
I assume you mean that Quintin seems to claim that inner values learned may be retained with increase in capabilities, and that usually people believe that inner values learned may not be retained with increase in capabilities. I believe so too—inner values seem to be significantly robust to increase in capabilities, especially since one has the option to deceive. Do people really believe that inner values learned don’t scale with an increase in capabilities? Perhaps we are defining inner values differently here.
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with increase in capabilities, people’s inner values shift? Not exactly; it seems to me that we were mistaken about people’s inner values instead.
This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you’re ignoring the [now bolded part] in “a particular human’s learning process + reward circuitry + “training” environment” and just focusing on the environment. Humans very often don’t optimize for their reward circuitry in their limbic system. If I gave you a button that killed everyone but maximized your reward circuitry every time you pressed it, most people wouldn’t press it (would you?). I do agree that if you pressed the button once, you would then want to press the button again, but not beforehand, which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you’d say the wirehead thing is an extreme case OOD?
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.
I agree, but I’m bolding “most people” because you’re claiming there exist some people that would retain that value if scaled up(?) I think replace “dog-lover” w/ “family-lover” and there’s even more people. But I don’t think this is a disagreement between us?
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping). Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping).
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
Human values are formed by inner-misalignment and they have lots of great properties such as avoiding ontological crises, valuing real world things (like diamond maximizer in the OP), and a subset of which cares for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the inner values with that of the creators of the AI and bypassing the outer alignment problem. That is really interesting; I’ve updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I’m still confused about huge parts of it, but we can discuss it more elsewhere.
The most important claim in your comment is that “human learning → human values” is evidence that solving / preventing inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective.
That’s not a claim I made in my comment. It’s technically a claim I agree with, but not one I think is particularly important. Humans do seem better aligned to getting reward across distributional shifts than to achieving inclusive genetic fitness across distributional shifts. However, I’ll freely agree with you that humans are typically misaligned with maximizing the reward from their outer objectives.
I operationalize this as: “After a distributional shift from their learning environment, humans frequently behave in a manner that predictably fails to maximize reward in their new environment, specifically because they continue to implement values they’d acquired from their learning environment which are misaligned to reward maximization in the new environment”. Please let me know if you disagree with my operationalization.
For example, one way in which humans are inner misaligned is that, if you introduce a human into a new environment which has a button that will wirehead the human (thus maximizing reward in the new environment), but has other consequences that are bad by the lights of the human’s preexisting values (e.g., Logan’s example of killing everyone else), most humans won’t push the button.
The actual claim I made in the comment you’re replying to is that there’s a predictable relationship between outer optimization criteria and inner values, not that inner values are always aligned with outer optimization criteria. In fact, I’d say we’d be in a pretty bad situation if inner goals reliably oriented towards reward maximization across all environments, because then any sufficiently powerful AGI would most likely wirehead once it was able to do so.
This isn’t addressing straw-Ngo/Shah’s objection? Yes, evolution optimized for fitness, and got adaptation-executors that invent birth control because they care about things that correlated with fitness in the environment of evolutionary adaptedness, and don’t care about fitness itself. The generalization from evolution’s “loss function” alone, to modern human behavior, is terrible and looks like all kinds of white noise.
But the generalization from behavior in the environment of evolutionary adaptedness, to modern human behavior is … actually pretty good? Humans in the EEA told stories, made friends, ate food, &c., and modern humans do those things, too. There are a lot of quirks (like limited wireheading in the form of drugs, candy, and pornography), but it’s far from white noise. AI designers aren’t in the position of “evolution” “trying” to build fitness-maximizers, because they also get to choose the training data or “EEA”—and in that context, the analogy to evolution makes it look like some degree of “correct” goal generalization outside of the training environment is a thing?
Obviously, the conclusion here is not, “And therefore everything will be fine and we have nothing to worry about.” Some nonzero amount of goal generalization doesn’t mean the humans survive or that the outcome is good, because there are still lots of ways for things to go off the rails. (A toy not-even-model: if you keep 0.95 of your goals with each “round” of recursive self-improvement, and you need 100 rounds to discover the correct theory of alignment, you actually only keep 0.95^100 ≈ 0.006 of your goals.) We would definitely prefer not to bet the universe on “Train it, while being aware of inner alignment issues, and hope for the best”!! But it seems to me that the well-rehearsed “birth control, therefore paperclips” argument is missing a lot of steps?!
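(Spelling out that toy arithmetic, with the same purely illustrative numbers as above:)

```python
# Toy not-even-model from the paragraph above: retain a fixed fraction of your
# goals per round of recursive self-improvement, compounded over many rounds.
retention_per_round = 0.95
rounds = 100
goals_retained = retention_per_round ** rounds
print(f"fraction of goals retained: {goals_retained:.4f}")  # ≈ 0.0059
```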
I strongly agree. I think there are vastly better sources of evidence on how inner goals relate to outer selection criteria than “evolution → human values”, and that most of those better sources of evidence paint a much more optimistic picture of how things are likely to go with AI. I think there are lots of reasons to be skeptical of using “evolution → human values” as an informative reference class for AIs, some of which I’ve described in my reply to Ben’s comment.
I don’t think the usual arguments apply as obviously here. “Maximal Diamond” is much simpler than most other optimization targets. It seems much easier to solve outer-alignment for – Diamond was chosen because it’s a really simple molecule configuration to specify, and that just seems to be a pretty different scenario than most of the ones I’ve seen more detailed arguments for.
I’m partly confused about the phrasing “we have no idea how to do this.” (which is stronger than “we don’t currently have a plan for how to do this.”)
But in the interests of actually trying to answer this sort of thing for myself instead of asking Nate/Eliezer to explain why it doesn’t work, let me think through my own proposal of how I’d go about solving the problem, and see if I can think of obvious holes.
Problems currently known to me:
Reward hijacking
Point 19 in List of Lethalities (“there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment”).
Ontological updating (i.e. what exactly is a diamond?)
New to me from this post: the most important capabilities advances may come from an inner process that isn’t actually coupled to the reinforcement learning system. (I didn’t really get this until reading this post and haven’t finished thinking through the concept)
Main ingredients I’m imagining (disclaimer: I’m a layman making a lot of informed guesses; wouldn’t be surprised if several of them are wrong):
First, get a general agent, with limitations to prevent immediate fooming. Get to general intelligence via something like DeepMind’s General Agents, this time starting from a language model that benefits from a lot of human concepts. My current belief is that you’d need to solve some major efficiency issues to do this with a reasonable amount of compute. If you have a Jupiter brain (as originally stipulated) I’m not sure it even requires new advances.
(Maybe scrub the language model of all references to ML/programming, initially. They’ll be helpful eventually but maybe don’t give the AGI a headstart on self-modification.)
Train it to maximize diamonds in many different environments. Starting with (relatively) modest amounts of compute, train it to make diamonds in different simulated and physical worlds. Initially, at subhuman intelligence, the AGI isn’t expected to invent its own diamond-making technology. But it’s presented with different physical-or-simulated tools that make things similar-but-non-identical to diamonds, and the tools have internal parts it can understand, and it’s rewarded for choosing between them accurately. So it has to develop an understanding of physics. (A toy sketch of this training loop follows the list of training sets below.)
(I think the outer-alignment goal here is to get it to advance at physics faster than self-modification, so that you can force it to learn ontological problems before it could get subverted by them).
Some notable training-sets it needs to include:
digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion
its ability to parse what’s going on in the digital worlds depends on sensors that are present in the digital world (also for the physical world), and there are different arrays of sensors in different worlds. It’s trained against situations where it has the ability to modify its sensors for simple reward hacking.
eventually it’s taught programming/hardware know-how, and put in fairly simplified situations where the solution to its puzzle is to notice that it physically exists, and make changes to its hardware or software, but it doesn’t directly hijack its own reward function.
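Here is the toy sketch promised above of what such a multi-environment training loop might look like. The environment names, the stand-in rewards, and the agent hooks are all made up for illustration, and this glosses over every hard part discussed elsewhere in the thread (deception, reward hacking, ontology shifts):

```python
# Toy, hypothetical sketch of training an agent on diamond-count reward across
# several simulated worlds. Everything here is a placeholder, not a real recipe.
import random

class DiamondEnv:
    """Stand-in for one simulated world with its own physics and sensor suite."""
    def __init__(self, name: str, physics_variant: str):
        self.name, self.physics_variant = name, physics_variant

    def reset(self):
        return {"sensors": [0.0] * 8, "physics": self.physics_variant}

    def step(self, action):
        # Stub dynamics: the action is ignored; reward is the (simulated) amount
        # of diamond-configured matter produced this step.
        obs = {"sensors": [random.random() for _ in range(8)], "physics": self.physics_variant}
        diamonds_made = random.random()
        return obs, diamonds_made, False

def train(agent_policy, agent_update, episodes_per_env: int = 10, steps: int = 100):
    envs = [
        DiamondEnv("lab-sim", "standard"),
        DiamondEnv("altered-subatomic", "nonstandard"),  # ontology-shift world
        DiamondEnv("sensor-tamper", "standard"),          # sensors can be modified
    ]
    for env in envs:
        for _ in range(episodes_per_env):
            obs = env.reset()
            for _ in range(steps):
                action = agent_policy(obs)
                obs, reward, done = env.step(action)
                agent_update(obs, action, reward)  # e.g. an RL update on diamond reward
                if done:
                    break

if __name__ == "__main__":
    # Trivial stand-in agent: random actions, no learning.
    train(agent_policy=lambda obs: random.random(),
          agent_update=lambda obs, action, reward: None)
```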
(Note: thinking through all these specific details is pretty helpful for noticing how many steps are involved here. I think for this sort of plan to work you actually need a lot of different puzzles that are designed to be solvable with safe amounts of compute, so it doesn’t just bulldoze past your training setup. Designing such puzzles seems pretty time consuming. In practice I don’t expect the Successfully Aligned “murder everyone and make diamonds forever” bot to be completed before the “murder everyone and make some Random Other Thing Forever” bot)
Even though my goal is a murder-bot-that-makes-diamonds-forever, I’m probably coupling all of this with attempts at corrigibility training, dealing with uncertainty, impact tracking, etc, to give myself extra time to notice problems. (i.e. if the machine isn’t sure whether the thing it’s making is diamond, it makes a little bit first, asks humans to verify that it’s diamond, etc. Do similar training on “don’t modify the input channel for ‘was it actually diamond tho?’”)
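A hypothetical sketch of that “make a little bit first, then ask” behavior. The classifier, the confidence estimate, and the human-query channel are placeholder callables, and nothing here addresses a model that games the check:

```python
# Hypothetical sketch of the "make a small sample, then ask a human" check
# described above. All callables are placeholders supplied by the caller.
def cautious_diamond_step(make_diamond, classify_as_diamond, ask_human,
                          confidence_threshold=0.99):
    small_sample = make_diamond(amount="small")
    confidence = classify_as_diamond(small_sample)
    if confidence < confidence_threshold:
        # Uncertain whether it's actually diamond: defer to human verification
        # instead of scaling up production.
        if not ask_human("Is this sample actually diamond?", small_sample):
            return None  # abandon this production process
    return make_diamond(amount="large")
```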
Assuming those tricks all work and hold up under tons of optimization pressure, this all still leaves us with inner alignment, and point #4 on my list of known-to-me-concerns. “The most important capabilities advances may come from an inner process that isn’t actually coupled to the reinforcement learning system.”
And… okay actually this is a new thought for me, and I’m not sure how to think about it yet. I can see how it was probably meant to be included in the “confusingly pervasive consequentialism” concept, but I didn’t get the “and therefore, impervious to gradient descent” argument till just now.
I’m out of time for now, will think about this more.
I think even without point #4 you don’t necessarily get an AI maximizing diamonds. Heuristically, it feels to me like you’re bulldozing open problems without understanding them (e.g. ontology identification by training with multiple models of physics, getting it not to reward-hack by explicit training, etc.) all of which are vulnerable to a deceptively aligned model (just wait till you’re out of training to reward-hack). Also, every time you say “train it by X so it learns Y” you’re assuming alignment (e.g. “digital worlds where the sub-atomic physics is different, such that it learns to preserve the diamond-configuration despite ontological confusion”)
IMO shard theory provides a great frame to think about this in, it’s a must-read for improving alignment intuitions.