There may not be substantial disagreements here. Do you agree with:
“a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values” (e.g. two twins could have different life experiences and have different values, or a sociopath may have different reward circuitry, which leads to very different values from those of people with typical reward circuitry, even given similar experiences)
The most important claim in your comment is that “human learning → human values” is evidence that inner misalignment is easier than it seems when one looks at it from the “evolution → human values” perspective. Here’s why I disagree:
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
One implication I read was that the inner values learned (i.e. the inner-misaligned values) may scale, which is the opposite of the prediction usually given. See:
I also think this regularity in inner values is reasonably robust to sharp left turns in capabilities. If you take a human whose outer behavior suggests they like dogs, and give that human very strong capabilities to influence the future, I do not think they are at all likely to erase dogs from existence.
Do you agree with: “a particular human’s learning process + reward circuitry + “training” environment → the human’s learned values” is more informative about inner-misalignment than the usual “evolution → human values”
What I see is that we are taking two different optimizers applying optimization pressure on a system (evolution and the environment), and then stating that one optimization provides more information about a property of OOD behavior shift than the other. This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I don’t know what you mean by “inner misalignment is easier”? Could you elaborate? I don’t think you mean “inner misalignment is more likely to happen” because you then go on to explain inner-misalignment & give an example and say “I worry you are being insufficiently pessimistic.”
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
One implication I read was that the inner values learned (i.e. the inner-misaligned values) may scale, which is the opposite of the prediction usually given.
I assume you mean that Quintin seems to claim that the inner values learned may be retained with an increase in capabilities, and that people usually believe that the inner values learned may not be retained with an increase in capabilities. I believe so too: inner values seem to be significantly robust to an increase in capabilities, especially since one has the option to deceive. Do people really believe that the inner values learned don’t scale with an increase in capabilities? Perhaps we are defining inner values differently here.
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is. Does that mean that with an increase in capabilities, people’s inner values shift? Not exactly; it seems to me that we were mistaken about people’s inner values instead.
This doesn’t make sense to me, particularly since I believe that most people live in environments that are very much “in distribution”, and it is difficult for us to discuss misalignment without talking about extreme cases (as I described in the previous comment), or subtle cases (black swans?) that may not seem to matter.
I think you’re ignoring the [now bolded part] in “a particular human’s learning process + reward circuitry + “training” environment” and just focusing on the environment. Humans very often don’t optimize for the reward circuitry in their limbic system. If I gave people a button that killed everyone but maximized their reward circuitry every time they pressed it, most wouldn’t press it (would you?). I do agree that if you pressed the button once, you would then want to press it again, but you would not want to beforehand, which is an inner-misalignment w/ respect to the reward circuitry. Though maybe you’d say the wirehead thing is an extreme OOD case?
By inner values, I mean terminal goals. Wanting dogs to be happy is not a terminal goal for most people, and I believe that given enough optimization pressure, the hypothetical dog-lover would abandon this goal to optimize for what their true terminal goal is.
I agree, but I’m bolding “most people” because you’re claiming there exist some people who would retain that value if scaled up(?). I think if you replace “dog-lover” w/ “family-lover”, there are even more such people. But I don’t think this is a disagreement between us?
My bad; I’ve updated the comment to clarify that I believe Quintin claims that solving / preventing inner misalignment is easier than one would expect given the belief that evolution’s failure at inner alignment is the most significant and informative evidence that inner alignment is hard.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping). Human values are formed by inner-misalignment, and they have lots of great properties, such as avoiding ontological crises and valuing real-world things (like the diamond maximizer in the OP), and a subset of them care for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
Oh, I think inner-misalignment w/ respect to the reward circuitry is a good, positive thing that we want, so there’s the disconnect (usually misalignment is thought of as bad, and I’m not just mistyping).
Yes, thank you: I didn’t notice that you were making that assumption. This conversation makes a lot more sense to me now.
Human values are formed by inner-misalignment, and they have lots of great properties, such as avoiding ontological crises and valuing real-world things (like the diamond maximizer in the OP), and a subset of them care for all of humanity. We can learn more about this process by focusing more on the “a particular human’s learning process + reward circuitry + “training” environment” part, and less on the evolution part.
If we understand the underlying mechanisms behind human value formation through inner-misalignment w/ respect to the reward circuitry, then we might be able to better develop the theory of learning systems developing values, which includes AGI.
This seems to imply that the aim of this alignment proposal is to solve the alignment problem by aligning the AI’s inner values with those of its creators, bypassing the outer alignment problem. That is really interesting; I’ve updated in the direction of shard theory being more viable as an alignment strategy than I previously believed. I’m still confused about huge parts of it, but we can discuss it more elsewhere.
This matches my intuitions.