OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn’t found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here’s one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever found to exist. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their “love” value to configurations of atoms? If it’s really hard to get intelligences to care about reality, how does the genome do it millions of times each day?
Taking an item from your lethalities post:
19… More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
There is a guaranteed-to-exist mechanistic story for how the human genome solves lethality no. 19, because people do reliably form (at least some of) their values around their model of reality. (For more on what I mean by this, see this comment.) I think the genome probably does solve this lethality using loss functions and relatively crude reward signals, and I think I have a pretty good idea of how that happens.
I haven’t made a public post out of my document on shard theory yet, because of idea inoculation. Apparently, the document isn’t yet written well enough to yank people out of their current misframings of alignment. Maybe the doc has clicked for 10 people. Most readers trip on a miscommunication, stop far before they can understand the key insights, and taper off because it seems like Just Another Speculative Theory. I apparently don’t know how to credibly communicate that the theory is actually really important to evaluate & critique ASAP, because time is slipping away. But I’ll keep trying anyway.
I’m attempting this comment in the hopes that it communicates something. Perhaps this comment is still unclear, in which case I ask the reader’s patience for improved future communication attempts.
[1] Like:
1. “Human beings tend to bind their terminal values to their model of reality”, or
2. “Human beings reliably navigate ontological shifts. Children remain excited about animals after learning they are made out of cells. Physicists don’t stop caring about their family because they can model the world in terms of complex amplitudes.”
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models, not just over the senses.
But this is not addressing all of the problem in Lethality 19. What’s missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
for AGIs there’s a principal (humans) that we want to align the AGI to
for humans there is no principal—our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn’t solved.
I addressed this distinction previously, in one of the links in the OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external at all (any external target would do). But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
The important disanalogy
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed.
for humans there is no principal—our values can be whatever
Huh? I think I misunderstand you. I perceive you as saying: “There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values.”
If so, I strongly disagree. Like, in the world where that is true, wouldn’t parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not “whatever”: human values are generally formed over predictable kinds of real-world objects, like dogs and people and tasty food.
Or if you take evolution as the principal, the alignment problem wasn’t solved.
The linked theory makes it obvious why evolution couldn’t have possibly solved the human alignment problem. To quote:
Since human values are generally defined over the learned human WM, evolution could not create homo inclusive-genetic-fitness-maximus.
If values form because reward sends reinforcement flowing back through a person’s cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-reduction process and thus are easy to form values around.
But if there aren’t local mutations which make a person more likely to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can’t instill this value, even if the descendants of that person will later be able to think thoughts about fitness.
On the other hand, under this theory, human values (by their nature) usually involve concepts which are easy to form shards of value around… Shard theory provides a story for why we might succeed at shard-alignment, even though evolution failed.
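To make the credit-assignment mechanism in the quoted passage concrete, here is a minimal toy sketch. It is my own illustration under the quote’s assumptions, not code from the shard theory document, and all of the names (update_shards, the concept strings) are hypothetical. The only point it encodes is structural: reinforcement can only strengthen concepts that are actually active when reward arrives.

```python
# Toy sketch (editorial illustration, not from the shard theory doc): value "shards"
# only accrue strength around concepts that are active in the agent's world model
# when reward arrives, because credit assignment can only reinforce active thoughts.
from collections import defaultdict

def update_shards(shard_strengths, active_concepts, reward, lr=0.1):
    """Reinforce every concept that was active when the reward event occurred."""
    for concept in active_concepts:
        shard_strengths[concept] += lr * reward
    return shard_strengths

shards = defaultdict(float)

# Concepts like "lollipop" and "person" are learned early and are active often,
# so reward repeatedly reinforces them and shards form around them.
for _ in range(100):
    update_shards(shards, active_concepts={"lollipop", "person"}, reward=1.0)

# "inclusive genetic fitness" is never an active thought during any reward event,
# so credit assignment never touches it and no shard ever forms around it.
print(shards["lollipop"])                   # ~10.0 (100 reward events * 0.1 * 1.0)
print(shards["inclusive genetic fitness"])  # 0.0
```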
(Edited to expand my thoughts)
I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I’ll summarise as “produce a mind that...”:
1. cares about something
2. cares about something external (not shallow function of local sensory data)
3. cares about something specific and external
(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if it were about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
3a. there is a reliable (and maybe predictable) mapping between the specific targets of caring and the mind-producing process
3b. there is a principal who gets to choose what the specific targets of caring are (and they succeed)
Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.
Hm. I feel confused about the importance of 3b as opposed to 3a. Here’s my first guess: Because we need to target the AI’s motivation in particular ways in order to align it with particular desired goals, it’s important for there not just to be a predictable mapping, but a flexibly steerable one, such that we can choose to steer towards “dog” or “rock” or “cheese wheels” or “cooperating with humans.” Is this close?
Yes, that sounds right to me.
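To spell out the 3a/3b distinction being agreed on here, a minimal sketch follows. This is my own framing, not either commenter’s; the function names and the reward-dictionary “spec” are hypothetical. In the toy, the mapping from spec to values is trivially invertible; the point is only to show what 3a gives you and what 3b additionally demands, since for real minds the mapping is not known well enough to invert.

```python
def mind_producing_process(spec: dict) -> set:
    """Toy process: the produced mind ends up valuing whichever concepts the spec
    attaches positive reward to -- a reliable, predictable mapping (feat 3a)."""
    return {concept for concept, r in spec["reward_on"].items() if r > 0}

# Feat 3a: rerunning the process on the same spec reliably yields the same kind of values.
spec = {"reward_on": {"dogs": 1.0, "people": 0.5, "paperclips": 0.0}}
assert mind_producing_process(spec) == {"dogs", "people"}

# Feat 3b: a principal picks the target first, then must find a spec that hits it.
def steer(desired: set) -> dict:
    """Trivial in this toy; for real minds, this inversion is the open problem."""
    return {"reward_on": {concept: 1.0 for concept in desired}}

assert mind_producing_process(steer({"cooperating with humans"})) == {"cooperating with humans"}
```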
Or if you take evolution as the principal, the alignment problem wasn’t solved.
In what sense? Because modern humans use birth control? Then what do you make of the fact that most people seem to care about whether biological humans exist a billion years hence?
People definitely do not terminally care about inclusive genetic fitness in its pure abstract form; there is not something inside of them which pushes for plans which increase inclusive genetic fitness. Evolution failed at alignment, strictly speaking.
I think it’s more complicated to answer “did evolution kinda succeed, despite failing at direct alignment?”, and I don’t have time to say more at the moment, so I’ll stop there.
I think the focus on “inclusive genetic fitness” as evolution’s “goal” is weird. I’m not even sure it makes sense to talk about evolution’s “goals”, but if you want to call it an optimization process, the choice of “inclusive genetic fitness” as its target is arbitrary, as there are many other boundaries one could trace. Evolution is acting at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it is not selecting adaptations which increase the genetic fitness of an individual but lead to the extinction of the species later. In the most basic sense, evolution is selecting for “things that expand” in the entire universe, and humans definitely seem partially aligned with that—the ways in which they aren’t seem non-competitive with this goal.
I don’t know; if I were a supervillain, I’d certainly have a huge number of kids and also modify my and my children’s bodies to be more “inclusively genetically fit” in any way my scientist-lackeys could manage. Parents also regularly put huge amounts of effort into their children’s fitness, although we might quibble about whether in our culture they strike the right balance of economic, physical, social, emotional, etc. fitness.