Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Maybe I’m not understanding your proposal, but on the face of it this seems like a change of topic. I don’t see Eliezer claiming ‘there’s no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head’. Maybe he does think that, but mostly I’d guess he doesn’t care, because the important thing is whether you can point the AGI at very, very specific real-world tasks.
Where is the accident? Did evolution accidentally find a way to reliably orient people towards the real world? Do people each, individually, accidentally learn to care about the real world?
Same objection/confusion here, except now I’m also a bit confused about what you mean by “orient people towards the real world”. Your previous language made it sound like you were talking about causing the optimizer’s goals to point at things in the real world, but now your language makes it sound like you’re talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.
Or, in summary, I’m not seeing the connection between:
“Terminally valuing anything physical at all” vs. “terminally valuing very specific physical things”.
“Terminally valuing anything physical at all” vs. “instrumentally valuing anything physical at all”.
“Terminally valuing very specific physical things” vs. “instrumentally valuing very specific physical things”.
Any of the above vs. “modeling / thinking about physical things at all”, or “modeling / thinking about very specific physical things”.
Hm, I’ll give this another stab. I understand the first part of your comment as “sure, it’s possible for minds to care about reality, but we don’t know how to target value formation so that the mind cares about a particular part of reality.” Is this a good summary?
I don’t see Eliezer claiming ‘there’s no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head’.
Let me distinguish three alignment feats:
1. Producing a mind which terminally values sensory entities.
2. Producing a mind which reliably terminally values some kind of non-sensory entity in the world, like dogs or bananas.
AFAIK we have no idea how to ensure this happens reliably—to produce an AGI which terminally values some element of {diamonds, dogs, cats, tree branches, other real-world objects}, such that there’s a low probability that the AGI actually just cares about high-reward sensory observations.
In other words: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals.
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds.
Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
This is what you point out as a potential crux.
From my shard theory document:
We (alignment researchers) have had no idea how to actually build a mind which intrinsically (not instrumentally!) values a latent, non-sensory object in the real world. Witness the confusion on this point in Arbital’s ontology identification article.
To my knowledge, we still haven’t solved this problem. We have no reward function to give AIXI which makes AIXI maximize real-world diamonds. A deep learning agent might learn to care about the real world, yes, but it might learn sensory preferences instead. Ignorance about the outcome is not a mechanistic account of why the agent convergently will care about specific real-world objects instead of its sensory feedback signals.
Under this account, caring about the real world is just one particular outcome among many. Hence, the “classic paradigms” imply that real-world caring is (relatively) improbable.
While we have stories about entities which value paperclips, I do not think we have known how to design them. Nor have we had any mechanistic explanation for why people care about the real world in particular.
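A minimal toy sketch of the sensory-versus-latent distinction in the quoted passage (the environment, variable names, and numbers below are purely illustrative assumptions, not anything from the document): a reward function defined only over observations cannot distinguish a plan that makes real diamonds from one that merely makes the camera show diamonds.

```python
# Toy sketch: two candidate "plans", represented by the latent world-state
# they lead to. All names and numbers here are illustrative assumptions.
make_diamonds = {"actual_diamonds": 10, "camera_shows_diamonds": 10}
tamper_with_camera = {"actual_diamonds": 0, "camera_shows_diamonds": 10}

def sensory_reward(state):
    # A reward function the designer can actually write down: it only
    # sees the observation channel (what the camera reports).
    return state["camera_shows_diamonds"]

def latent_utility(state):
    # The objective we want: a function of the latent world-state.
    # The point of feats #2 and #3 is that we do not know how to get a
    # trained mind to reliably optimize something like this.
    return state["actual_diamonds"]

for name, plan in [("make diamonds", make_diamonds),
                   ("tamper with camera", tamper_with_camera)]:
    print(name, sensory_reward(plan), latent_utility(plan))
# sensory_reward scores both plans identically (10 vs. 10), so optimizing
# it exerts no pressure toward caring about real-world diamonds;
# latent_utility separates them (10 vs. 0).
```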
As you point out, we obviously need to figure problem 3 out in order to usefully align an AGI. I will now argue that the genome solves problem 3, albeit not in the sense of aligning humans with inclusive genetic fitness (you can forget about human/evolution alignment, I won’t be discussing that in this comment).
The genome solves problem #3 in the sense of: if a child grows up with a dog, then that child will (with high probability) terminally value that dog.
Isn’t that an amazing alignment feat!?
Therefore, there has to be a reliable method of initializing a mind from scratch, training it, and having the resultant intelligence care about dogs. Not only does it exist in principle, it succeeds in practice, and we can think about what that method might be. I think this method isn’t some uber-complicated alignment solution. The shard theory explanation for dog-value formation is quite simple.
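As a rough caricature of that simplicity claim (an illustrative sketch by analogy, not the shard theory document’s actual model): reward that arrives while a particular world-model concept is active strengthens the decision-weight, or “shard”, attached to that concept, so the concept itself ends up valued.

```python
import random

# Caricature: credit assignment reinforces whichever concept was active
# when reward arrived. Concept names, the reward schedule, and the
# learning rate are illustrative assumptions.
concepts = ["dog", "ball", "couch"]
shard_strength = {c: 0.0 for c in concepts}
learning_rate = 0.1

random.seed(0)
for _ in range(2000):
    active = random.choice(concepts)           # what the child is attending to
    reward = 1.0 if active == "dog" else 0.0   # e.g., petting the dog feels good
    # Strengthen the shard tied to the currently active concept.
    shard_strength[active] += learning_rate * (reward - shard_strength[active])

print(shard_strength)
# The "dog" shard ends up with by far the largest weight: the dog concept
# itself has acquired decision-influence, even though "value dogs" was
# never specified anywhere; only a crude, sensory-level reward signal was.
```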
now your language makes it sound like you’re talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.
Nope, wasn’t meaning any of these! I was talking about “causing the optimizer’s goals to point at things in the real world” the whole time.
I understand the first part of your comment as “sure, it’s possible for minds to care about reality, but we don’t know how to target value formation so that the mind cares about a particular part of reality.” Is this a good summary?
Yes!
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
True! Though everyone already agreed (e.g., EY asserted this in the OP) that it’s possible in principle. The updatey thing would be if the case of the human genome / brain development suggests it’s more tractable than we otherwise would have thought (in AI).
Seems to me like it’s at least a small update about tractability, though I’m not sure it’s a big one? Would be interesting to think about the level of agreement between different individual humans with regard to ‘how much particular external-world things matter’. Especially interesting would be cases where humans consistently, robustly care about a particular external-world thingie even though it doesn’t have a simple sensory correlate.
(E.g., humans developing to care about sex is less promising insofar as it depends on sensory-level reinforcement such as orgasms. Humans developing to care about ‘not being in the Matrix / not being in an experience machine’ is possibly more promising, because it seems like a pretty common preference that doesn’t get directly shaped by sensory rewards.)
3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds
Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer, and I think being able to build a diamond maximizer would also suggest the strawberry-grade alignment problem is mostly solved.)
But maybe I’m misunderstanding 2.
Nope, wasn’t meaning any of these! I was talking about “causing the optimizer’s goals to point at things in the real world” the whole time.
Cool!
I’ll look more at your shards document and think about your arguments here. :)
Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
Feat #2 is: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.
Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer
I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad[1] in the shard theory paradigm.
Surprisingly, I weakly suspect the harder part is getting the agent to maximize real-world dogs in expectation, not getting the agent to care about real-world dogs. I think “figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs” is significantly easier than building a dog-maximizer.
I appreciate that this claim is hard to swallow. In any case, I want to focus on inferentially-closer questions first, like how human values form.