Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
Humans haven’t been optimized to pursue inclusive genetic fitness for very long, because humans haven’t been around for very long. Instead they inherited the crude heuristics pointing towards inclusive genetic fitness from their cognitively much less sophisticated predecessors. And those still kinda work!
If we are still around in a couple of million years I wouldn’t be surprised if there was inner alignment in the sense that almost all humans in almost all practically encountered environments end up consciously optimising inclusive genetic fitness.
More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
Generally, I think that people draw the wrong conclusions from mesa-optimisers and the examples of human evolutionary alignment.
Saying that we would like to solve alignment by specifying exactly what we want and then letting the AI learn exactly what we want is like saying that we would like to solve transportation by inventing teleportation. Yeah, that would be nice, but unfortunately it seems like you will have to move through space instead.
The conclusion we should take from the concept of mesa-optimisation isn’t “oh no alignment is impossible”, that’s equivalent to “oh no learning is impossible”. But learning is possible. So the correct conclusion is “alignment has to work via mesa-optimisation”.
Because alignment in the human examples (i.e. humans’ alignment to evolution’s objective and humans’ alignment to human values) works by bootstrapping from incredibly crude heuristics. Think three dark patches for a face.
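As a toy illustration of just how crude such a bootstrap heuristic can be, here is a sketch of a “three dark patches” detector. The grid split, the threshold, and the cell choices are all invented for illustration; this is not a claim about infant vision or about any mechanism discussed in the thread.

```python
import numpy as np

def crude_face_signal(image: np.ndarray, dark_thresh: float = 0.5) -> float:
    """Toy 'three dark patches' heuristic: emits a reward-like signal when
    three dark regions sit in a rough eyes-above-mouth arrangement."""
    h, w = image.shape
    # Carve the image into a coarse 3x3 grid and measure average brightness per cell.
    cells = [image[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3].mean()
             for i in range(3) for j in range(3)]
    dark = [m < dark_thresh for m in cells]
    # "Eyes" = top-left and top-right cells dark, "mouth" = bottom-middle cell dark.
    return 1.0 if (dark[0] and dark[2] and dark[7]) else 0.0

# A schematic "face": two dark patches up top, one at the bottom middle.
img = np.ones((30, 30))
img[1:9, 1:9] = 0.0      # left eye
img[1:9, 21:29] = 0.0    # right eye
img[21:29, 11:19] = 0.0  # mouth
print(crude_face_signal(img))  # 1.0
```

The point of the sketch is only that a signal this shallow can still seed a learning process that ends up valuing something far richer than the signal itself.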
Humans are mesa-optimized to adhere to human values. If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.
I mean even more so …
To me the human examples suggest that there has to be a possibility to get from gesturing at what we want to getting what we want. And I think we can gesture a lot better than evolution! Well, at least using much more information than 3.2 billion base pairs.
If alignment has to be a bootstrapped, open-ended learning process, there is also the possibility that it will work better with more intelligent systems, or really only start working with fairly intelligent systems.
Maybe bootstrapping with cake, kittens and cuddles will still get us paperclipped, I don’t know. It certainly seems awfully easy to just run straight off a cliff. But I think looking at the only known examples of alignment of intelligences does allow us more optimistic takes than are prevalent on this page.
The conclusion we should take from the concept of mesa-optimisation isn’t “oh no alignment is impossible”, that’s equivalent to “oh no learning is impossible”.
The OP isn’t claiming that alignment is impossible.
If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.
The point I’m making is that the human example tells us that:
First we realize that we can’t code up our values, and therefore that alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn’t update towards “alignment is even harder”. We should update in the opposite direction.
Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things.
But I only ever see these three points (the human example, the inability to code up values, mesa-optimisation) used separately to argue for “alignment is even harder than previously thought”. Taken together, that is just not the picture.
Humans point to some complicated things, but not via a process that suggests an analogous way to use natural selection or gradient descent to make a mesa-optimizer point to particular externally specifiable complicated things.
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Likewise, when you wrote,
This isn’t to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident.
Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, accidentally learn to terminally care about the real world? Because the former implies the existence of a better alignment paradigm (that which occurs within the human brain, to take an empty-slate human and grow them into an intelligence which terminally cares about objects in reality), and the latter is extremely unlikely. Let me know if you meant something else.
Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Maybe I’m not understanding your proposal, but on the face of it this seems like a change of topic. I don’t see Eliezer claiming ‘there’s no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head’. Maybe he does think that, but mostly I’d guess he doesn’t care, because the important thing is whether you can point the AGI at very, very specific real-world tasks.
Where is the accident? Did evolution accidentally find a way to reliably orient people towards the real world? Do people each, individually, accidentally learn to care about the real world?
Same objection/confusion here, except now I’m also a bit confused about what you mean by “orient people towards the real world”. Your previous language made it sound like you were talking about causing the optimizer’s goals to point at things in the real world, but now your language makes it sound like you’re talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.
Or, in summary, I’m not seeing the connection between:
“Terminally valuing anything physical at all” vs. “terminally valuing very specific physical things”.
“Terminally valuing anything physical at all” vs. “instrumentally valuing anything physical at all”.
“Terminally valuing very specific physical things” vs. “instrumentally valuing very specific physical things”.
Any of the above vs. “modeling / thinking about physical things at all”, or “modeling / thinking about very specific physical things”.
Hm, I’ll give this another stab. I understand the first part of your comment as “sure, it’s possible for minds to care about reality, but we don’t know how to target value formation so that the mind cares about a particular part of reality.” Is this a good summary?
I don’t see Eliezer claiming ‘there’s no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head’.
Let me distinguish three alignment feats:
Producing a mind which terminally values sensory entities.
Producing a mind which reliably terminally values some kind of non-sensory entity in the world, like dogs or bananas.
AFAIK we have no idea how to ensure this happens reliably—to produce an AGI which terminally values some element of {diamonds, dogs, cats, tree branches, other real-world objects}, such that there’s a low probability that the AGI actually just cares about high-reward sensory observations.
In other words: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals.
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds.
Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
We (alignment researchers) have had no idea how to actually build a mind which intrinsically (not instrumentally!) values a latent, non-sensory object in the real world. Witness the confusion on this point in Arbital’s ontology identification article.
To my knowledge, we still haven’t solved this problem. We have no reward function to give AIXI which makes AIXI maximize real-world diamonds. A deep learning agent might learn to care about the real world, yes, but it might learn sensory preferences instead. Ignorance about the outcome is not a mechanistic account of why the agent convergently will care about specific real-world objects instead of its sensory feedback signals.
Under this account, caring about the real world is just one particular outcome among many. Hence, the “classic paradigms” imply that real-world caring is (relatively) improbable.
While we have stories about entities which value paperclips, I do not think we have known how to design them. Nor have we had any mechanistic explanation for why people care about the real world in particular.
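To make the gap described above concrete, here is a minimal sketch (the environment, field names, and the tampering case are all invented for illustration): the reward functions we know how to write are functions of the agent’s observations, while the thing we want valued is a latent feature of the environment the agent never receives directly.

```python
from dataclasses import dataclass

@dataclass
class EnvState:
    diamonds: int               # latent fact about the world
    camera_shows_diamond: bool  # what the agent's sensor reports

def observation(state: EnvState) -> dict:
    # The training process only ever gets to look at sensor readings.
    return {"camera_shows_diamond": state.camera_shows_diamond}

def sensory_reward(obs: dict) -> float:
    # The kind of reward function we actually know how to implement.
    return 1.0 if obs["camera_shows_diamond"] else 0.0

def intended_utility(state: EnvState) -> float:
    # What we wish were being optimized: a function of latent state,
    # which we cannot hand to the training process directly.
    return float(state.diamonds)

# A sensor-tampering case: the camera is fooled, no diamonds exist.
tampered = EnvState(diamonds=0, camera_shows_diamond=True)
print(sensory_reward(observation(tampered)))  # 1.0 -> gets reinforced
print(intended_utility(tampered))             # 0.0 -> not what we wanted
```

An agent that ends up caring about the sensory signal and one that ends up caring about the latent diamonds behave identically on-distribution, which is why mere ignorance about which one you got is not reassuring.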
As you point out, we obviously need to figure problem 3 out in order to usefully align an AGI. I will now argue that the genome solves problem 3, albeit not in the sense of aligning humans with inclusive genetic fitness (you can forget about human/evolution alignment, I won’t be discussing that in this comment).
The genome solves problem #3 in the sense of: if a child grows up with a dog, then that child will (with high probability) terminally value that dog.
Isn’t that an amazing alignment feat!?
Therefore, there has to be a reliable method of initializing a mind from scratch, training it, and having the resultant intelligence care about dogs. Not only does it exist in principle, it succeeds in practice, and we can think about what that method might be. I think this method isn’t some uber-complicated alignment solution. The shard theory explanation for dog-value formation is quite simple.
now your language makes it sound like you’re talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.
Nope, wasn’t meaning any of these! I was talking about “causing the optimizer’s goals to point at things in the real world” the whole time.
I understand the first part of your comment as “sure, it’s possible for minds to care about reality, but we don’t know how to target value formation so that the mind cares about a particular part of reality.” Is this a good summary?
Yes!
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
True! Though everyone already agreed (e.g., EY asserted this in the OP) that it’s possible in principle. The updatey thing would be if the case of the human genome / brain development suggests it’s more tractable than we otherwise would have thought (in AI).
Seems to me like it’s at least a small update about tractability, though I’m not sure it’s a big one? Would be interesting to think about the level of agreement between different individual humans with regard to ‘how much particular external-world things matter’. Especially interesting would be cases where humans consistently, robustly care about a particular external-world thingie even though it doesn’t have a simple sensory correlate.
(E.g., humans developing to care about sex is less promising insofar as it depends on sensory-level reinforcement such as orgasms. Humans developing to care about ‘not being in the Matrix / not being in an experience machine’ is possibly more promising, because it seems like a pretty common preference that doesn’t get directly shaped by sensory rewards.)
3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds
Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer, and I think being able to build a diamond maximizer would also suggest the strawberry-grade alignment problem is mostly solved.)
But maybe I’m misunderstanding 2.
Nope, wasn’t meaning any of these! I was talking about “causing the optimizer’s goals to point at things in the real world” the whole time.
Cool!
I’ll look more at your shards document and think about your arguments here. :)
Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
Feat #2 is: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.
Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer
I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad[1] in the shard theory paradigm.
Surprisingly, I weakly suspect the harder part is getting the agent to maximize real-world dogs in expectation, not getting the agent to care about real-world dogs. I think “figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs” is significantly easier than building a dog-maximizer.
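A toy rendering of this maximize-versus-care-about distinction (the plan strings and scores below are invented, and real plan evaluation would obviously not be a lookup table):

```python
from typing import Callable, Sequence

def dog_maximizer(plans: Sequence[str],
                  expected_dogs: Callable[[str], float]) -> str:
    # Global argmax: picks whichever plan scores highest on expected dog count,
    # no matter how extreme that plan is.
    return max(plans, key=expected_dogs)

def dog_valuing_agent(plans: Sequence[str],
                      expected_dogs: Callable[[str], float],
                      other_considerations: Callable[[str], float]) -> str:
    # Dogs are one strong consideration among several, so the agent still
    # reliably picks dog-producing plans without being a pure maximizer.
    return max(plans, key=lambda p: expected_dogs(p) + other_considerations(p))

plans = ["adopt two dogs", "convert the biosphere into dog-breeding factories"]
dogs = {"adopt two dogs": 2.0,
        "convert the biosphere into dog-breeding factories": 1e9}
rest = {"adopt two dogs": 0.0,
        "convert the biosphere into dog-breeding factories": -1e12}
print(dog_maximizer(plans, dogs.get))                 # the extreme plan wins
print(dog_valuing_agent(plans, dogs.get, rest.get))   # the mild plan wins
```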
Why is the process by which humans come to reliably care about the real world
IMO this process seems pretty unreliable and fragile. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics.
But also humans have a much harder time ‘optimizing against themselves’ than AIs will, I think. I don’t have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.
One of the problems with English is that it doesn’t natively support orders of magnitude for “unreliable.” Do you mean “unreliable” as in “between 1% and 50% of people end up with part of their values not related to objects-in-reality”, or as in “there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process”? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings.
EDIT: My reader-model is flagging this whole comment as pedagogically inadequate, so I’ll point to the second half of section 5 in my shard theory document.
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Humans came to their goals while being trained by evolution on inclusive genetic fitness, but they don’t explicitly optimize for that. They “optimize” for something pretty random, that looks like inclusive genetic fitness in the training environment but then in this weird modern out-of-sample environment looks completely different. We can definitely train an AI to care about the real world, but his point is that, by doing something analogous to what happened with humans, we will end up with some completely different inner goal than the goal we’re training for, as happened with humans.
I’m not talking about running evolution again, that is not what I meant by “the process by which humans come to reliably care about the real world.” The human genome must specify machinery which reliably grows a mind which cares about reality. I’m asking why we can’t use the alignment paradigm leveraged by that machinery, which is empirically successful at pointing people’s values to certain kinds of real-world objects.
Well, for starters, because if the history of ML is anything to go by, we’re gonna be designing the thing analogous to evolution, and not the brain. We don’t pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm. That meta-learning algorithm is going to be what decides what goes in the DNA, so in order to get the DNA right, we will need to get the meta-learning algorithm correct. Evolution doesn’t have much to teach us about that except as a negative example.
we’re gonna be designing the thing analogous to evolution, and not the brain. We don’t pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm.
But, ah, the genome also doesn’t “pick the actual weights” for the human brain which it later grows. So whatever the brain does to align people to care about latent real-world objects, I strongly believe that that process must be compatible with blank-slate initialization and then learning.
That meta-learning algorithm is going to be what decides what goes in the DNA, not some human architect.
In the evolution/mainstream-ML analogy, we humans are specifying the DNA, not the search process over DNA specifications. We specify the learning architecture, and then the learning process fills in the rest.
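The two levels at issue can be written down as a sketch (all names, the "learning_rate" field, and the stub scoring function are invented stand-ins): an outer loop that searches over genome-like specifications, and an inner loop that takes one specification and fills in the learned parameters. The disagreement is about which of the two the AI designer is actually writing.

```python
import random

def lifetime_learning(genome: dict) -> float:
    """Inner loop ("the brain growing and learning"): given a fixed genome-like
    spec (architecture choices, reward wiring, hyperparameters), fill in the
    learned parameters and return how well the result does. Stubbed out here;
    only the two-level structure matters for the analogy."""
    return -abs(genome["learning_rate"] - 3e-4)  # pretend some specs learn better

# Outer loop ("evolution"): a search over genome-like specifications.
candidate_genomes = [{"learning_rate": 10 ** random.uniform(-5, -2)}
                     for _ in range(50)]
best_genome = max(candidate_genomes, key=lifetime_learning)
print(best_genome)
```

On the reading in the comment above this one, the designer writes the outer search; on the reading in this reply, the designer writes the contents of the genome-like spec directly and lets within-lifetime learning fill in the rest.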
I confess that I already have a somewhat sharp picture of the alignment paradigm used by the brain, that I already have concrete reasons to believe it’s miles better than anything we have dreamed so far. I was originally querying what Eliezer thinks about the “genome->human alignment properties” situation, rather than expressing innocent ignorance of how any of this works.
I think I disagree with you, but I don’t really understand what you’re saying or how these analogies are being used to point to the real world anymore. It seems to me like you might be taking something that makes the problem of “learning from evolution” even more complicated (evolution → protein → something → brain vs. evolution → protein → brain) and using that to argue the issues are solved, in the same vein as the “just don’t use a value function” people. But I haven’t read shard theory, so, GL.
In the evolution/mainstream-ML analogy, we humans are specifying the DNA, not the search process over DNA specifications.
You mean, we are specifying the ATCG strands, or we are specifying the “architecture” behind how DNA influences the development of the human body? It seems to me like, in this analogy, we are definitely also choosing how the search for the correct ATCG strands proceeds and how they’re identified. The DNA doesn’t “align” new babies out of the womb, it’s just a specification of how to copy the existing, already “”“aligned””” code.
“learning from evolution” even more complicated (evolution → protein → something → brain vs. evolution → protein → brain)
ah, no, this isn’t what I’m saying. Hm. Let me try again.
The following is not a handwavy analogy, it is something which actually happened:
Evolution found the human genome.
The human genome specifies the human brain.
The human brain learns most of its values and knowledge over time.
Human brains reliably learn to care about certain classes of real-world objects like dogs.
Therefore, somewhere in the “genome → brain → (learning) → values” process, there must be a process which reliably produces values over real-world objects. Shard theory aims to explain this process. The shard-theoretic explanation is actually pretty simple.
Furthermore, we don’t have to rerun evolution to access this alignment process. For the sake of engaging with my points, please forget completely about running evolution. I will never suggest rerunning evolution, because it’s unwise and irrelevant to my present points. I also currently don’t see why the genome’s alignment process requires more than crude hard-coded reward circuitry, reinforcement learning, and self-supervised predictive learning.
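For concreteness, here is a minimal sketch of the kind of three-ingredient loop the last sentence gestures at. Everything in it (the shapes, the toy hard-coded reward, the architecture) is an invented stand-in, not a claim about the brain or about shard theory’s actual mechanics: a crude hard-coded reward circuit, a REINFORCE-style reinforcement-learning update, and a self-supervised next-observation prediction loss sharing one encoder.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

OBS_DIM, HIDDEN, N_ACTIONS = 16, 64, 4

# Shared encoder plus two heads: a policy head and a next-observation predictor.
encoder = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU())
policy_head = nn.Linear(HIDDEN, N_ACTIONS)
predictor_head = nn.Linear(HIDDEN, OBS_DIM)
params = (list(encoder.parameters()) + list(policy_head.parameters())
          + list(predictor_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

def hardcoded_reward(obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    # Crude, genome-style reward circuitry: a shallow sensory trigger
    # (here: "feature 0 is positive and action 0 was taken").
    return ((obs[:, 0] > 0) & (action == 0)).float()

for _ in range(200):
    obs = torch.randn(32, OBS_DIM)       # stand-in for a batch of experience
    next_obs = torch.randn(32, OBS_DIM)  # stand-in for what happened next

    features = encoder(obs)
    dist = Categorical(logits=policy_head(features))
    action = dist.sample()

    # Ingredients 1 + 2: hard-coded reward driving a REINFORCE-style RL update.
    rl_loss = -(hardcoded_reward(obs, action) * dist.log_prob(action)).mean()
    # Ingredient 3: self-supervised predictive learning on the same encoder.
    prediction_loss = ((predictor_head(features) - next_obs) ** 2).mean()

    optimizer.zero_grad()
    (rl_loss + prediction_loss).backward()
    optimizer.step()
```

Whether a network trained this way ends up caring about anything in particular is exactly the open question in this thread; the sketch only shows that the claimed ingredient list is short.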
That does seem worth looking at, and there are probably ideas worth stealing from biology. I’m not sure you can call that a robustly aligned system that’s getting bootstrapped, though. Existing in a society of (roughly) peers, with no huge power disparity between any given person and the rest of humanity, is analogous to the AGI that can’t take over the world yet. Humans that acquire significant power do not seem aligned wrt what a typical person would profess to, and outwardly seem to, care about.
I think your point still mostly follows despite that; even when humans can be deceptive and power seeking, there’s an astounding amount of regularity in what we end up caring about.
Humans can, to some extent, be pointed to complicated external things. This suggests that using natural selection on biology can get you mesa-optimizers that can be pointed to particular externally specifiable complicated things. Doesn’t prove it (or, doesn’t prove you can do it again), but you only asked for a suggestion.
I don’t think I understand what, exactly, is being discussed. Are “dogs” or “flowers” or “people you meet face-to-face” examples of “complicated external things”?
Right, but the goal is to make AGI you can point at things, not to make AGI you can point at things using some particular technique.
(Tangentially, I also think the jury is still out on whether humans are bad fitness maximizers, and if we’re ultimately particularly good at it—e.g. let’s say, barring AGI disaster, we’d eventually colonise the galaxy—that probably means AGI alignment is harder, not easier)
To my eye, this seems like it mostly establishes ‘it’s not impossible in principle for an optimizer to have a goal that relates to the physical world’. But we had no reason to doubt this in the first place, and it doesn’t give us a way to reliably pick in advance which physical things the optimizer cares about. “It’s not impossible” is a given for basically everything in AI, in principle, if you have arbitrary amounts of time and arbitrarily deep understanding.
As I said (a few times!) in the discussion about orthogonality, indifference about the measure of “agents” that have particular properties seems crazy to me. Having an example of “agents” that behave in a particular way is enormously different from having an unproven claim that such agents might be mathematically possible.
I think this is correct. Shard theory is intended as an account of how inner misalignment produces human values. I also think that human values aren’t as complex or weird as they introspectively appear.
Humans haven’t been optimized to pursue inclusive genetic fitness for very long, because humans haven’t been around for very long. Instead they inherited the crude heuristics pointing towards inclusive genetic fitness from their cognitively much less sophisticated predecessors. And those still kinda work!
If we are still around in a couple of million years I wouldn’t be surprised if there was inner alignment in the sense that almost all humans in almost all practically encountered environments end up consciously optimising inclusive genetic fitness.
Generally, I think that people draw the wrong conclusions from mesa-optimisers and the examples of human evolutionary alignment.
Saying that we would like to solve alignment by specifying exactly what we want and then letting the AI learn exactly what we want is like saying that we would like to solve transportation by inventing teleportation. Yeah, that would be nice, but unfortunately it seems like you will have to move through space instead.
The conclusion we should take from the concept of mesa-optimisation isn’t “oh no alignment is impossible”, that’s equivalent to “oh no learning is impossible”. But learning is possible. So the correct conclusion is “alignment has to work via mesa-optimisation”.
Because alignment in the human examples (i.e. humans’ alignment to evolution’s objective and humans’ alignment to human values) works by bootstrapping from incredibly crude heuristics. Think three dark patches for a face.
Humans are mesa-optimized to adhere to human values. If we were actually inner aligned to the crude heuristics that evolution installed in us for bootstrapping the entire process, we would be totally dysfunctional weirdos.
I mean even more so …
To me the human examples suggest that there has to be a possibility to get from gesturing at what we want to getting what we want. And I think we can gesture a lot better than evolution! Well, at least using much more information than 3.2 billion base pairs.
If alignment has to be a bootstrapped, open-ended learning process, there is also the possibility that it will work better with more intelligent systems, or really only start working with fairly intelligent systems.
Maybe bootstrapping with cake, kittens and cuddles will still get us paperclipped, I don’t know. It certainly seems awfully easy to just run straight off a cliff. But I think looking at the only known examples of alignment of intelligences does allow us more optimistic takes than are prevalent on this page.
The OP isn’t claiming that alignment is impossible.
I don’t understand the point you’re making here.
The point I’m making is that the human example tells us that:
First we realize that we can’t code up our values, and therefore that alignment is hard. Then, when we realize that mesa-optimisation is a thing, we shouldn’t update towards “alignment is even harder”. We should update in the opposite direction.
Because the human example tells us that a mesa-optimiser can reliably point to a complex thing even if the optimiser points to only a few crude things.
But I only ever see these three points (the human example, the inability to code up values, mesa-optimisation) used separately to argue for “alignment is even harder than previously thought”. Taken together, that is just not the picture.
Humans point to some complicated things, but not via a process that suggests an analogous way to use natural selection or gradient descent to make a mesa-optimizer point to particular externally specifiable complicated things.
Why do you think that? Why is the process by which humans come to reliably care about the real world, not a process we could leverage analogously to make AIs care about the real world?
Likewise, when you wrote,
Where is the accident? Did evolution accidentally find a way to reliably orient terminal human values towards the real world? Do people each, individually, accidentally learn to terminally care about the real world? Because the former implies the existence of a better alignment paradigm (that which occurs within the human brain, to take an empty-slate human and grow them into an intelligence which terminally cares about objects in reality), and the latter is extremely unlikely. Let me know if you meant something else.
EDIT: Updated a few confusing words.
Maybe I’m not understanding your proposal, but on the face of it this seems like a change of topic. I don’t see Eliezer claiming ‘there’s no way to make the AGI care about the real world vs. caring about (say) internal experiences in its own head’. Maybe he does think that, but mostly I’d guess he doesn’t care, because the important thing is whether you can point the AGI at very, very specific real-world tasks.
Same objection/confusion here, except now I’m also a bit confused about what you mean by “orient people towards the real world”. Your previous language made it sound like you were talking about causing the optimizer’s goals to point at things in the real world, but now your language makes it sound like you’re talking about causing the optimizer to model the real world or causing the optimizer to instrumentally care about the state of the real world....? Those all seem very different to me.
Or, in summary, I’m not seeing the connection between:
“Terminally valuing anything physical at all” vs. “terminally valuing very specific physical things”.
“Terminally valuing anything physical at all” vs. “instrumentally valuing anything physical at all”.
“Terminally valuing very specific physical things” vs. “instrumentally valuing very specific physical things”.
Any of the above vs. “modeling / thinking about physical things at all”, or “modeling / thinking about very specific physical things”.
Hm, I’ll give this another stab. I understand the first part of your comment as “sure, it’s possible for minds to care about reality, but we don’t know how to target value formation so that the mind cares about a particular part of reality.” Is this a good summary?
Let me distinguish three alignment feats:
Producing a mind which terminally values sensory entities.
Producing a mind which reliably terminally values some kind of non-sensory entity in the world, like dogs or bananas.
AFAIK we have no idea how to ensure this happens reliably—to produce an AGI which terminally values some element of {diamonds, dogs, cats, tree branches, other real-world objects}, such that there’s a low probability that the AGI actually just cares about high-reward sensory observations.
In other words: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals.
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds.
Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
This is what you point out as a potential crux.
(EDIT: Added a few sub-points to clarify list.)
From my shard theory document:
As you point out, we obviously need to figure problem 3 out in order to usefully align an AGI. I will now argue that the genome solves problem 3, albeit not in the sense of aligning humans with inclusive genetic fitness (you can forget about human/evolution alignment, I won’t be discussing that in this comment).
The genome solves problem #3 in the sense of: if a child grows up with a dog, then that child will (with high probability) terminally value that dog.
Isn’t that an amazing alignment feat!?
Therefore, there has to be a reliable method of initializing a mind from scratch, training it, and having the resultant intelligence care about dogs. Not only does it exist in principle, it succeeds in practice, and we can think about what that method might be. I think this method isn’t some uber-complicated alignment solution. The shard theory explanation for dog-value formation is quite simple.
Nope, wasn’t meaning any of these! I was talking about “causing the optimizer’s goals to point at things in the real world” the whole time.
Yes!
True! Though everyone already agreed (e.g., EY asserted this in the OP) that it’s possible in principle. The updatey thing would be if the case of the human genome / brain development suggests it’s more tractable than we otherwise would have thought (in AI).
Seems to me like it’s at least a small update about tractability, though I’m not sure it’s a big one? Would be interesting to think about the level of agreement between different individual humans with regard to ‘how much particular external-world things matter’. Especially interesting would be cases where humans consistently, robustly care about a particular external-world thingie even though it doesn’t have a simple sensory correlate.
(E.g., humans developing to care about sex is less promising insofar as it depends on sensory-level reinforcement such as orgasms. Humans developing to care about ‘not being in the Matrix / not being in an experience machine’ is possibly more promising, because it seems like a pretty common preference that doesn’t get directly shaped by sensory rewards.)
Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer, and I think being able to build a diamond maximizer would also suggest the strawberry-grade alignment problem is mostly solved.)
But maybe I’m misunderstanding 2.
Cool!
I’ll look more at your shards document and think about your arguments here. :)
Feat #2 is: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.
Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad[1] in the shard theory paradigm.
Surprisingly, I weakly suspect the harder part is getting the agent to maximize real-world dogs in expectation, not getting the agent to care about real-world dogs. I think “figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs” is significantly easier than building a dog-maximizer.
I appreciate that this claim is hard to swallow. In any case, I want to focus on inferentially-closer questions first, like how human values form.
IMO this process seems pretty unreliable and fragile. Drugs are popular; video games are popular; people-in-aggregate put more effort into obtaining imaginary afterlives than life extension or cryonics.
But also humans have a much harder time ‘optimizing against themselves’ than AIs will, I think. I don’t have a great mechanistic sense of what it will look like for an AI to reliably care about the real world.
One of the problems with English is that it doesn’t natively support orders of magnitude for “unreliable.” Do you mean “unreliable” as in “between 1% and 50% of people end up with part of their values not related to objects-in-reality”, or as in “there is no a priori reason why anyone would ever care about anything not directly sensorially observable, except as a fluke of their training process”? Because the latter is what current alignment paradigms mispredict, and the former might be a reasonable claim about what really happens for human beings.
EDIT: My reader-model is flagging this whole comment as pedagogically inadequate, so I’ll point to the second half of section 5 in my shard theory document.
Humans came to their goals while being trained by evolution on inclusive genetic fitness, but they don’t explicitly optimize for that. They “optimize” for something pretty random, that looks like inclusive genetic fitness in the training environment but then in this weird modern out-of-sample environment looks completely different. We can definitely train an AI to care about the real world, but his point is that, by doing something analogous to what happened with humans, we will end up with some completely different inner goal than the goal we’re training for, as happened with humans.
I’m not talking about running evolution again, that is not what I meant by “the process by which humans come to reliably care about the real world.” The human genome must specify machinery which reliably grows a mind which cares about reality. I’m asking why we can’t use the alignment paradigm leveraged by that machinery, which is empirically successful at pointing people’s values to certain kinds of real-world objects.
Ah, I misunderstood.
Well, for starters, because if the history of ML is anything to go by, we’re gonna be designing the thing analogous to evolution, and not the brain. We don’t pick the actual weights in these transformers, we just design the architecture and then run stochastic gradient descent or some other meta-learning algorithm. That meta-learning algorithm is going to be what decides what goes in the DNA, so in order to get the DNA right, we will need to get the meta-learning algorithm correct. Evolution doesn’t have much to teach us about that except as a negative example.
But (I think) the answer is similar to this:
But, ah, the genome also doesn’t “pick the actual weights” for the human brain which it later grows. So whatever the brain does to align people to care about latent real-world objects, I strongly believe that that process must be compatible with blank-slate initialization and then learning.
In the evolution/mainstream-ML analogy, we humans are specifying the DNA, not the search process over DNA specifications. We specify the learning architecture, and then the learning process fills in the rest.
I confess that I already have a somewhat sharp picture of the alignment paradigm used by the brain, that I already have concrete reasons to believe it’s miles better than anything we have dreamed so far. I was originally querying what Eliezer thinks about the “genome->human alignment properties” situation, rather than expressing innocent ignorance of how any of this works.
I think I disagree with you, but I don’t really understand what you’re saying or how these analogies are being used to point to the real world anymore. It seems to me like you might be taking something that makes the problem of “learning from evolution” even more complicated (evolution → protein → something → brain vs. evolution → protein → brain) and using that to argue the issues are solved, in the same vein as the “just don’t use a value function” people. But I haven’t read shard theory, so, GL.
You mean, we are specifying the ATCG strands, or we are specifying the “architecture” behind how DNA influences the development of the human body? It seems to me like, in this analogy, we are definitely also choosing how the search for the correct ATCG strands proceeds and how they’re identified. The DNA doesn’t “align” new babies out of the womb, it’s just a specification of how to copy the existing, already “”“aligned””” code.
ah, no, this isn’t what I’m saying. Hm. Let me try again.
The following is not a handwavy analogy, it is something which actually happened:
Evolution found the human genome.
The human genome specifies the human brain.
The human brain learns most of its values and knowledge over time.
Human brains reliably learn to care about certain classes of real-world objects like dogs.
Therefore, somewhere in the “genome → brain → (learning) → values” process, there must be a process which reliably produces values over real-world objects. Shard theory aims to explain this process. The shard-theoretic explanation is actually pretty simple.
Furthermore, we don’t have to rerun evolution to access this alignment process. For the sake of engaging with my points, please forget completely about running evolution. I will never suggest rerunning evolution, because it’s unwise and irrelevant to my present points. I also currently don’t see why the genome’s alignment process requires more than crude hard-coded reward circuitry, reinforcement learning, and self-supervised predictive learning.
That does seem worth looking at, and there are probably ideas worth stealing from biology. I’m not sure you can call that a robustly aligned system that’s getting bootstrapped, though. Existing in a society of (roughly) peers, with no huge power disparity between any given person and the rest of humanity, is analogous to the AGI that can’t take over the world yet. Humans that acquire significant power do not seem aligned wrt what a typical person would profess to, and outwardly seem to, care about.
I think your point still mostly follows despite that; even when humans can be deceptive and power seeking, there’s an astounding amount of regularity in what we end up caring about.
Yes, this is my claim. Not that eg >95% of people form values which we would want to form within an AGI.
Humans can, to some extent, be pointed to complicated external things. This suggests that using natural selection on biology can get you mesa-optimizers that can be pointed to particular externally specifiable complicated things. Doesn’t prove it (or, doesn’t prove you can do it again), but you only asked for a suggestion.
Humans can be pointed at complicated external things by other humans on their own cognitive level, not by their lower maker of natural selection.
I don’t think I understand what, exactly, is being discussed. Are “dogs” or “flowers” or “people you meet face-to-face” examples of “complicated external things”?
Right, but the goal is to make AGI you can point at things, not to make AGI you can point at things using some particular technique.
(Tangentially, I also think the jury is still out on whether humans are bad fitness maximizers, and if we’re ultimately particularly good at it—e.g. let’s say, barring AGI disaster, we’d eventually colonise the galaxy—that probably means AGI alignment is harder, not easier)
To my eye, this seems like it mostly establishes ‘it’s not impossible in principle for an optimizer to have a goal that relates to the physical world’. But we had no reason to doubt this in the first place, and it doesn’t give us a way to reliably pick in advance which physical things the optimizer cares about. “It’s not impossible” is a given for basically everything in AI, in principle, if you have arbitrary amounts of time and arbitrarily deep understanding.
As I said (a few times!) in the discussion about orthogonality, indifference about the measure of “agents” that have particular properties seems crazy to me. Having an example of “agents” that behave in a particular way is enormously different from having an unproven claim that such agents might be mathematically possible.
I think this is correct. Shard theory is intended as an account of how inner misalignment produces human values. I also think that human values aren’t as complex or weird as they introspectively appear.